This article describes the analysis of regression-discontinuity designs (RDDs) using the
Path models with observed composites based on multiple items (e.g., the mean or sum score of the items) are commonly used to test interaction effects. Under this practice, researchers generally assume that the observed composites are measured without error. In this study, we reviewed and evaluated two alternative methods within the structural equation modeling (SEM) framework, namely, the reliability-adjusted product indicator (RAPI) method and the latent moderated structural equations (LMS) method, both of which can flexibly account for measurement error. Results showed that both methods generally produced unbiased estimates of the interaction effects. In contrast, the path model, which ignores measurement error, led to substantial bias and a low confidence interval coverage rate for nonzero interaction effects. Other findings and implications for future studies are discussed.
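As a rough orientation only, the product-indicator idea with a reliability adjustment can be sketched in lavaan as below; the data frame dat, the composites x and z, the outcome y, the reliability values, and the product-reliability approximation are all assumptions of this sketch rather than the article's exact RAPI specification.

## Sketch of a reliability-adjusted product-indicator model in lavaan.
## Assumed (not from the article): data frame dat with composites x, z, outcome y,
## and known composite reliabilities rel_x and rel_z.
library(lavaan)

rel_x <- 0.85                                        # hypothetical reliability of x
rel_z <- 0.80                                        # hypothetical reliability of z
dat$x  <- as.numeric(scale(dat$x, scale = FALSE))    # mean-center the composites
dat$z  <- as.numeric(scale(dat$z, scale = FALSE))
dat$xz <- dat$x * dat$z                              # single product indicator

err_x  <- (1 - rel_x) * var(dat$x)                   # error variance implied by rel_x
err_z  <- (1 - rel_z) * var(dat$z)
r_xz   <- cor(dat$x, dat$z)
rel_xz <- (rel_x * rel_z + r_xz^2) / (1 + r_xz^2)    # Bohrnstedt-Marwell-type approximation
err_xz <- (1 - rel_xz) * var(dat$xz)

model <- sprintf('
  LX  =~ 1*x
  LZ  =~ 1*z
  LXZ =~ 1*xz
  x  ~~ %f*x        # fix error variances from the reliabilities
  z  ~~ %f*z
  xz ~~ %f*xz
  y   ~ LX + LZ + LXZ
', err_x, err_z, err_xz)

fit <- sem(model, data = dat)
summary(fit, standardized = TRUE)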
In psychometric practice, the parameter estimates of a standard item response theory (IRT) model can become biased when item-response data (persons' individual responses to test items) contain outliers relative to the model. Manual removal of outliers can be time-consuming and difficult, and removing outliers discards information that could otherwise be used in parameter estimation. To address these concerns, a Bayesian IRT model is proposed and illustrated that includes person and latent item-response outlier parameters, in addition to person ability and item parameters, and that is defined by item characteristic curves (ICCs), each specified by a robust Student's t-distribution function. The outlier parameters and the robust ICCs enable the model to identify item-response outliers automatically and to make the estimates of person ability and item parameters more robust to outliers. Hence, under this IRT model, it is unnecessary to remove outliers from the data analysis. The model is illustrated through the analysis of two data sets, involving dichotomous- and polytomous-response items, respectively.
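For orientation, a robust ICC of the kind described can be written with the Student's t cumulative distribution function in place of the usual normal ogive; the parameterization below is a generic sketch in our notation and omits the article's person and item-response outlier parameters.

\[
P(Y_{ij} = 1 \mid \theta_i) = F_{\nu}\bigl(a_j(\theta_i - b_j)\bigr),
\qquad
F_{\nu}(x) = \int_{-\infty}^{x}
\frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}
\left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}} dt,
\]

where \theta_i is person ability, a_j and b_j are item discrimination and location, and a small degrees-of-freedom parameter \nu gives the curve heavier tails, so that outlying responses exert less influence on the ability and item parameter estimates.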
Test items scored polytomously have the potential to display multidimensionality across rating scale score categories. This article uses a multidimensional nominal response model (MNRM) to examine the possibility that the proficiency dimension, or dimensional composite, best measured by a polytomously scored item may vary by score category, an issue not generally considered in multidimensional item response theory (MIRT). Some practical considerations in exploring rubric-related multidimensionality, including potential consequences of not attending to it, are illustrated through simulation examples. A real-data application examining item format effects is presented using the 2007 administration of the Trends in International Mathematics and Science Study (TIMSS) among eighth graders in the United States.
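In generic notation (ours, not necessarily the article's), the MNRM gives each score category of an item its own slope vector, which is what allows the best-measured dimension or composite to differ across categories:

\[
P(X_{ij} = k \mid \boldsymbol{\theta}_i)
= \frac{\exp\!\left(\mathbf{a}_{jk}^{\top}\boldsymbol{\theta}_i + c_{jk}\right)}
       {\sum_{m=0}^{K_j - 1}\exp\!\left(\mathbf{a}_{jm}^{\top}\boldsymbol{\theta}_i + c_{jm}\right)},
\]

where \mathbf{a}_{jk} and c_{jk} are the slope vector and intercept for category k of item j; the direction of \mathbf{a}_{jk} in the latent space determines which dimensional composite that category measures best.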
Cluster randomized trials involving participants nested within intact treatment and control groups are commonly performed in educational, psychological, and biomedical studies. However, recruiting and retaining intact groups present various practical, financial, and logistical challenges to evaluators, and cluster randomized trials are often performed with a small number of clusters (around 20 groups). Although multilevel models are often used to analyze nested data, researchers may be concerned about potentially biased results when only a few groups are under study. Cluster bootstrapping has been suggested as an alternative procedure for analyzing clustered data, although it has seen very little use in educational and psychological studies. Using a Monte Carlo simulation that varied the number of clusters, the average cluster size, and the intraclass correlation, we compared standard errors from cluster bootstrapping with those derived from ordinary least squares regression and multilevel models. Results indicate that cluster bootstrapping, though more computationally demanding, can be used as an alternative procedure for the analysis of clustered data when treatment effects at the group level are of primary interest. Supplementary material showing how to perform cluster bootstrapped regressions in R is also provided.
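The supplementary material itself is not reproduced here, but the basic logic of a cluster (case) bootstrap can be sketched in a few lines of R; the data frame dat and its columns y, trt, and cid are placeholders assumed for illustration.

## Sketch of a cluster bootstrap for a group-level treatment effect.
## Assumed: data frame dat with outcome y, treatment indicator trt
## (constant within cluster), and cluster identifier cid.
set.seed(2024)

cluster_boot_se <- function(dat, B = 2000) {
  ids <- unique(dat$cid)
  boot_coefs <- matrix(NA_real_, nrow = B, ncol = 2,
                       dimnames = list(NULL, c("(Intercept)", "trt")))
  for (b in seq_len(B)) {
    sampled <- sample(ids, length(ids), replace = TRUE)        # resample whole clusters
    boot_dat <- do.call(rbind, lapply(sampled, function(id) dat[dat$cid == id, ]))
    boot_coefs[b, ] <- coef(lm(y ~ trt, data = boot_dat))      # refit on the resampled data
  }
  apply(boot_coefs, 2, sd)                                     # bootstrap standard errors
}

cluster_boot_se(dat)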
This article examines the interdependency of two context effects that are known to occur regularly in large-scale assessments: item position effects and effects of test-taking effort on the probability of correctly answering an item. A microlongitudinal design was used to measure test-taking effort over the course of a 60-minute large-scale assessment. Two components of test-taking effort were investigated: initial effort and change in effort. Both components significantly affected the probability of solving an item. In addition, participants' current test-taking effort diminished considerably over the course of the test. Furthermore, a substantial linear position effect was found, indicating that item difficulty increased during the test. This position effect varied considerably across persons. Concerning the interplay of position effects and test-taking effort, only the change in effort moderated the position effect, and persons differed with respect to this moderation effect. The consequences of these results for the reliability and validity of large-scale assessments are discussed.
This article outlines a procedure for examining the degree to which a common factor may dominate additional factors in a multicomponent measuring instrument consisting of binary items. The procedure rests on an application of latent variable modeling methodology and accounts for the discrete nature of the manifest indicators. The method provides point and interval estimates of (a) the proportion of the variance explained by all factors that is due to the common (global) factor and (b) the proportion of the variance explained by all factors that is due to some or all of the other (local) factors. The approach can also be readily used as a means of assessing approximate unidimensionality when deciding between unidimensional and multidimensional item response modeling. The procedure is similarly applicable to highly discrete (e.g., Likert-type) ordinal items and is illustrated with a numerical example.
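Although the exact estimands are defined within the article's latent variable modeling framework, their flavor can be conveyed with a bifactor-style decomposition of standardized loadings (a sketch in our notation):

\[
\pi_{\text{global}}
= \frac{\sum_{j}\lambda_{jG}^{2}}
       {\sum_{j}\lambda_{jG}^{2} + \sum_{s}\sum_{j}\lambda_{js}^{2}},
\qquad
\pi_{\text{local}} = 1 - \pi_{\text{global}},
\]

where \lambda_{jG} is item j's standardized loading on the common (global) factor and \lambda_{js} is its loading on local factor s; values of \pi_{\text{global}} near 1 indicate that the global factor dominates, supporting approximate unidimensionality.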
The Model With Internal Restrictions on Item Difficulty (MIRID; Butter, 1994) has been useful for investigating cognitive behavior in terms of the processes that lead to that behavior. The main objective of the MIRID model is to enable one to test how component processes influence complex cognitive behavior in terms of the item parameters. The original MIRID model is, however, fairly restrictive for a number of reasons. One of these restrictions is that the model treats items as fixed and so does not fit measurement contexts in which items need to be treated as random. In this article, random item approaches to the MIRID model are proposed, and both simulation and empirical studies are conducted to test and illustrate the random item MIRID models. The simulation and empirical studies show that the random item MIRID models provide more accurate estimates when substantial random errors exist, and thus these models may be preferable in such settings.
The assumption of local independence is central to all item response theory (IRT) models. Violations can lead to inflated estimates of reliability and problems with construct validity. For Q3, the most widely used fit statistic for detecting local dependence (LD), there are currently no well-documented guidelines for the critical values that should be used, and as a result a variety of arbitrary rules of thumb are applied. In this study, an empirical data example and a Monte Carlo simulation were used to investigate the factors that influence the null distribution of residual correlations, with the objective of proposing guidelines that researchers and practitioners can follow when making decisions about LD during scale development and validation. We recommend that a parametric bootstrapping procedure be implemented in each situation to obtain the critical value of LD applicable to the data set at hand, and we provide example critical values for a number of data structures. The results show that for the Q3 statistic no single critical value is appropriate for all situations, as the percentiles of the empirical null distribution are influenced by the number of items, the sample size, and the number of response categories. Furthermore, the results show that LD should be considered relative to the average observed residual correlation, rather than relative to a uniform value, as this yields more stable percentiles for the null distribution of an adjusted fit statistic.
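A parametric bootstrap of this kind can be sketched in R; the use of the mirt package, the Rasch item type, and the focus on the maximum Q3 value are assumptions of this illustration, not prescriptions from the article.

## Parametric bootstrap of the null distribution of Q3 for dichotomous responses.
## Assumed: response matrix resp and the mirt package; B and alpha are illustrative.
library(mirt)

q3_critical <- function(resp, B = 200, itemtype = "Rasch", alpha = 0.05) {
  fit  <- mirt(resp, 1, itemtype = itemtype, verbose = FALSE)
  pars <- coef(fit, simplify = TRUE)$items
  max_q3 <- numeric(B)
  for (b in seq_len(B)) {
    ## simulate from the fitted model, so local independence holds by construction
    sim <- simdata(a = as.matrix(pars[, "a1"]), d = as.matrix(pars[, "d"]),
                   N = nrow(resp), itemtype = "dich")
    sim_fit <- mirt(sim, 1, itemtype = itemtype, verbose = FALSE)
    q3 <- residuals(sim_fit, type = "Q3")
    max_q3[b] <- max(q3[lower.tri(q3)])                # largest residual correlation
  }
  quantile(max_q3, 1 - alpha)                          # data-specific critical value
}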
The correlated trait–correlated method (CTCM) model for the analysis of multitrait–multimethod (MTMM) data is known to suffer from convergence and admissibility (C&A) problems. We describe a little-known and seldom-applied reparameterized version of this model (CTCM-R) based on Rindskopf's reparameterization of the simpler confirmatory factor analysis model. In a Monte Carlo study, we compare the CTCM, CTCM-R, and correlated trait–correlated uniqueness (CTCU) models in terms of C&A, model fit, and parameter estimation bias. The CTCM-R model largely avoided the C&A problems associated with the more traditional CTCM model, producing C&A solutions nearly as often as the CTCU model while also avoiding the parameter estimation biases known to plague the CTCU model. As such, the CTCM-R model is an attractive alternative for the analysis of MTMM data.
Researchers designing multisite and cluster randomized trials of educational interventions will usually conduct a power analysis in the planning stage of the study. To conduct the power analysis, researchers often use estimates of intracluster correlation coefficients and effect sizes derived from an analysis of survey data. When there is heterogeneity in treatment effects across the clusters in the study, these parameters will need to be adjusted to produce an accurate power analysis for a hierarchical trial design. The relevant adjustment factors are derived and presented in the current article. The adjustment factors depend upon the covariance between treatment effects and cluster-specific average values of the outcome variable, illustrating the need for better information about this parameter. The results in the article also facilitate understanding of the relative power of multisite and cluster randomized studies conducted on the same population by showing how the parameters necessary to compute power in the two types of designs are related. This is accomplished by relating parameters defined by linear mixed model specifications to parameters defined in terms of potential outcomes.
A first-order latent growth model assesses change in an unobserved construct from a single score and is commonly used across different domains of educational research. However, examining change using a set of multiple response scores (e.g., scale items) affords researchers several methodological benefits not possible when using a single score. A curve of factors (CUFFS) model assesses change in a construct from multiple response scores, but its use in the social sciences has been limited. In this article, we advocate the CUFFS model for analyzing a construct's latent trajectory over time, with an emphasis on applying this model to educational research. First, we present a review of longitudinal factorial invariance, a condition necessary for ensuring that the measured construct is the same across time points. Next, we introduce the CUFFS model, followed by an illustration of testing factorial invariance and fitting a univariate and a bivariate CUFFS model to longitudinal data. To facilitate implementation, we include syntax for specifying these models in the free statistical software R.
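The article supplies its own R syntax; purely as a generic orientation, a univariate curve-of-factors (second-order growth) model with three items measured at three waves might be specified in lavaan roughly as below. The variable names, the particular invariance constraints shown, and the data frame mydata are placeholders, not the article's code.

## Sketch of a univariate CUFFS (second-order growth) model in lavaan.
library(lavaan)

cuffs_model <- '
  # first-order factors, equal loadings across waves (weak invariance)
  eta1 =~ 1*x1_t1 + L2*x2_t1 + L3*x3_t1
  eta2 =~ 1*x1_t2 + L2*x2_t2 + L3*x3_t2
  eta3 =~ 1*x1_t3 + L2*x2_t3 + L3*x3_t3

  # equal item intercepts across waves (strong invariance); marker intercept fixed at 0
  x1_t1 ~ 0*1
  x1_t2 ~ 0*1
  x1_t3 ~ 0*1
  x2_t1 ~ I2*1
  x2_t2 ~ I2*1
  x2_t3 ~ I2*1
  x3_t1 ~ I3*1
  x3_t2 ~ I3*1
  x3_t3 ~ I3*1

  # second-order intercept and slope factors defined on the first-order factors
  i =~ 1*eta1 + 1*eta2 + 1*eta3
  s =~ 0*eta1 + 1*eta2 + 2*eta3

  # first-order factor intercepts fixed to 0 so the growth-factor means are identified
  eta1 ~ 0*1
  eta2 ~ 0*1
  eta3 ~ 0*1
  i ~ 1
  s ~ 1

  # residuals of the same item correlated over time
  x1_t1 ~~ x1_t2 + x1_t3
  x1_t2 ~~ x1_t3
  x2_t1 ~~ x2_t2 + x2_t3
  x2_t2 ~~ x2_t3
  x3_t1 ~~ x3_t2 + x3_t3
  x3_t2 ~~ x3_t3
'

fit <- sem(cuffs_model, data = mydata)    # mydata: hypothetical wide-format data
summary(fit, fit.measures = TRUE)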
Meta-analysis is a statistical technique that allows an analyst to synthesize effect sizes from multiple primary studies. To estimate meta-analysis models, the open-source statistical environment R is quickly becoming a popular choice. The meta-analytic community has contributed to this growth by developing numerous packages specific to meta-analysis. The purpose of this study is to locate all publicly available meta-analytic R packages. We located 63 such packages via a comprehensive online search. To help make their functionality known to the field, we describe each of the packages, recommend applications for researchers interested in using R for meta-analyses, provide a brief tutorial of two meta-analysis packages, and make suggestions for future meta-analytic R package creators.
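By way of illustration (this is not the article's tutorial), a basic random-effects meta-analysis in one widely used package, metafor, looks like the following; the effect sizes and sampling variances are made up.

## Random-effects meta-analysis with metafor; yi are hypothetical standardized
## mean differences and vi their sampling variances.
library(metafor)

dat <- data.frame(
  yi = c(0.32, 0.15, 0.48, 0.07, 0.26),
  vi = c(0.040, 0.025, 0.060, 0.030, 0.045)
)

res <- rma(yi = yi, vi = vi, data = dat, method = "REML")  # REML random-effects model
summary(res)   # pooled estimate, tau^2, and heterogeneity tests
forest(res)    # forest plot of the individual and pooled effects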
Concurrent calibration using anchor items has proven to be an effective alternative to separate calibration and linking for developing large item banks, which are needed to support continuous testing. In principle, anchor-item designs and estimation methods that have proven effective with dominance item response theory (IRT) models, such as the 3PL model, should also lead to accurate parameter recovery with ideal point IRT models, but surprisingly little research has been devoted to this issue. This study therefore had two purposes: (a) to develop software for concurrent calibration with what is now the most widely used ideal point model, the generalized graded unfolding model (GGUM), and (b) to compare the efficacy of different GGUM anchor-item designs and develop empirically based guidelines for practitioners. A Monte Carlo study was conducted to compare the efficacy of three anchor-item designs in vertical and horizontal linking scenarios. The authors found that a block-interlaced design provided the best parameter recovery in nearly all conditions. The implications of these findings for concurrent calibration with the GGUM and practical recommendations for pretest designs involving ideal point computer adaptive testing (CAT) applications are discussed.
Forced-choice item response theory (IRT) models are increasingly used to reduce response biases in noncognitive research and operational testing contexts. As applications have grown, so has the need for methods to link parameters estimated in different examinee groups as a prelude to measurement equivalence testing. This study compared four linking methods for the Zinnes and Griggs (ZG) pairwise preference ideal point model. A Monte Carlo simulation compared test characteristic curve (TCC) linking, item characteristic curve (ICC) linking, mean/mean (M/M) linking, and mean/sigma (M/S) linking. The results indicated that ICC linking and the simpler M/M and M/S methods performed better than TCC linking, with no substantial differences among the three better-performing approaches. In addition, in the absence of possible contamination of the common (anchor) item subset due to differential item functioning, five items should be adequate for estimating the metric transformation coefficients. Our article presents the necessary equations for ZG linking and provides recommendations for practitioners who may be interested in developing and using pairwise preference measures for research and selection purposes.
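The ZG-specific equations appear in the article; as general background, the mean/sigma approach estimates a linear transformation of the provisional metric from the common (anchor) items, which can be sketched in our notation as

\[
\theta^{*} = A\theta + B,
\qquad
A = \frac{\sigma\!\left(b^{\mathrm{ref}}\right)}{\sigma\!\left(b^{\mathrm{new}}\right)},
\qquad
B = \mu\!\left(b^{\mathrm{ref}}\right) - A\,\mu\!\left(b^{\mathrm{new}}\right),
\]

where b^{ref} and b^{new} are the anchor items' location parameters estimated in the reference and new groups; the mean/mean method instead forms A from means of the discrimination-type parameters.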
Causal mediation analysis is the study of mechanisms—variables measured between a treatment and an outcome that partially explain their causal relationship. The past decade has seen an explosion of research in causal mediation analysis, resulting in both conceptual and methodological advancements. However, many of these methods have been out of reach for applied quantitative researchers, due to their complexity and the difficulty of implementing them in standard statistical software distributions. The
A latent variable modeling method is outlined for studying measurement invariance when evaluating latent constructs with multiple binary or binary-scored items with no guessing. The approach extends the continuous-indicator procedure described by Raykov and colleagues, similarly utilizes the false discovery rate approach to multiple testing, and permits one to locate violations of measurement invariance in loading or threshold parameters. The method does not require selection of a reference observed variable and is directly applicable for studying differential item functioning with one- or two-parameter item response models. The extended procedure is illustrated with an empirical data set.
Constructed-response items are commonly used in educational and psychological testing, and the answers to those items are typically scored by human raters. In current rater monitoring processes, validity scoring is used to ensure that the scores assigned by raters do not deviate severely from the standards of rating quality. In this article, an adaptive rater monitoring approach that may improve the efficiency of current rater monitoring practice is proposed. Based on the Rasch partial credit model and known developments in multidimensional computerized adaptive testing, two essay selection methods, namely the D-optimal method and the single Fisher information method, are proposed. These methods are intended to select the most appropriate essays based on what is already known about a rater's performance. Simulation studies, using a simulated essay bank and a cloned real essay bank, show that the proposed adaptive rater monitoring methods can recover rater parameters with far fewer essays. Future challenges and potential solutions are discussed.
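As general background (the article's exact criteria are defined within its partial credit framework), a D-optimal selection rule of the kind adapted from multidimensional CAT chooses, for rater r, the unscored essay that maximally increases the determinant of the accumulated Fisher information about that rater's parameters:

\[
e^{*} = \arg\max_{e \in \mathcal{E}_r}
\det\!\left( \mathbf{I}\!\left(\hat{\boldsymbol{\eta}}_r\right) + \mathbf{I}_e\!\left(\hat{\boldsymbol{\eta}}_r\right) \right),
\]

where \hat{\boldsymbol{\eta}}_r holds the current estimates of rater r's parameters, \mathbf{I}(\hat{\boldsymbol{\eta}}_r) is the information from essays already scored, \mathbf{I}_e(\hat{\boldsymbol{\eta}}_r) is the information an unscored essay e would add, and \mathcal{E}_r is the set of essays available for that rater.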
An increasing concern of producers of educational assessments is fraudulent behavior during the assessment (van der Linden, 2009). Benefiting from item preknowledge (e.g., Eckerly, 2017; McLeod, Lewis, & Thissen, 2003) is one type of fraudulent behavior. This article suggests two new test statistics for detecting individuals who may have benefited from item preknowledge; the statistics can be used for both nonadaptive and adaptive assessments that may include dichotomous items, polytomous items, or both. Each new statistic has an asymptotic standard normal null distribution. Detailed simulation studies demonstrate that the Type I error rates of the new statistics are close to the nominal level and that their power is larger than that of an existing statistic for addressing the same problem.
Recent research has explored the use of models adapted from Mokken scale analysis as a nonparametric approach to evaluating rating quality in educational performance assessments. A potential limiting factor for the widespread use of these techniques is the requirement of complete data, as practical constraints in operational assessment systems often limit the use of complete rating designs. To address this challenge, this study explores the use of missing data imputation techniques and their impact on Mokken-based rating quality indicators related to rater monotonicity, rater scalability, and invariant rater ordering. Simulated data and real data from a rater-mediated writing assessment were modified to reflect varying levels of missingness, and four imputation techniques were used to impute the missing ratings. Overall, the results indicated that simple imputation techniques based on rater and student means generally recover rater monotonicity indices and rater scalability coefficients accurately. However, discrepancies between violations of invariant rater ordering in the original and imputed data were somewhat unpredictable across imputation methods. Implications for research and practice are discussed.
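A minimal sketch of one of the simple imputation strategies, rater-mean imputation followed by Mokken-based rater indices with the mokken package, is given below; the students-by-raters matrix ratings and the rounding rule are assumptions of the sketch, not the study's exact procedure.

## Rater-mean imputation followed by Mokken-based rater quality indices.
## Assumed: ratings is a students x raters matrix with NA for unscored cases,
## and raters are treated as the "items" of the scaling analysis.
library(mokken)

impute_rater_mean <- function(ratings) {
  for (r in seq_len(ncol(ratings))) {
    miss <- is.na(ratings[, r])
    ratings[miss, r] <- round(mean(ratings[, r], na.rm = TRUE))  # impute with the rater's mean
  }
  ratings
}

completed <- impute_rater_mean(ratings)
coefH(completed)                 # rater scalability coefficients (H)
check.monotonicity(completed)    # monotonicity checks for each rater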