This article describes the analysis of regression-discontinuity designs (RDDs) using the
Path models with observed composites based on multiple items (e.g., the mean or sum score of the items) are commonly used to test interaction effects. Under this practice, researchers generally assume that the observed composites are measured without error. In this study, we reviewed and evaluated two alternative methods within the structural equation modeling (SEM) framework, namely, the reliability-adjusted product indicator (RAPI) method and the latent moderated structural equations (LMS) method, both of which can flexibly account for measurement error. Results showed that both methods generally produced unbiased estimates of the interaction effects. In contrast, the path model, which ignores measurement error, led to substantial bias and a low confidence interval coverage rate for nonzero interaction effects. Other findings and implications for future studies are discussed.
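As a rough orientation only, the product-indicator idea with a reliability adjustment can be sketched in lavaan as below; the data frame dat, the composites x and z, the outcome y, the reliability values, and the product-reliability approximation are all assumptions of this sketch rather than the article's exact RAPI specification.

## Sketch of a reliability-adjusted product-indicator model in lavaan.
## Assumed (not from the article): data frame dat with composites x, z, outcome y,
## and known composite reliabilities rel_x and rel_z.
library(lavaan)

rel_x <- 0.85                                        # hypothetical reliability of x
rel_z <- 0.80                                        # hypothetical reliability of z
dat$x  <- as.numeric(scale(dat$x, scale = FALSE))    # mean-center the composites
dat$z  <- as.numeric(scale(dat$z, scale = FALSE))
dat$xz <- dat$x * dat$z                              # single product indicator

err_x  <- (1 - rel_x) * var(dat$x)                   # error variance implied by rel_x
err_z  <- (1 - rel_z) * var(dat$z)
r_xz   <- cor(dat$x, dat$z)
rel_xz <- (rel_x * rel_z + r_xz^2) / (1 + r_xz^2)    # Bohrnstedt-Marwell-type approximation
err_xz <- (1 - rel_xz) * var(dat$xz)

model <- sprintf('
  LX  =~ 1*x
  LZ  =~ 1*z
  LXZ =~ 1*xz
  x  ~~ %f*x        # fix error variances from the reliabilities
  z  ~~ %f*z
  xz ~~ %f*xz
  y   ~ LX + LZ + LXZ
', err_x, err_z, err_xz)

fit <- sem(model, data = dat)
summary(fit, standardized = TRUE)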
In psychometric practice, the parameter estimates of a standard item response theory (IRT) model can become biased when item-response data (persons' individual responses to test items) contain outliers relative to the model. Manual removal of outliers can be time-consuming and difficult, and removing outliers discards information that could otherwise be used in parameter estimation. To address these concerns, a Bayesian IRT model is proposed and illustrated that includes person and latent item-response outlier parameters, in addition to person ability and item parameters, and that is defined by item characteristic curves (ICCs), each specified by a robust Student's t-distribution function. The outlier parameters and the robust ICCs enable the model to identify item-response outliers automatically and to make the estimates of person ability and item parameters more robust to outliers. Hence, under this IRT model, it is unnecessary to remove outliers from the data analysis. The model is illustrated through the analysis of two data sets, involving dichotomous- and polytomous-response items, respectively.
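For orientation, a robust ICC of the kind described can be written with the Student's t cumulative distribution function in place of the usual normal ogive; the parameterization below is a generic sketch in our notation and omits the article's person and item-response outlier parameters.

\[
P(Y_{ij} = 1 \mid \theta_i) = F_{\nu}\bigl(a_j(\theta_i - b_j)\bigr),
\qquad
F_{\nu}(x) = \int_{-\infty}^{x}
\frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}
\left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}} dt,
\]

where \theta_i is person ability, a_j and b_j are item discrimination and location, and a small degrees-of-freedom parameter \nu gives the curve heavier tails, so that outlying responses exert less influence on the ability and item parameter estimates.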
Test items scored polytomously have the potential to display multidimensionality across rating scale score categories. This article uses a multidimensional nominal response model (MNRM) to examine the possibility that the proficiency dimension, or dimensional composite, best measured by a polytomously scored item may vary by score category, an issue not generally considered in multidimensional item response theory (MIRT). Some practical considerations in exploring rubric-related multidimensionality, including potential consequences of not attending to it, are illustrated through simulation examples. A real-data application examining item format effects is presented using the 2007 administration of the Trends in International Mathematics and Science Study (TIMSS) among eighth graders in the United States.
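In generic notation (ours, not necessarily the article's), the MNRM gives each score category of an item its own slope vector, which is what allows the best-measured dimension or composite to differ across categories:

\[
P(X_{ij} = k \mid \boldsymbol{\theta}_i)
= \frac{\exp\!\left(\mathbf{a}_{jk}^{\top}\boldsymbol{\theta}_i + c_{jk}\right)}
       {\sum_{m=0}^{K_j - 1}\exp\!\left(\mathbf{a}_{jm}^{\top}\boldsymbol{\theta}_i + c_{jm}\right)},
\]

where \mathbf{a}_{jk} and c_{jk} are the slope vector and intercept for category k of item j; the direction of \mathbf{a}_{jk} in the latent space determines which dimensional composite that category measures best.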
Cluster randomized trials involving participants nested within intact treatment and control groups are commonly performed in educational, psychological, and biomedical studies. However, recruiting and retaining intact groups present various practical, financial, and logistical challenges to evaluators, and cluster randomized trials are often performed with a small number of clusters (around 20 groups). Although multilevel models are often used to analyze nested data, researchers may be concerned about potentially biased results when only a few groups are under study. Cluster bootstrapping has been suggested as an alternative procedure for analyzing clustered data, although it has seen very little use in educational and psychological studies. Using a Monte Carlo simulation that varied the number of clusters, the average cluster size, and the intraclass correlation, we compared standard errors from cluster bootstrapping with those derived from ordinary least squares regression and multilevel models. Results indicate that cluster bootstrapping, though more computationally demanding, can be used as an alternative procedure for the analysis of clustered data when treatment effects at the group level are of primary interest. Supplementary material showing how to perform cluster bootstrapped regressions in R is also provided.
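The supplementary material itself is not reproduced here, but the basic logic of a cluster (case) bootstrap can be sketched in a few lines of R; the data frame dat and its columns y, trt, and cid are placeholders assumed for illustration.

## Sketch of a cluster bootstrap for a group-level treatment effect.
## Assumed: data frame dat with outcome y, treatment indicator trt
## (constant within cluster), and cluster identifier cid.
set.seed(2024)

cluster_boot_se <- function(dat, B = 2000) {
  ids <- unique(dat$cid)
  boot_coefs <- matrix(NA_real_, nrow = B, ncol = 2,
                       dimnames = list(NULL, c("(Intercept)", "trt")))
  for (b in seq_len(B)) {
    sampled <- sample(ids, length(ids), replace = TRUE)        # resample whole clusters
    boot_dat <- do.call(rbind, lapply(sampled, function(id) dat[dat$cid == id, ]))
    boot_coefs[b, ] <- coef(lm(y ~ trt, data = boot_dat))      # refit on the resampled data
  }
  apply(boot_coefs, 2, sd)                                     # bootstrap standard errors
}

cluster_boot_se(dat)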
This article examines the interdependency of two context effects that are known to occur regularly in large-scale assessments: item position effects and effects of test-taking effort on the probability of correctly answering an item. A microlongitudinal design was used to measure test-taking effort over the course of a 60-minute large-scale assessment. Two components of test-taking effort were investigated: initial effort and change in effort. Both components significantly affected the probability of solving an item. In addition, participants' current test-taking effort diminished considerably over the course of the test. Furthermore, a substantial linear position effect was found, indicating that item difficulty increased during the test. This position effect varied considerably across persons. Concerning the interplay of position effects and test-taking effort, only the change in effort moderated the position effect, and persons differed with respect to this moderation effect. The consequences of these results for the reliability and validity of large-scale assessments are discussed.
This article outlines a procedure for examining the degree to which a common factor may dominate additional factors in a multicomponent measuring instrument consisting of binary items. The procedure rests on an application of latent variable modeling methodology and accounts for the discrete nature of the manifest indicators. The method provides point and interval estimates of (a) the proportion of the variance explained by all factors that is due to the common (global) factor and (b) the proportion of the variance explained by all factors that is due to some or all of the other (local) factors. The approach can also be readily used as a means of assessing approximate unidimensionality when deciding between unidimensional and multidimensional item response modeling. The procedure is similarly applicable to highly discrete (e.g., Likert-type) ordinal items and is illustrated with a numerical example.
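Although the exact estimands are defined within the article's latent variable modeling framework, their flavor can be conveyed with a bifactor-style decomposition of standardized loadings (a sketch in our notation):

\[
\pi_{\text{global}}
= \frac{\sum_{j}\lambda_{jG}^{2}}
       {\sum_{j}\lambda_{jG}^{2} + \sum_{s}\sum_{j}\lambda_{js}^{2}},
\qquad
\pi_{\text{local}} = 1 - \pi_{\text{global}},
\]

where \lambda_{jG} is item j's standardized loading on the common (global) factor and \lambda_{js} is its loading on local factor s; values of \pi_{\text{global}} near 1 indicate that the global factor dominates, supporting approximate unidimensionality.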
The Model With Internal Restrictions on Item Difficulty (MIRID; Butter, 1994) has been useful for investigating cognitive behavior in terms of the processes that lead to that behavior. The main objective of the MIRID model is to enable one to test how component processes influence complex cognitive behavior in terms of the item parameters. The original MIRID model is, however, fairly restrictive for a number of reasons. One of these restrictions is that the model treats items as fixed and so does not fit measurement contexts in which items need to be treated as random. In this article, random item approaches to the MIRID model are proposed, and both simulation and empirical studies are conducted to test and illustrate the random item MIRID models. The simulation and empirical studies show that the random item MIRID models provide more accurate estimates when substantial random errors exist, and thus these models may be preferable in such settings.
The assumption of local independence is central to all item response theory (IRT) models. Violations can lead to inflated estimates of reliability and problems with construct validity. For Q3, the most widely used fit statistic for detecting local dependence (LD), there are currently no well-documented guidelines for the critical values that should be used, and as a result a variety of arbitrary rules of thumb are applied. In this study, an empirical data example and a Monte Carlo simulation were used to investigate the factors that influence the null distribution of residual correlations, with the objective of proposing guidelines that researchers and practitioners can follow when making decisions about LD during scale development and validation. We recommend that a parametric bootstrapping procedure be implemented in each situation to obtain the critical value of LD applicable to the data set at hand, and we provide example critical values for a number of data structures. The results show that for the Q3 statistic no single critical value is appropriate for all situations, as the percentiles of the empirical null distribution are influenced by the number of items, the sample size, and the number of response categories. Furthermore, the results show that LD should be considered relative to the average observed residual correlation, rather than relative to a uniform value, as this yields more stable percentiles for the null distribution of an adjusted fit statistic.
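A parametric bootstrap of this kind can be sketched in R; the use of the mirt package, the Rasch item type, and the focus on the maximum Q3 value are assumptions of this illustration, not prescriptions from the article.

## Parametric bootstrap of the null distribution of Q3 for dichotomous responses.
## Assumed: response matrix resp and the mirt package; B and alpha are illustrative.
library(mirt)

q3_critical <- function(resp, B = 200, itemtype = "Rasch", alpha = 0.05) {
  fit  <- mirt(resp, 1, itemtype = itemtype, verbose = FALSE)
  pars <- coef(fit, simplify = TRUE)$items
  max_q3 <- numeric(B)
  for (b in seq_len(B)) {
    ## simulate from the fitted model, so local independence holds by construction
    sim <- simdata(a = as.matrix(pars[, "a1"]), d = as.matrix(pars[, "d"]),
                   N = nrow(resp), itemtype = "dich")
    sim_fit <- mirt(sim, 1, itemtype = itemtype, verbose = FALSE)
    q3 <- residuals(sim_fit, type = "Q3")
    max_q3[b] <- max(q3[lower.tri(q3)])                # largest residual correlation
  }
  quantile(max_q3, 1 - alpha)                          # data-specific critical value
}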
The correlated trait–correlated method (CTCM) model for the analysis of multitrait–multimethod (MTMM) data is known to suffer from convergence and admissibility (C&A) problems. We describe a little-known and seldom-applied reparameterized version of this model (CTCM-R) based on Rindskopf's reparameterization of the simpler confirmatory factor analysis model. In a Monte Carlo study, we compare the CTCM, CTCM-R, and correlated trait–correlated uniqueness (CTCU) models in terms of C&A, model fit, and parameter estimation bias. The CTCM-R model largely avoided the C&A problems associated with the more traditional CTCM model, producing C&A solutions nearly as often as the CTCU model while also avoiding the parameter estimation biases known to plague the CTCU model. As such, the CTCM-R model is an attractive alternative for the analysis of MTMM data.
Researchers designing multisite and cluster randomized trials of educational interventions will usually conduct a power analysis in the planning stage of the study. To conduct the power analysis, researchers often use estimates of intracluster correlation coefficients and effect sizes derived from an analysis of survey data. When there is heterogeneity in treatment effects across the clusters in the study, these parameters will need to be adjusted to produce an accurate power analysis for a hierarchical trial design. The relevant adjustment factors are derived and presented in the current article. The adjustment factors depend upon the covariance between treatment effects and cluster-specific average values of the outcome variable, illustrating the need for better information about this parameter. The results in the article also facilitate understanding of the relative power of multisite and cluster randomized studies conducted on the same population by showing how the parameters necessary to compute power in the two types of designs are related. This is accomplished by relating parameters defined by linear mixed model specifications to parameters defined in terms of potential outcomes.
A first-order latent growth model assesses change in an unobserved construct from a single score and is commonly used across different domains of educational research. However, examining change using a set of multiple response scores (e.g., scale items) affords researchers several methodological benefits not possible when using a single score. A curve of factors (CUFFS) model assesses change in a construct from multiple response scores, but its use in the social sciences has been limited. In this article, we advocate the CUFFS model for analyzing a construct's latent trajectory over time, with an emphasis on applying this model to educational research. First, we present a review of longitudinal factorial invariance, a condition necessary for ensuring that the measured construct is the same across time points. Next, we introduce the CUFFS model, followed by an illustration of testing factorial invariance and fitting a univariate and a bivariate CUFFS model to longitudinal data. To facilitate implementation, we include syntax for specifying these models in the free statistical software R.
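The article supplies its own R syntax; purely as a generic orientation, a univariate curve-of-factors (second-order growth) model with three items measured at three waves might be specified in lavaan roughly as below. The variable names, the particular invariance constraints shown, and the data frame mydata are placeholders, not the article's code.

## Sketch of a univariate CUFFS (second-order growth) model in lavaan.
library(lavaan)

cuffs_model <- '
  # first-order factors, equal loadings across waves (weak invariance)
  eta1 =~ 1*x1_t1 + L2*x2_t1 + L3*x3_t1
  eta2 =~ 1*x1_t2 + L2*x2_t2 + L3*x3_t2
  eta3 =~ 1*x1_t3 + L2*x2_t3 + L3*x3_t3

  # equal item intercepts across waves (strong invariance); marker intercept fixed at 0
  x1_t1 ~ 0*1
  x1_t2 ~ 0*1
  x1_t3 ~ 0*1
  x2_t1 ~ I2*1
  x2_t2 ~ I2*1
  x2_t3 ~ I2*1
  x3_t1 ~ I3*1
  x3_t2 ~ I3*1
  x3_t3 ~ I3*1

  # second-order intercept and slope factors defined on the first-order factors
  i =~ 1*eta1 + 1*eta2 + 1*eta3
  s =~ 0*eta1 + 1*eta2 + 2*eta3

  # first-order factor intercepts fixed to 0 so the growth-factor means are identified
  eta1 ~ 0*1
  eta2 ~ 0*1
  eta3 ~ 0*1
  i ~ 1
  s ~ 1

  # residuals of the same item correlated over time
  x1_t1 ~~ x1_t2 + x1_t3
  x1_t2 ~~ x1_t3
  x2_t1 ~~ x2_t2 + x2_t3
  x2_t2 ~~ x2_t3
  x3_t1 ~~ x3_t2 + x3_t3
  x3_t2 ~~ x3_t3
'

fit <- sem(cuffs_model, data = mydata)    # mydata: hypothetical wide-format data
summary(fit, fit.measures = TRUE)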
Meta-analysis is a statistical technique that allows an analyst to synthesize effect sizes from multiple primary studies. To estimate meta-analysis models, the open-source statistical environment R is quickly becoming a popular choice. The meta-analytic community has contributed to this growth by developing numerous packages specific to meta-analysis. The purpose of this study is to locate all publicly available meta-analytic R packages. We located 63 such packages via a comprehensive online search. To help make their functionality known to the field, we describe each of the packages, recommend applications for researchers interested in using R for meta-analyses, provide a brief tutorial of two meta-analysis packages, and make suggestions for future meta-analytic R package creators.
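By way of illustration (this is not the article's tutorial), a basic random-effects meta-analysis in one widely used package, metafor, looks like the following; the effect sizes and sampling variances are made up.

## Random-effects meta-analysis with metafor; yi are hypothetical standardized
## mean differences and vi their sampling variances.
library(metafor)

dat <- data.frame(
  yi = c(0.32, 0.15, 0.48, 0.07, 0.26),
  vi = c(0.040, 0.025, 0.060, 0.030, 0.045)
)

res <- rma(yi = yi, vi = vi, data = dat, method = "REML")  # REML random-effects model
summary(res)   # pooled estimate, tau^2, and heterogeneity tests
forest(res)    # forest plot of the individual and pooled effects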
Concurrent calibration using anchor items has proven to be an effective alternative to separate calibration and linking for developing large item banks, which are needed to support continuous testing. In principle, anchor-item designs and estimation methods that have proven effective with dominance item response theory (IRT) models, such as the 3PL model, should also lead to accurate parameter recovery with ideal point IRT models, but surprisingly little research has been devoted to this issue. This study therefore had two purposes: (a) to develop software for concurrent calibration with what is now the most widely used ideal point model, the generalized graded unfolding model (GGUM), and (b) to compare the efficacy of different GGUM anchor-item designs and develop empirically based guidelines for practitioners. A Monte Carlo study was conducted to compare the efficacy of three anchor-item designs in vertical and horizontal linking scenarios. The authors found that a block-interlaced design provided the best parameter recovery in nearly all conditions. The implications of these findings for concurrent calibration with the GGUM and practical recommendations for pretest designs involving ideal point computer adaptive testing (CAT) applications are discussed.
Forced-choice item response theory (IRT) models are increasingly used to reduce response biases in noncognitive research and operational testing contexts. As applications have grown, so has the need for methods to link parameters estimated in different examinee groups as a prelude to measurement equivalence testing. This study compared four linking methods for the Zinnes and Griggs (ZG) pairwise preference ideal point model. A Monte Carlo simulation compared test characteristic curve (TCC) linking, item characteristic curve (ICC) linking, mean/mean (M/M) linking, and mean/sigma (M/S) linking. The results indicated that ICC linking and the simpler M/M and M/S methods performed better than TCC linking, with no substantial differences among the three better-performing approaches. In addition, in the absence of possible contamination of the common (anchor) item subset due to differential item functioning, five items should be adequate for estimating the metric transformation coefficients. Our article presents the necessary equations for ZG linking and provides recommendations for practitioners who may be interested in developing and using pairwise preference measures for research and selection purposes.
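The ZG-specific equations appear in the article; as general background, the mean/sigma approach estimates a linear transformation of the provisional metric from the common (anchor) items, which can be sketched in our notation as

\[
\theta^{*} = A\theta + B,
\qquad
A = \frac{\sigma\!\left(b^{\mathrm{ref}}\right)}{\sigma\!\left(b^{\mathrm{new}}\right)},
\qquad
B = \mu\!\left(b^{\mathrm{ref}}\right) - A\,\mu\!\left(b^{\mathrm{new}}\right),
\]

where b^{ref} and b^{new} are the anchor items' location parameters estimated in the reference and new groups; the mean/mean method instead forms A from means of the discrimination-type parameters.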
Causal mediation analysis is the study of mechanisms—variables measured between a treatment and an outcome that partially explain their causal relationship. The past decade has seen an explosion of research in causal mediation analysis, resulting in both conceptual and methodological advancements. However, many of these methods have been out of reach for applied quantitative researchers, due to their complexity and the difficulty of implementing them in standard statistical software distributions. The
A latent variable modeling method is outlined for studying measurement invariance when evaluating latent constructs with multiple binary or binary-scored items with no guessing. The approach extends the continuous-indicator procedure described by Raykov and colleagues, similarly utilizes the false discovery rate approach to multiple testing, and permits one to locate violations of measurement invariance in loading or threshold parameters. The method does not require selection of a reference observed variable and is directly applicable for studying differential item functioning with one- or two-parameter item response models. The extended procedure is illustrated with an empirical data set.
Constructed-response items are commonly used in educational and psychological testing, and the answers to those items are typically scored by human raters. In current rater monitoring processes, validity scoring is used to ensure that the scores assigned by raters do not deviate severely from the standards of rating quality. In this article, an adaptive rater monitoring approach that may improve the efficiency of current rater monitoring practice is proposed. Based on the Rasch partial credit model and known developments in multidimensional computerized adaptive testing, two essay selection methods, namely the D-optimal method and the single Fisher information method, are proposed. These methods are intended to select the most appropriate essays based on what is already known about a rater's performance. Simulation studies, using a simulated essay bank and a cloned real essay bank, show that the proposed adaptive rater monitoring methods can recover rater parameters with far fewer essays. Future challenges and potential solutions are discussed.
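As general background (the article's exact criteria are defined within its partial credit framework), a D-optimal selection rule of the kind adapted from multidimensional CAT chooses, for rater r, the unscored essay that maximally increases the determinant of the accumulated Fisher information about that rater's parameters:

\[
e^{*} = \arg\max_{e \in \mathcal{E}_r}
\det\!\left( \mathbf{I}\!\left(\hat{\boldsymbol{\eta}}_r\right) + \mathbf{I}_e\!\left(\hat{\boldsymbol{\eta}}_r\right) \right),
\]

where \hat{\boldsymbol{\eta}}_r holds the current estimates of rater r's parameters, \mathbf{I}(\hat{\boldsymbol{\eta}}_r) is the information from essays already scored, \mathbf{I}_e(\hat{\boldsymbol{\eta}}_r) is the information an unscored essay e would add, and \mathcal{E}_r is the set of essays available for that rater.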
An increasing concern of producers of educational assessments is fraudulent behavior during the assessment (van der Linden, 2009). Benefiting from item preknowledge (e.g., Eckerly, 2017; McLeod, Lewis, & Thissen, 2003) is one type of fraudulent behavior. This article suggests two new test statistics for detecting individuals who may have benefited from item preknowledge; the statistics can be used for both nonadaptive and adaptive assessments that may include dichotomous items, polytomous items, or both. Each new statistic has an asymptotic standard normal null distribution. Detailed simulation studies demonstrate that the Type I error rates of the new statistics are close to the nominal level and that their power is larger than that of an existing statistic for addressing the same problem.
Recent research has explored the use of models adapted from Mokken scale analysis as a nonparametric approach to evaluating rating quality in educational performance assessments. A potential limiting factor for the widespread use of these techniques is the requirement of complete data, as practical constraints in operational assessment systems often limit the use of complete rating designs. To address this challenge, this study explores the use of missing data imputation techniques and their impact on Mokken-based rating quality indicators related to rater monotonicity, rater scalability, and invariant rater ordering. Simulated data and real data from a rater-mediated writing assessment were modified to reflect varying levels of missingness, and four imputation techniques were used to impute the missing ratings. Overall, the results indicated that simple imputation techniques based on rater and student means generally recover rater monotonicity indices and rater scalability coefficients accurately. However, discrepancies between violations of invariant rater ordering in the original and imputed data were somewhat unpredictable across imputation methods. Implications for research and practice are discussed.
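A minimal sketch of one of the simple imputation strategies, rater-mean imputation followed by Mokken-based rater indices with the mokken package, is given below; the students-by-raters matrix ratings and the rounding rule are assumptions of the sketch, not the study's exact procedure.

## Rater-mean imputation followed by Mokken-based rater quality indices.
## Assumed: ratings is a students x raters matrix with NA for unscored cases,
## and raters are treated as the "items" of the scaling analysis.
library(mokken)

impute_rater_mean <- function(ratings) {
  for (r in seq_len(ncol(ratings))) {
    miss <- is.na(ratings[, r])
    ratings[miss, r] <- round(mean(ratings[, r], na.rm = TRUE))  # impute with the rater's mean
  }
  ratings
}

completed <- impute_rater_mean(ratings)
coefH(completed)                 # rater scalability coefficients (H)
check.monotonicity(completed)    # monotonicity checks for each rater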