On the Consistency of Automatic Scoring with Large Language Models

Mingfeng Xue, Xingyao Xiao, Yunting Liu, Mark Wilson

Educational and Psychological Measurement, Ahead of Print

Published online on February 16, 2026

Abstract

Large language models (LLMs) have shown great potential in automatic scoring. However, due to model characteristics and variation in training materials and pipelines, scoring inconsistency can exist within an LLM and across LLMs when rating the same ...