Title:
A Comparative Study of the Human, Automated Scoring Model, and GPT-4 Ratings of Young EFL Students' Writing
Language:
English
Authors:
Michael Suhan (ORCID 0000-0002-4734-3794), Mikyung Kim Wolf (ORCID 0000-0002-0622-7754)
Source:
Language Testing, 2026, 43(1), 66-78.
Availability:
SAGE Publications. 2455 Teller Road, Thousand Oaks, CA 91320. Tel: 800-818-7243; Tel: 805-499-9774; Fax: 800-583-2665; e-mail: journals@sagepub.com; Web site: https://sagepub.com
Peer Reviewed:
Y
Page Count:
13
Publication Date:
2026
Document Type:
Journal Articles; Reports - Research
Education Level:
Elementary Education
Secondary Education
DOI:
10.1177/02655322251346860
ISSN:
0265-5322 (print)
1477-0946 (online)
Entry Date:
2026
Accession Number:
EJ1493276
Database:
ERIC

Abstract (As Provided):

*Large language models, such as OpenAI's GPT-4, have the potential to revolutionize automated writing evaluation (AWE). The present study examines the performance of GPT-4 in evaluating the writing of young English as a foreign language (EFL) learners. Responses to three constructed-response tasks (n = 1,908) on Educational Testing Service's (ETS's) TOEFL Junior® Writing test were scored using two different GPT-4 prompting methods. Scores predicted by the two GPT-4 prompting methods and by the test's operational AWE model were compared against human ratings to evaluate the performance of each method on several measures. The results indicated that the GPT-4 methods performed inconsistently across measures, task types, and test forms relative to the test's operational AWE model and human ratings. Implications of the findings for practice and further research are discussed.*
