Title:
A Comparative Study of the Human, Automated Scoring Model, and GPT-4 Ratings of Young EFL Students' Writing
Language:
English
Authors:
Michael Suhan (ORCID 0000-0002-4734-3794), Mikyung Kim Wolf (ORCID 0000-0002-0622-7754)
Source:
Language Testing, 2026, 43(1), 66-78.
Availability:
SAGE Publications. 2455 Teller Road, Thousand Oaks, CA 91320. Tel: 800-818-7243; Tel: 805-499-9774; Fax: 800-583-2665; e-mail: journals@sagepub.com; Web site: https://sagepub.com
Peer Reviewed:
Y
Page Count:
13
Publication Date:
2026
Document Type:
Journal Articles; Reports - Research
Education Level:
Elementary Education
Secondary Education
DOI:
10.1177/02655322251346860
ISSN:
0265-5322 (print)
1477-0946 (online)
Entry Date:
2026
Accession Number:
EJ1493276
Database:
ERIC

Abstract (As Provided):

*Large language models, such as OpenAI's GPT-4, have the potential to revolutionize automated writing evaluation (AWE). The present study examines the performance of GPT-4 in evaluating the writing of young English as a foreign language (EFL) learners. Responses to three constructed-response tasks (n = 1,908) on Educational Testing Service's (ETS's) TOEFL Junior® Writing test were scored using two different GPT-4 prompting methods. Scores predicted by the two GPT-4 prompting methods and by the test's operational AWE model were compared against human ratings to evaluate the performance of each method on several measures. The results indicated that the GPT-4 methods performed inconsistently across measures, task types, and test forms relative to the test's operational AWE model and human ratings. Implications of the findings for practice and further research are discussed.*
