A Comparative Study of the Human, Automated Scoring Model, and GPT-4 Ratings of Young EFL Students' Writing
Education Level: Secondary Education
ISSN: 1477-0946
*Large language models, such as OpenAI's GPT-4, have the potential to revolutionize automated writing evaluation (AWE). The present study examines the performance of the GPT-4 model in evaluating the writing of young English as a foreign language (EFL) learners. Responses (n = 1908) to three constructed-response tasks on Educational Testing Service's (ETS's) TOEFL Junior® Writing test were scored using two different GPT-4 prompting methods. Scores predicted by the two GPT-4 prompting methods and by the TOEFL Junior Writing test's operational AWE model were compared against human ratings on several measures. The results indicated that, relative to the test's AWE model and human ratings, the performance of the GPT-4 prompting methods was inconsistent across measures, task types, and test forms. Implications of the findings for practice and further research are discussed.*