Title:
Can AI Grade Like a Human? Validity, Reliability, and Fairness in University Coursework Assessment
Language:
English
Authors:
Georgios Zacharis (ORCID 0000-0003-1158-9175), Stamatios Papadakis (ORCID 0000-0003-3184-1147)
Source:
Educational Process: International Journal, Article e2025591, 2025.
Availability:
UNIVERSITEPARK Limited, iTOWER Plaza (No. 61, 9th floor), Merkez Mh. Akar Cd. No. 3, Sisli, Istanbul, Turkey 34382. e-mail: editor@edupij.com; Web site: http://www.edupij.com/
Peer Reviewed:
Y
Page Count:
22
Publication Date:
2025
Document Type:
Journal Articles
Reports - Research
Education Level:
Higher Education
Postsecondary Education
ISSN:
2147-0901
2564-8020
Entry Date:
2025
Accession Number:
EJ1491083
Database:
ERIC

Abstract (As Provided):

Background/purpose: Generative artificial intelligence (GenAI) is often promoted as a transformative tool for assessment, yet evidence of its validity compared with human raters remains limited. This study examined whether an AI-based rater could be used interchangeably with trained faculty in scoring complex coursework.

Materials/methods: Ninety-one essays from teacher education courses at two Greek universities were independently evaluated by two human raters and an AI system using a common rubric.

Results: Human inter-rater reliability was excellent (ICC(2,1) = 0.884; ICC(2,k), k = 2, = 0.938). In contrast, AI-human agreement was substantially weaker (AI vs. Human-Z: ICC(2,1) = 0.406, ICC(2,k) = 0.578; AI vs. Human-S: ICC(2,1) = 0.279, ICC(2,k) = 0.436). The AI consistently inflated scores by 2.71-3.32 points and compressed score distributions, limiting its ability to discriminate across performance levels. Bland-Altman analyses confirmed systematic proportional bias, with over-scoring of weaker work and under-scoring of stronger work. Results also revealed significant inconsistency in AI performance: while the model failed to align with Human-S (κ = 0.017), it demonstrated statistically significant, moderate agreement with Human-Z (κ = 0.367). This discrepancy highlights the lack of standardization in human grading and the sensitivity of algorithms to divergent interpretive frameworks. A principal component analysis suggested that the AI captured a narrower construct of quality than the human raters.

Conclusion: These findings indicate that current GenAI tools are not suitable for high-stakes assessment in higher education, where fairness and construct validity are essential. They may, however, offer value for formative feedback or administrative support if used transparently and under human oversight.
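The ICC(2,1) and ICC(2,k) statistics cited in the abstract follow the Shrout-Fleiss two-way random-effects model for absolute agreement. A minimal pure-Python sketch of that computation is shown below; the rating values are invented for illustration only and are not the study's data.

```python
from statistics import mean

def icc_2(ratings):
    """Two-way random-effects ICC for absolute agreement (Shrout-Fleiss).

    ratings: list of rows, one row per target (e.g. essay); each row
    holds one score per rater. Returns (ICC(2,1), ICC(2,k)).
    """
    n = len(ratings)      # number of targets
    k = len(ratings[0])   # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Mean squares from the two-way ANOVA decomposition
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)   # rows (targets)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)   # columns (raters)
    sse = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))                                # residual

    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average

# Illustrative scores for 4 essays rated by 2 raters (made-up numbers)
single, average = icc_2([[4, 5], [7, 6], [9, 9], [2, 3]])
```

ICC(2,k) is the reliability of the mean of the k raters' scores, so it is always at least as large as ICC(2,1); this is why the paper reports 0.938 for the averaged human ratings versus 0.884 for a single rater.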
