Title:
Can AI Grade Like a Human? Validity, Reliability, and Fairness in University Coursework Assessment
Language:
English
Authors:
Georgios Zacharis (ORCID 0000-0003-1158-9175), Stamatios Papadakis (ORCID 0000-0003-3184-1147)
Source:
Educational Process: International Journal, Article e2025591, 2025.
Availability:
UNIVERSITEPARK Limited, iTOWER Plaza (No. 61, 9th floor), Merkez Mh. Akar Cd. No. 3, Sisli, Istanbul, Turkey 34382. e-mail: editor@edupij.com; Web site: http://www.edupij.com/
Peer Reviewed:
Y
Page Count:
22
Publication Date:
2025
Document Type:
Journal Articles
Reports - Research
Education Level:
Higher Education
Postsecondary Education
ISSN:
2147-0901
2564-8020
Entry Date:
2025
Accession Number:
EJ1491083
Database:
ERIC

Abstract (As Provided):

Background/purpose: Generative artificial intelligence (GenAI) is often promoted as a transformative tool for assessment, yet evidence of its validity compared with human raters remains limited. This study examined whether an AI-based rater could be used interchangeably with trained faculty in scoring complex coursework.

Materials/methods: Ninety-one essays from teacher education courses at two Greek universities were independently evaluated by two human raters and an AI system using a common rubric.

Results: Human inter-rater reliability was excellent (ICC(2,1) = 0.884; ICC(2,k), k = 2, = 0.938). In contrast, AI-human agreement was substantially weaker (AI vs. Human-Z: ICC(2,1) = 0.406, ICC(2,k) = 0.578; AI vs. Human-S: ICC(2,1) = 0.279, ICC(2,k) = 0.436). The AI consistently inflated scores by 2.71-3.32 points and compressed score distributions, limiting its ability to discriminate across performance levels. Bland-Altman analyses confirmed systematic proportional bias, with over-scoring of weaker work and under-scoring of stronger work. Results also revealed significant inconsistency in AI performance: while the model failed to align with Human-S (κ = 0.017), it demonstrated statistically significant, moderate agreement with Human-Z (κ = 0.367). This discrepancy highlights the lack of standardization in human grading and the sensitivity of algorithms to divergent interpretive frameworks. A principal component analysis suggested that the AI captured a narrower construct of quality than the human raters.

Conclusion: These findings indicate that current GenAI tools are not suitable for high-stakes assessment in higher education, where fairness and construct validity are essential. They may, however, offer value for formative feedback or administrative support if used transparently and under human oversight.
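The ICC(2,1) and ICC(2,k) statistics cited in the abstract follow the Shrout-Fleiss two-way random-effects model for absolute agreement. A minimal pure-Python sketch of that computation is shown below; the rating values are invented for illustration only and are not the study's data.

```python
from statistics import mean

def icc_2(ratings):
    """Two-way random-effects ICC for absolute agreement (Shrout-Fleiss).

    ratings: list of rows, one row per target (e.g. essay); each row
    holds one score per rater. Returns (ICC(2,1), ICC(2,k)).
    """
    n = len(ratings)      # number of targets
    k = len(ratings[0])   # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Mean squares from the two-way ANOVA decomposition
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)   # rows (targets)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)   # columns (raters)
    sse = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))                                # residual

    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average

# Illustrative scores for 4 essays rated by 2 raters (made-up numbers)
single, average = icc_2([[4, 5], [7, 6], [9, 9], [2, 3]])
```

ICC(2,k) is the reliability of the mean of the k raters' scores, so it is always at least as large as ICC(2,1); this is why the paper reports 0.938 for the averaged human ratings versus 0.884 for a single rater.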
