Title:
Can AI Grade Like a Human? Validity, Reliability, and Fairness in University Coursework Assessment
Language:
English
Authors:
Georgios Zacharis (ORCID 0000-0003-1158-9175), Stamatios Papadakis (ORCID 0000-0003-3184-1147)
Source:
Educational Process: International Journal, Article e2025591, 2025, 19.
Availability:
UNIVERSITEPARK Limited, iTOWER Plaza (No. 61, 9th floor), Merkez Mh. Akar Cd. No. 3, Sisli, Istanbul, Turkey 34382. E-mail: editor@edupij.com; Web site: http://www.edupij.com/
Peer Reviewed:
Y
Page Count:
22
Publication Date:
2025
Document Type:
Journal Articles; Reports - Research
Education Level:
Higher Education
Postsecondary Education
ISSN:
2147-0901
2564-8020
Entry Date:
2025
Accession Number:
EJ1491083
Database:
ERIC

Further Information

Background/purpose: Generative artificial intelligence (GenAI) is often promoted as a transformative tool for assessment, yet evidence of its validity compared with human raters remains limited. This study examined whether an AI-based rater could be used interchangeably with trained faculty in scoring complex coursework.

Materials/methods: Ninety-one essays from teacher education courses at two Greek universities were independently evaluated by two human raters and an AI system using a common rubric.

Results: Human inter-rater reliability was excellent (ICC(2,1) = 0.884; ICC(2,k) = 0.938 with k = 2). In contrast, AI-human agreement was substantially weaker (AI vs. Human-Z: ICC(2,1) = 0.406, ICC(2,k) = 0.578; AI vs. Human-S: ICC(2,1) = 0.279, ICC(2,k) = 0.436). The AI consistently inflated scores by 2.71-3.32 points and compressed score distributions, limiting its ability to discriminate across performance levels. Bland-Altman analyses confirmed systematic proportional bias, with over-scoring of weaker work and under-scoring of stronger work. Results also revealed significant inconsistency in AI performance: while the model failed to align with Human-S (κ = 0.017), it demonstrated statistically significant, moderate agreement with Human-Z (κ = 0.367). This discrepancy highlights the lack of standardization in human grading and the sensitivity of algorithms to divergent interpretive frameworks. A principal component analysis suggested that the AI captured a narrower construct of quality than the human raters.

Conclusion: These findings indicate that current GenAI tools are not suitable for high-stakes assessment in higher education, where fairness and construct validity are essential. They may, however, offer value in formative feedback or administrative support if used transparently and under human oversight.
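The agreement statistics reported in the abstract (ICC(2,1), ICC(2,k), and Bland-Altman bias) can be illustrated with a minimal sketch. The functions below implement the standard two-way random-effects, absolute-agreement ICC definitions and the basic Bland-Altman bias with 95% limits of agreement; the data are synthetic, not the study's scores.

```python
import numpy as np

def icc_2(ratings):
    """ICC(2,1) and ICC(2,k): two-way random effects, absolute agreement.

    ratings: array of shape (n_subjects, k_raters).
    Returns (icc_single, icc_average).
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means
    # Partition the total sum of squares into subjects, raters, and error.
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)                 # between-subjects mean square
    ms_c = ss_cols / (k - 1)                 # between-raters mean square
    ms_e = ss_err / ((n - 1) * (k - 1))      # residual mean square
    icc_single = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
    icc_average = (ms_r - ms_e) / (ms_r + (ms_c - ms_e) / n)
    return icc_single, icc_average

def bland_altman(a, b):
    """Mean difference (bias) and 95% limits of agreement between two raters."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Synthetic example: 5 essays, 2 raters who mostly agree.
scores = np.array([[8, 9], [6, 5], [9, 10], [4, 6], [7, 7]], dtype=float)
icc1, icck = icc_2(scores)
bias, lo, hi = bland_altman(scores[:, 1], scores[:, 0])
print(f"ICC(2,1)={icc1:.3f}  ICC(2,k)={icck:.3f}  bias={bias:.2f} [{lo:.2f}, {hi:.2f}]")
```

As in the study, ICC(2,k) (agreement of the averaged raters) is at least as high as ICC(2,1) (agreement of a single rater); a nonzero Bland-Altman bias corresponds to the systematic score inflation the abstract describes.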

As Provided