Can AI Grade Like a Human? Validity, Reliability, and Fairness in University Coursework Assessment
Postsecondary Education
ISSN: 2564-8020
Background/purpose: Generative artificial intelligence (GenAI) is often promoted as a transformative tool for assessment, yet evidence of its validity relative to human raters remains limited. This study examined whether an AI-based rater could be used interchangeably with trained faculty in scoring complex coursework.
Materials/methods: Ninety-one essays from teacher education courses at two Greek universities were independently evaluated by two human raters and an AI system using a common rubric.
Results: Human inter-rater reliability was excellent (ICC(2,1) = 0.884; ICC(2,k) = 0.938, k = 2). In contrast, AI-human agreement was substantially weaker (AI vs. Human-Z: ICC(2,1) = 0.406, ICC(2,k) = 0.578; AI vs. Human-S: ICC(2,1) = 0.279, ICC(2,k) = 0.436). The AI consistently inflated scores by 2.71-3.32 points and compressed score distributions, limiting its ability to discriminate across performance levels. Bland-Altman analyses confirmed systematic proportional bias, with over-scoring of weaker work and under-scoring of stronger work. Results also revealed marked inconsistency in AI performance: while the model failed to align with Human-S (κ = 0.017), it showed statistically significant, moderate agreement with Human-Z (κ = 0.367). This discrepancy highlights the lack of standardization in human grading and the sensitivity of algorithms to divergent interpretive frameworks. A principal component analysis suggested that the AI captured a narrower construct of quality than the human raters.
Conclusion: These findings indicate that current GenAI tools are not suitable for high-stakes assessment in higher education, where fairness and construct validity are essential. They may, however, offer value in formative feedback or administrative support if used transparently and under human oversight.
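For readers who want to compute agreement statistics of this kind on their own rater data, the following is a minimal Python sketch of the Shrout-Fleiss two-way random-effects ICC(2,1) and ICC(2,k), plus Cohen's kappa on banded grades. The score matrix, the 0-20 rubric scale, and the grade bands are illustrative placeholders, not the study's data or procedure.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def icc2(scores: np.ndarray) -> tuple[float, float]:
    """Shrout-Fleiss ICC(2,1) and ICC(2,k) for an (n_targets, k_raters) matrix."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-essay means
    col_means = scores.mean(axis=0)   # per-rater means

    # Two-way ANOVA mean squares (no replication)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between essays
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    sse = np.sum((scores - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                         # residual

    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average

# Illustrative data: five essays scored by two raters (columns), 0-20 scale assumed.
ratings = np.array([
    [12.0, 13.0],
    [15.0, 13.5],
    [ 8.0,  9.0],
    [18.0, 17.0],
    [10.0, 11.0],
])
single, average = icc2(ratings)
print(f"ICC(2,1) = {single:.3f}, ICC(2,k) = {average:.3f}")

# Kappa requires categorical grades; band the numeric scores first (cut points are illustrative).
bands = np.digitize(ratings, bins=[10, 14, 17])   # e.g. fail / pass / merit / distinction
print("Cohen's kappa =", cohen_kappa_score(bands[:, 0], bands[:, 1]))
```

In practice the same matrix would simply have 91 rows (one per essay) and a column per rater, with AI-human agreement computed on each AI-rater pair separately, as reported in the abstract.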