Part 3 of 3
The Reliability Primer · Jul 2, 2026 · 8 min read
Overall Kappa can be strong while one student group fails
A worked example on why you must report reliability by subgroup, and a simple protocol any pilot can adopt.
Written by
Gradelab
A single overall Kappa hides everything a fairness reviewer cares about. In the 2023 Stanford HAI evaluation of automated essay scoring, several tools reached quadratic weighted Kappa above 0.75 in aggregate while dropping below 0.55 for English Learners on the same rubric.
The protocol
- Define subgroups before scoring. IEP status, English learner status, and response length quartile are the standard three.
- Score the full sample.
- Compute Kappa within each subgroup, with a 95 percent bootstrap confidence interval.
- Publish the whole table. Do not average.
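The four steps above can be sketched in a few dozen lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`quadratic_weighted_kappa`, `bootstrap_interval`, `kappa_table`), the record layout of `(subgroup, human_score, model_score)`, the percentile bootstrap, and the toy data are all assumptions made for the example. It uses quadratic weighted Kappa because that is the variant quoted above; swap in whichever Kappa your pilot reports.

```python
import random
from collections import defaultdict


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic weighted Kappa for two equal-length lists of integer scores."""
    k = max_score - min_score + 1
    n = len(human)
    if n == 0 or k < 2:
        return float("nan")
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagreement = 0.0  # weighted disagreement actually observed
    chance = 0.0        # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            disagreement += weight * observed[i][j]
            chance += weight * row_totals[i] * col_totals[j] / n
    if chance == 0:
        return float("nan")  # both raters gave every essay the same score
    return 1.0 - disagreement / chance


def bootstrap_interval(human, model, min_score, max_score, n_boot=2000, seed=0):
    """Percentile bootstrap 95 percent interval, resampling essays with replacement."""
    n = len(human)
    if n == 0:
        return float("nan"), float("nan")
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        kappa = quadratic_weighted_kappa(
            [human[i] for i in idx], [model[i] for i in idx], min_score, max_score
        )
        if kappa == kappa:  # drop resamples where Kappa is undefined (NaN)
            estimates.append(kappa)
    if not estimates:
        return float("nan"), float("nan")
    estimates.sort()
    lower = estimates[int(0.025 * (len(estimates) - 1))]
    upper = estimates[int(0.975 * (len(estimates) - 1))]
    return lower, upper


def kappa_table(records, min_score, max_score, n_boot=2000):
    """Build one row per subgroup: (subgroup, n, kappa, lower, upper)."""
    groups = defaultdict(lambda: ([], []))
    for subgroup, h, m in records:
        groups[subgroup][0].append(h)
        groups[subgroup][1].append(m)
    rows = []
    for subgroup in sorted(groups):
        human, model = groups[subgroup]
        kappa = quadratic_weighted_kappa(human, model, min_score, max_score)
        lower, upper = bootstrap_interval(human, model, min_score, max_score, n_boot)
        rows.append((subgroup, len(human), kappa, lower, upper))
    return rows


# Toy, made-up records on a 1-4 rubric, only to show the call shape.
toy_records = [
    ("English learner", 1, 2), ("English learner", 2, 2), ("English learner", 3, 2),
    ("English learner", 4, 3), ("Not English learner", 1, 1), ("Not English learner", 2, 2),
    ("Not English learner", 3, 3), ("Not English learner", 4, 3),
]
for name, n, kappa, lower, upper in kappa_table(toy_records, 1, 4, n_boot=500):
    print(f"{name:20s} n={n:3d} kappa={kappa:.2f} 95% CI [{lower:.2f}, {upper:.2f}]")
```

The table is returned row by row and never pooled, which mirrors the "do not average" step. With subgroups as small as the toy data, the intervals will be very wide; that width is itself worth publishing.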
What passes and what fails
If any subgroup Kappa is more than 0.10 lower than the overall Kappa, that is a fairness flag and requires a written mitigation before the tool goes live. This is the rule adopted by the New York City Department of Education for its 2024 essay tools review.
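The threshold rule is simple enough to automate. The sketch below assumes you already have an overall Kappa and a dictionary of subgroup Kappas; the function name `fairness_flags` and the example numbers are hypothetical, and rounding the gap to six decimals is a choice made here so that a gap of exactly 0.10 is not flagged by floating-point noise.

```python
def fairness_flags(overall_kappa, subgroup_kappas, max_gap=0.10):
    """Return the subgroups whose Kappa is more than max_gap below the overall Kappa."""
    flagged = []
    for name, kappa in subgroup_kappas.items():
        # Round so that a gap of exactly max_gap is not flagged by float error.
        gap = round(overall_kappa - kappa, 6)
        if gap > max_gap:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical numbers, for illustration only.
example = {
    "English learner": 0.54,
    "IEP": 0.71,
    "Shortest length quartile": 0.69,
}
print(fairness_flags(0.78, example))
```

A check like this compares point estimates only. Whether to also look at the bootstrap intervals before flagging is a policy decision the rule above does not settle.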