Series · 3 parts

The Reliability Primer

A three-part series on measuring whether an AI grader is actually working. Written for teachers, department chairs, and procurement teams.

1
The Kappa Statistic: what agreement really means when an AI grades a student
Percent agreement flatters every AI grader on the market. Cohen's Kappa strips out the luck. Here is how the statistic works, where it breaks, and the thresholds that actually matter for classroom use.
Jul 2, 2026 · 11 min
2
How large a sample do you need before you trust an AI grader?
A short, practical guide to sample size for a rubric-grading pilot. Power, confidence intervals, and why 30 essays is almost never enough.
Jul 2, 2026 · 7 min
3
Overall Kappa can be strong while one student group fails
A worked example on why you must report reliability by subgroup, and a simple protocol any pilot can adopt.
Jul 2, 2026 · 8 min