Part 2 of 3
The Reliability Primer · Jul 2, 2026 · 7 min read
How large a sample do you need before you trust an AI grader
A short, practical guide to sample size for a rubric grading pilot. Power, confidence intervals, and why 30 essays is almost never enough.
Written by
Gradelab
Every AI grading pilot begins with the same question: how many essays do we need to score before we know whether the tool works? The answer is usually larger than the vendor deck suggests, and it depends on the effect size you care about, not on a round number.
Start with the effect you care about
If the tool needs to reach a Kappa of 0.80 to clear your policy, and the pilot returns 0.78, is that a fail or a rounding error? You cannot tell without a confidence interval. Sian and Fleiss (1980) give closed-form standard errors for Kappa. For a 4-level rubric with balanced marginals and a target Kappa of 0.80, a 95 percent confidence interval half-width of 0.05 needs roughly 250 essays. Because the standard error shrinks with the square root of the sample size, halving the interval means quadrupling the sample.
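A rough way to check the scaling yourself is sketched below. It is not the closed form cited above. It assumes a simpler large-sample approximation, SE ≈ sqrt(p_o(1 - p_o) / n) / (1 - p_e), with balanced marginals so that chance agreement p_e is 1 divided by the number of rubric levels. The function name `kappa_sample_size` and its parameters are made up for this illustration.

```python
import math
from statistics import NormalDist


def kappa_sample_size(kappa, n_levels, half_width, level=0.95):
    """Essays needed for a target CI half-width, assuming balanced marginals.

    Assumption of this sketch: the simple large-sample approximation
    SE(kappa) ~= sqrt(p_o * (1 - p_o) / n) / (1 - p_e).
    """
    p_e = 1.0 / n_levels                 # chance agreement, balanced marginals
    p_o = p_e + kappa * (1.0 - p_e)      # observed agreement implied by kappa
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided critical value
    variance_times_n = p_o * (1.0 - p_o) / (1.0 - p_e) ** 2
    return math.ceil(variance_times_n * (z / half_width) ** 2)


# Same target Kappa and rubric, two interval widths.
n_wide = kappa_sample_size(0.80, 4, 0.05)
n_narrow = kappa_sample_size(0.80, 4, 0.025)
print(n_wide, n_narrow, round(n_narrow / n_wide, 2))
```

The ratio between the two results should sit very close to 4, which is the halve-and-quadruple rule in action. The absolute numbers move with the confidence level and with the standard error formula you choose, so treat them as a check on the order of magnitude, not as a replacement for the figure quoted above.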
Bootstrap when the marginals are lopsided
Closed-form standard errors behave best when the marginals are smooth. Real classrooms cluster around one or two rubric levels. A 2000-iteration nonparametric bootstrap on the confusion matrix, which amounts to resampling the scored essays with replacement and recomputing Kappa each time, produces intervals that respect the actual distribution and takes seconds in R or Python.
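Here is a minimal sketch of that bootstrap in plain Python, using a percentile interval and unweighted Cohen's Kappa. The confusion matrix `pilot` is invented for illustration (rows are the human score, columns are the AI score), and the helper names `cohen_kappa` and `bootstrap_kappa_ci` are hypothetical, not from any library.

```python
import random


def cohen_kappa(matrix):
    """Unweighted Cohen's Kappa from a square confusion matrix of counts."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(k)) / n
    row_tot = [sum(matrix[i]) for i in range(k)]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    if p_e == 1.0:
        return float("nan")  # Kappa is undefined if everything lands in one level
    return (p_o - p_e) / (1.0 - p_e)


def bootstrap_kappa_ci(matrix, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for Kappa, resampling essays with replacement."""
    rng = random.Random(seed)
    k = len(matrix)
    cells = [(i, j) for i in range(k) for j in range(k)]
    weights = [matrix[i][j] for i, j in cells]
    n = sum(weights)
    stats = []
    for _ in range(n_boot):
        resampled = [[0] * k for _ in range(k)]
        # Drawing n cells weighted by their counts is the same as
        # drawing n (human, AI) score pairs with replacement.
        for i, j in rng.choices(cells, weights=weights, k=n):
            resampled[i][j] += 1
        value = cohen_kappa(resampled)
        if value == value:  # drop undefined (NaN) replicates
            stats.append(value)
    stats.sort()
    alpha = (1.0 - level) / 2.0
    lower = stats[int(alpha * len(stats))]
    upper = stats[min(len(stats) - 1, int((1.0 - alpha) * len(stats)))]
    return lower, upper


# Hypothetical lopsided pilot: most essays sit in the two middle levels.
pilot = [
    [8, 3, 0, 0],
    [2, 60, 10, 1],
    [0, 9, 90, 6],
    [0, 1, 4, 16],
]
point = cohen_kappa(pilot)
lower, upper = bootstrap_kappa_ci(pilot)
print(round(point, 3), round(lower, 3), round(upper, 3))
```

The percentile interval is the simplest choice. If your policy threshold sits close to the edge of the interval, a bias-corrected variant may be worth the extra effort.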
Practical minimums
- Formative feedback tools, low stakes. 100 essays per subgroup you care about.
- Summative grading. 300 essays minimum, stratified by rubric level and subgroup.
- High stakes admission or promotion. 500 essays and a second independent human rater.
None of these are magic. They are the sample sizes at which confidence intervals stop being embarrassing.