โ† Blog

Jul 2, 2026 · 7 min read

How large a sample do you need before you trust an AI grader

A short, practical guide to sample size for a rubric grading pilot. Power, confidence intervals, and why 30 essays is almost never enough.

Written by

Gradelab
methodology, statistics, pilots

Every AI grading pilot begins with the same question: how many essays do we need to score before we know whether the tool works? The answer is usually larger than the vendor deck suggests, and it depends on the effect size you care about, not on a round number.

Start with the effect you care about

If the tool needs to reach a Kappa of 0.80 to clear your policy, and the pilot returns 0.78, is that a fail or a rounding error? The answer depends on how wide the confidence interval around 0.78 is. Sian and Fleiss (1980) give closed form standard errors for Kappa. For a 4 level rubric with balanced marginals and a target Kappa of 0.80, the usual large sample approximation puts a 95 percent confidence interval half width of 0.05 at roughly 350 essays. The half width shrinks with the square root of the sample size, so to halve the interval you need to quadruple the sample.
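The arithmetic behind a figure like this can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive calculation: it assumes unweighted Cohen's Kappa, perfectly balanced marginals (so chance agreement is 1 divided by the number of rubric levels), and the simple large sample standard error sqrt(p_o (1 - p_o) / n) / (1 - p_e). The function name kappa_sample_size and its parameters are illustrative, not taken from any library.

```python
import math
from statistics import NormalDist


def kappa_sample_size(kappa, n_levels, half_width, confidence=0.95):
    """Approximate essays needed for a target CI half width on Cohen's Kappa.

    Assumptions (illustrative): balanced marginals for both raters and the
    simple large sample standard error SE = sqrt(p_o (1 - p_o) / n) / (1 - p_e).
    """
    p_e = 1.0 / n_levels                      # chance agreement with balanced marginals
    p_o = p_e + kappa * (1.0 - p_e)           # observed agreement implied by the target Kappa
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided normal quantile
    variance_per_essay = p_o * (1.0 - p_o) / (1.0 - p_e) ** 2
    # Solve z * sqrt(variance_per_essay / n) = half_width for n, rounding up.
    return math.ceil(variance_per_essay * (z / half_width) ** 2)


n = kappa_sample_size(kappa=0.80, n_levels=4, half_width=0.05)
n_half = kappa_sample_size(kappa=0.80, n_levels=4, half_width=0.025)
print(n, n_half, round(n_half / n, 2))
```

The second call asks for half the half width, and the ratio it prints is the quadrupling rule in action. Lopsided marginals change p_e and therefore the answer, so treat the output as a planning figure rather than a guarantee.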

Bootstrap when the marginals are lopsided

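The closed form standard errors are large sample approximations, and they are least reliable when the marginals are skewed and some cells of the confusion matrix are nearly empty. Real classrooms cluster around one or two rubric levels. A 2000 iteration nonparametric bootstrap on the confusion matrix, resampling essays with replacement and recomputing Kappa each time, produces intervals that respect the actual distribution and takes seconds in R or Python.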
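A percentile bootstrap of this kind might look like the sketch below. It assumes unweighted Cohen's Kappa, scores coded as integers from 0 to n_levels - 1, and a simple percentile interval. The helper names cohen_kappa and bootstrap_kappa_ci are hypothetical, and the synthetic scores at the bottom exist only to make the example runnable. They are not pilot data.

```python
import numpy as np


def cohen_kappa(human, model, n_levels):
    """Unweighted Cohen's Kappa from two integer label arrays in 0..n_levels-1."""
    confusion = np.zeros((n_levels, n_levels))
    np.add.at(confusion, (human, model), 1)   # count each (human, model) pair
    confusion /= confusion.sum()
    p_o = np.trace(confusion)                                # observed agreement
    p_e = confusion.sum(axis=1) @ confusion.sum(axis=0)      # chance agreement
    if p_e == 1.0:
        return float("nan")   # undefined when every score lands in one level
    return (p_o - p_e) / (1.0 - p_e)


def bootstrap_kappa_ci(human, model, n_levels, n_boot=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval, resampling essays (score pairs) with replacement."""
    human = np.asarray(human)
    model = np.asarray(model)
    rng = np.random.default_rng(seed)
    n = len(human)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # one resample of essay indices
        stats[b] = cohen_kappa(human[idx], model[idx], n_levels)
    alpha = 1.0 - confidence
    lo, hi = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return cohen_kappa(human, model, n_levels), lo, hi


# Synthetic, lopsided example: most essays sit at rubric level 2.
demo_rng = np.random.default_rng(42)
human = demo_rng.choice(4, size=300, p=[0.05, 0.15, 0.60, 0.20])
disagree = demo_rng.random(300) < 0.15        # 15% of AI scores drawn at random
model = np.where(disagree, demo_rng.choice(4, size=300), human)

kappa, lo, hi = bootstrap_kappa_ci(human, model, n_levels=4)
print(f"kappa={kappa:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Resampling whole essays keeps each human score paired with its AI score, which is what makes the interval reflect the observed confusion matrix. For an ordinal rubric a weighted Kappa may be the better statistic, and the same resampling loop would apply to it.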

Practical minimums

  • Formative feedback tools, low stakes: 100 essays per subgroup you care about.
  • Summative grading: 300 essays minimum, stratified by rubric level and subgroup.
  • High stakes admission or promotion: 500 essays and a second independent human rater.

None of these are magic. They are the sample sizes at which confidence intervals stop being embarrassing.
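To see what each of these minimums buys in precision, a rough sketch can turn a sample size into an approximate 95 percent half width. It assumes a true Kappa of 0.80, a 4 level rubric with balanced marginals, and the simple large sample standard error for unweighted Kappa. The function name kappa_half_width and its defaults are illustrative.

```python
import math
from statistics import NormalDist


def kappa_half_width(n_essays, kappa=0.80, n_levels=4, confidence=0.95):
    """Approximate CI half width for Cohen's Kappa at a given sample size.

    Assumptions (illustrative): balanced marginals and the simple large
    sample standard error sqrt(p_o (1 - p_o) / n) / (1 - p_e).
    """
    p_e = 1.0 / n_levels                      # chance agreement with balanced marginals
    p_o = p_e + kappa * (1.0 - p_e)           # observed agreement implied by Kappa
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    se = math.sqrt(p_o * (1.0 - p_o) / n_essays) / (1.0 - p_e)
    return z * se


tiers = [
    ("formative, per subgroup", 100),
    ("summative", 300),
    ("high stakes", 500),
]
for label, n_essays in tiers:
    print(f"{label:>24}: n={n_essays:<4} half width ~ {kappa_half_width(n_essays):.3f}")
```

Under those assumptions the half widths come out at roughly 0.09, 0.05, and 0.04. Skewed marginals or a lower true Kappa would widen them.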
