How large a sample do you need before you trust an AI grader

Every AI grading pilot begins with the same question: how many essays do we need to score before we know whether the tool works? The answer is usually larger than the vendor deck suggests, and it depends on the effect size you care about, not on a round number.

Start with the effect you care about

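If the tool needs to reach a Kappa of 0.80 to clear your policy, and the pilot returns 0.78, is that a fail or a rounding error? That depends on how wide the confidence interval around the estimate is. Sian and Fleiss (1980) give closed-form standard errors for Kappa. For a four-level rubric with balanced marginals and a target Kappa of 0.80, chance agreement is 0.25 and observed agreement is 0.85, so under the usual large-sample approximation a 95 percent confidence interval half-width of 0.05 needs roughly 350 essays. Because the half-width shrinks with the square root of the sample size, halving the interval means quadrupling the sample.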
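A back-of-the-envelope version of this calculation can be sketched in Python. It uses the simplest large-sample approximation to the standard error of Kappa, not necessarily the exact closed form cited above, and it assumes balanced marginals so that chance agreement is one over the number of rubric levels. The function names are illustrative only. Other variance formulas and marginal assumptions move the answer, so treat the output as a check on the order of magnitude, not as a policy number.

```python
import math


def kappa_se(p_o, p_e, n):
    """Approximate large-sample standard error of Cohen's kappa.

    p_o: observed agreement, p_e: chance agreement, n: number of essays.
    """
    return math.sqrt(p_o * (1.0 - p_o) / n) / (1.0 - p_e)


def essays_needed(kappa, n_levels, half_width, z=1.96):
    """Essays needed for a confidence interval of the given half-width.

    Assumption: balanced marginals, so chance agreement is 1 / n_levels.
    """
    p_e = 1.0 / n_levels
    # Invert kappa = (p_o - p_e) / (1 - p_e) to get the observed agreement.
    p_o = kappa * (1.0 - p_e) + p_e
    # Variance of kappa for a single essay; divide by n for the sample variance.
    unit_variance = p_o * (1.0 - p_o) / (1.0 - p_e) ** 2
    return math.ceil(unit_variance * (z / half_width) ** 2)


n = essays_needed(kappa=0.80, n_levels=4, half_width=0.05)
n_half = essays_needed(kappa=0.80, n_levels=4, half_width=0.025)
print(f"half-width 0.05:  {n} essays")
print(f"half-width 0.025: {n_half} essays ({n_half / n:.1f}x as many)")
```

The second call shows the square-root rule at work: halving the half-width multiplies the required sample by about four.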

Bootstrap when the marginals are lopsided

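The analytic closed forms are large-sample approximations and behave best when scores are spread across the rubric levels. Real classrooms cluster around one or two rubric levels. A nonparametric bootstrap with 2,000 resamples of the paired scores, rebuilding the confusion matrix and recomputing Kappa each time, produces intervals that respect the actual distribution and takes seconds in R or Python.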
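One way to run such a bootstrap is sketched below with NumPy. It resamples essays with replacement, recomputes Kappa from each resampled confusion matrix, and reports a percentile interval. The function names, the percentile method, and the synthetic lopsided data are assumptions made for illustration; real pilot scores would replace the generated arrays.

```python
import numpy as np


def cohen_kappa(human, ai, n_levels):
    """Unweighted Cohen's kappa from two integer score arrays (0 .. n_levels-1)."""
    cm = np.zeros((n_levels, n_levels))
    np.add.at(cm, (human, ai), 1)      # build the confusion matrix
    cm /= cm.sum()                     # convert counts to proportions
    p_o = np.trace(cm)                 # observed agreement
    p_e = cm.sum(axis=1) @ cm.sum(axis=0)  # chance agreement from the marginals
    if p_e == 1.0:                     # degenerate resample: one level only
        return np.nan
    return (p_o - p_e) / (1.0 - p_e)


def bootstrap_kappa_ci(human, ai, n_levels, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap interval for kappa."""
    rng = np.random.default_rng(seed)
    human = np.asarray(human)
    ai = np.asarray(ai)
    n = len(human)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample essays with replacement
        stats[b] = cohen_kappa(human[idx], ai[idx], n_levels)
    lo, hi = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return cohen_kappa(human, ai, n_levels), lo, hi


# Synthetic, lopsided pilot: most essays land on levels 1 and 2 (illustrative only).
rng = np.random.default_rng(42)
probs = [0.05, 0.60, 0.30, 0.05]
human = rng.choice(4, size=300, p=probs)
noise = rng.choice(4, size=300, p=probs)
ai = np.where(rng.random(300) < 0.85, human, noise)

kappa, lo, hi = bootstrap_kappa_ci(human, ai, n_levels=4)
print(f"kappa = {kappa:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

Resampling whole essays keeps each human score paired with its AI score, which is what lets the interval reflect the real joint distribution.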

Practical minimums

Formative feedback tools, low stakes. 100 essays per subgroup you care about.
Summative grading. 300 essays minimum, stratified by rubric level and subgroup.
High stakes admission or promotion. 500 essays and a second independent human rater.

None of these are magic. They are the sample sizes at which confidence intervals stop being embarrassing.
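Stratifying a pilot sample by rubric level and subgroup can be sketched with the standard library alone. The example below assumes each essay is a dictionary with hypothetical "level" and "subgroup" keys and splits the target evenly across strata; equal allocation is an illustrative choice, and proportional allocation may suit some pilots better.

```python
import math
import random
from collections import defaultdict


def stratified_sample(essays, total, seed=0):
    """Draw roughly `total` essays, split evenly over (level, subgroup) strata."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for essay in essays:
        strata[(essay["level"], essay["subgroup"])].append(essay)
    # Round up so the overall sample never falls below the target.
    per_stratum = math.ceil(total / len(strata))
    sample = []
    for key in sorted(strata):
        pool = strata[key]
        # A thin stratum contributes everything it has.
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample


# Synthetic pool: 4 rubric levels x 2 subgroups, 100 essays each (illustrative).
pool = [
    {"id": f"{level}-{group}-{i}", "level": level, "subgroup": group}
    for level in range(4)
    for group in ("A", "B")
    for i in range(100)
]
pilot = stratified_sample(pool, total=300)
print(len(pilot), "essays drawn")
```

If a stratum holds fewer essays than its share, the total falls short of the target, which is itself a signal that the subgroup needs more data before the pilot can say anything about it.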
