The Kappa Statistic: what agreement really means when an AI grades a student

Every AI grading vendor eventually shows you a slide that says the model agrees with human raters "over 90 percent of the time." It sounds decisive. It usually is not. Percent agreement is the weakest reliability number a researcher can report, because it gives full credit for the times two raters agreed by pure luck. The correction that took over psychology, medical diagnostics, and content analysis after 1960 is Jacob Cohen's Kappa. If you are buying or building an AI grader, this is the number you should be asking for.

Why percent agreement lies

Imagine a two point rubric: pass or fail. If a teacher passes 80 percent of essays and an AI passes 80 percent of essays, they will agree on about 68 percent of essays even if the AI hands out its passes at random: 0.8 × 0.8 for the essays both pass, plus 0.2 × 0.2 for the essays both fail. Raise the base rate to 95 percent pass and random agreement climbs above 90 percent. The "accuracy" you see on a vendor slide is often mostly this base rate effect, not skill.
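
The base rate arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes the two raters decide independently of each other; the function name chance_agreement is made up for illustration.

```python
def chance_agreement(pass_rate_a: float, pass_rate_b: float) -> float:
    """Agreement expected from two independent pass/fail raters."""
    both_pass = pass_rate_a * pass_rate_b
    both_fail = (1 - pass_rate_a) * (1 - pass_rate_b)
    return both_pass + both_fail


# Same pass rate for the teacher and the AI, at three base rates.
for rate in (0.50, 0.80, 0.95):
    agreement = chance_agreement(rate, rate)
    print(f"pass rate {rate:.2f}: chance agreement {agreement:.3f}")
```

At a 50 percent pass rate chance agreement is a coin flip; the more lopsided the pass rate, the more agreement a random grader collects for free.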

Cohen introduced Kappa in his 1960 paper "A Coefficient of Agreement for Nominal Scales," published in Educational and Psychological Measurement, to fix exactly this. Kappa asks a simple question: how much better than chance did the two raters do?

The formula, in one line

κ = (p_o − p_e) / (1 − p_e)

Here p_o is the observed agreement and p_e is the agreement you would expect if both raters were guessing according to their own marginal rates. When the two raters agree only as much as chance predicts, Kappa is 0. Perfect agreement gives 1. Systematic disagreement can push Kappa below 0, which is a rarer but real signal that the raters are pulling in opposite directions.
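
A minimal sketch of the formula in plain Python, assuming the scores arrive as two equal-length lists of labels. The function name cohen_kappa and the four-essay toy data are illustrative only, chosen so the three regimes (1, 0, below 0) are easy to verify by hand.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # p_o: share of items on which the two raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected from each rater's own marginal rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if p_e == 1.0:
        raise ValueError("Kappa is undefined when both raters use one label")
    return (p_o - p_e) / (1 - p_e)


teacher = ["pass", "pass", "fail", "fail"]
print(cohen_kappa(teacher, ["pass", "pass", "fail", "fail"]))  # identical scores
print(cohen_kappa(teacher, ["pass", "fail", "pass", "fail"]))  # agreement at chance level
print(cohen_kappa(teacher, ["fail", "fail", "pass", "pass"]))  # every score reversed
```

The three calls return 1.0, 0.0, and -1.0 respectively: in each case p_e is 0.5, and only p_o changes.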

A worked example on a real rubric

Take a 200 essay sample scored on a four point analytic rubric. The teacher and the AI both mark most essays a 3, some 2s and 4s, and a handful of 1s. The confusion matrix looks like this:

Teacher \ AI	1	2	3	4	Row total
1	6	3	1	0	10
2	2	28	10	0	40
3	0	7	82	11	100
4	0	0	12	38	50
Column total	8	38	105	49	200

Confusion matrix, n = 200 essays, four point analytic rubric

Observed agreement is the diagonal: (6 + 28 + 82 + 38) / 200 = 0.77. That is the number a vendor would print. Expected agreement uses the row and column totals: (10·8 + 40·38 + 100·105 + 50·49) / 200² = 14550 / 40000 = 0.36375. Plugging in, Kappa comes out to (0.77 − 0.36375) / (1 − 0.36375) = 0.639. A "77 percent match" only just clears the line between moderate and substantial agreement once you strip out the luck. That is a very different pitch.
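
The same calculation can be run straight from the confusion matrix, which is a useful check on hand arithmetic. A sketch in plain Python, with rows as the teacher's scores and columns as the AI's; the function name kappa_from_matrix is illustrative.

```python
def kappa_from_matrix(matrix):
    """Return (p_o, p_e, kappa) for a square matrix of counts."""
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Observed agreement: the diagonal as a share of all essays.
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Expected agreement: products of matching row and column totals.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# The 200 essay matrix from the table above (teacher rows, AI columns).
matrix = [
    [6, 3, 1, 0],
    [2, 28, 10, 0],
    [0, 7, 82, 11],
    [0, 0, 12, 38],
]
p_o, p_e, kappa = kappa_from_matrix(matrix)
print(f"p_o = {p_o:.4f}, p_e = {p_e:.5f}, kappa = {kappa:.3f}")
```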

Fig. 1. Rater comparison as a confusion matrix.

Interpreting the number: Landis and Koch, cautiously

The benchmarks most reviewers cite come from Landis and Koch (1977) in Biometrics:

< 0.00 poor (worse than chance)
0.00 to 0.20 slight
0.21 to 0.40 fair
0.41 to 0.60 moderate
0.61 to 0.80 substantial
0.81 to 1.00 almost perfect

These labels are convenient but the authors called them arbitrary in the same paper. In high stakes settings such as admissions, licensure, or grade promotion, a Kappa of 0.61 is not "good enough." McHugh (2012), writing in Biochemia Medica, argues that the Landis and Koch labels are too lenient for health research: she treats any Kappa below 0.60 as inadequate agreement, reserves "strong" for 0.80 to 0.90, and calls only values above 0.90 "almost perfect." Classroom formative feedback has more slack. Summative grading does not.

Weighted Kappa: when a near miss is not a miss

Analytic rubrics are ordinal, not nominal. A 4 scored as a 3 is a small error. A 4 scored as a 1 is a disaster. Plain Kappa treats both as the same disagreement. Cohen's 1968 follow up introduced weighted Kappa, which penalises larger gaps more heavily. Quadratic weights, where the penalty grows with the square of the distance, are standard for rubric scoring and are supported by the common reliability packages: psych::cohen.kappa in R reports a quadratic weighted Kappa by default, while sklearn.metrics.cohen_kappa_score in Python is unweighted unless you pass weights="quadratic". For any rubric with three or more ordered levels, weighted Kappa is the honest choice.
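
To make the weighting concrete, here is a sketch of weighted Kappa computed from a confusion matrix in plain Python. It uses the disagreement-weight form, 1 minus the ratio of observed to expected weighted disagreement, with weight |i − j|² for quadratic weighting. The function name weighted_kappa and the power parameter are illustrative, not a library API.

```python
def weighted_kappa(matrix, power=2):
    """Weighted Kappa for ordered categories.

    power=2 gives quadratic weights, power=1 linear weights.
    The usual 1 / (k - 1)**power scaling is omitted because it
    cancels in the ratio below.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    observed_penalty = 0.0
    expected_penalty = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) ** power  # 0 on the diagonal, grows with the gap
            observed_penalty += weight * matrix[i][j]
            expected_penalty += weight * row_totals[i] * col_totals[j] / n
    return 1 - observed_penalty / expected_penalty


# The 200 essay matrix from the worked example (teacher rows, AI columns).
matrix = [
    [6, 3, 1, 0],
    [2, 28, 10, 0],
    [0, 7, 82, 11],
    [0, 0, 12, 38],
]
print(f"quadratic weighted kappa = {weighted_kappa(matrix):.3f}")
```

On this matrix the quadratic weighted value comes out near 0.80, noticeably higher than the unweighted figure, because almost every disagreement is a one point near miss.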

More than two raters: Fleiss and Krippendorff

Cohen's Kappa is defined for exactly two raters. Real pilots almost never look like that. When three or more teachers score the same essays, the usual choices are Fleiss' Kappa (1971) and Krippendorff's Alpha; when some essays are missing a score from one or more raters, Alpha is the one designed to cope. Alpha, popularised by Klaus Krippendorff in Content Analysis: An Introduction to Its Methodology, handles any number of raters, any level of measurement, and missing data. It has become the reporting standard in Communication Methods and Measures and much of the human coding literature. If a vendor benchmark averages three or more human graders, ask for Alpha, not an average of pairwise Kappas.
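
For the complete-data case, Fleiss' Kappa is short enough to sketch directly. The version below treats categories as nominal and assumes every essay has the same number of ratings, which is exactly the restriction Alpha lifts. The function name fleiss_kappa and the four-essay, three-rater data are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for nominal categories.

    `ratings` holds one list per essay, containing the labels given by
    each rater. Every essay must have the same number of ratings.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every essay needs the same number (>= 2) of ratings")
    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this essay.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_bar = sum(per_item_agreement) / n_items
    total_ratings = n_items * n_raters
    p_e = sum((t / total_ratings) ** 2 for t in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical scores: four essays, three teachers, rubric levels 1 to 3.
scores = [
    [1, 1, 1],
    [2, 2, 3],
    [3, 3, 3],
    [1, 2, 2],
]
print(f"Fleiss' kappa = {fleiss_kappa(scores):.3f}")
```

Two essays have full agreement and two have a two-to-one split, and each level is used equally often overall, so the result is 0.5.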

The Kappa paradoxes: why the number can lie back

Feinstein and Cicchetti published two influential critiques in the Journal of Clinical Epidemiology in 1990. They showed that Kappa can collapse toward 0 even when observed agreement is very high, if the class distribution is heavily skewed. This is the "high agreement, low Kappa" paradox. It appears constantly in AI grading pilots because most essays cluster around one or two rubric levels. The fix is not to abandon Kappa but to report it alongside the marginal distributions and the confusion matrix, so a reader can see whether the low Kappa reflects real disagreement or just a lopsided sample.
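
The paradox is easy to reproduce. The sketch below compares two hypothetical pass/fail samples of 100 essays with identical 90 percent observed agreement: one where nearly everything passes, one split evenly. The matrices and the function name agreement_and_kappa are invented for illustration.

```python
def agreement_and_kappa(matrix):
    """Return (observed agreement, kappa) for a square matrix of counts."""
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / n
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return p_o, (p_o - p_e) / (1 - p_e)


# Rows: teacher pass, teacher fail. Columns: AI pass, AI fail.
skewed = [[90, 5], [5, 0]]     # 95 percent of essays pass for both raters
balanced = [[45, 5], [5, 45]]  # passes and fails split evenly

for name, matrix in (("skewed", skewed), ("balanced", balanced)):
    p_o, kappa = agreement_and_kappa(matrix)
    print(f"{name}: agreement {p_o:.2f}, kappa {kappa:.3f}")
```

Both samples show 0.90 agreement, but Kappa is about 0.80 for the balanced sample and slightly below zero for the skewed one, where the raters never agree on a single fail. Printing the marginals next to Kappa is what lets a reader tell these cases apart.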

What to demand from an AI grading vendor

Report Cohen's Kappa or Krippendorff's Alpha, not raw agreement.
Use quadratic weighted Kappa for any ordinal rubric.
Publish the confusion matrix and the marginal distributions.
Give a 95 percent confidence interval. A Kappa of 0.72 with an interval of [0.55, 0.87] is not the same product as 0.72 with [0.69, 0.75]. Bootstrap intervals are cheap and appropriate.
Break the number down by rubric row. Overall Kappa hides the row that fails.
Report per subgroup: English learners, IEP students, longer responses. A model can look "substantial" overall and still fail one population.
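
The confidence interval item is the one vendors most often skip, so here is one way it could be done: a percentile bootstrap that resamples essays with replacement and recomputes Kappa each time. This is a sketch under stated assumptions, not a prescribed method. The function names, the 2000 resamples, and the fixed seed are all arbitrary choices, and the data are the 200 essays from the worked example expanded back into score pairs.

```python
import random
from collections import Counter


def cohen_kappa(pairs):
    """Unweighted Cohen's Kappa for a list of (teacher, ai) score pairs."""
    n = len(pairs)
    p_o = sum(t == a for t, a in pairs) / n
    teacher_freq = Counter(t for t, _ in pairs)
    ai_freq = Counter(a for _, a in pairs)
    p_e = sum(teacher_freq[s] * ai_freq[s] for s in teacher_freq) / n**2
    return (p_o - p_e) / (1 - p_e)


def bootstrap_kappa_ci(pairs, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for Kappa, resampling whole essays."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(pairs, k=len(pairs))
        try:
            estimates.append(cohen_kappa(sample))
        except ZeroDivisionError:
            continue  # degenerate resample: both raters used a single score
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * len(estimates))]
    upper = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return lower, upper


# Expand the 4 x 4 confusion matrix into 200 (teacher, ai) pairs.
matrix = [
    [6, 3, 1, 0],
    [2, 28, 10, 0],
    [0, 7, 82, 11],
    [0, 0, 12, 38],
]
pairs = [
    (t + 1, a + 1)
    for t, row in enumerate(matrix)
    for a, count in enumerate(row)
    for _ in range(count)
]

low, high = bootstrap_kappa_ci(pairs)
print(f"kappa = {cohen_kappa(pairs):.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

The same loop works for a per-row or per-subgroup breakdown: filter the pairs first, then bootstrap each subset. Expect the intervals to widen sharply for small subgroups, which is itself worth reporting.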

The bottom line

Percent agreement is a marketing number. Cohen's Kappa, weighted where the rubric is ordinal and extended to Alpha where more than two raters are involved, is the minimum a serious buyer should accept. The math is more than sixty years old and the tooling is trivial. There is no excuse for a vendor slide that does not carry it.

Primary references

Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37 to 46.
Cohen, J. (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213 to 220.
Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159 to 174.
Feinstein, A. R., and Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543 to 549.
McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia Medica, 22(3), 276 to 282.
Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology, 4th ed. SAGE.