โ† Blog

Jul 2, 2026 · 8 min read

Overall Kappa can be strong while one student group fails

A worked example on why you must report reliability by subgroup, and a simple protocol any pilot can adopt.

Written by

Gradelab
methodology, fairness, pilots, research

A single overall Kappa can hide exactly what a fairness reviewer cares about: how well the tool agrees with human raters within each student group. In the 2023 Stanford HAI evaluation of automated essay scoring, several tools reached quadratic weighted Kappa above 0.75 in aggregate while dropping below 0.55 for English learners on the same rubric.

The protocol

  1. Define subgroups before scoring. IEP status, English learner status, and response length quartile are the standard three.
  2. Score the full sample.
  3. Compute Kappa within each subgroup, with a bootstrap 95 percent confidence interval.
  4. Publish the whole table. Do not average.
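
The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names, the `(subgroup, human_score, machine_score)` record layout, the percentile bootstrap, and the sample records are all assumptions made for the example. If scikit-learn is available, `cohen_kappa_score(..., weights="quadratic")` could likely replace the hand-written Kappa function.

```python
import math
import random
from collections import defaultdict


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Quadratic weighted Kappa for two equal-length lists of integer scores."""
    n = len(human)
    k = max_score - min_score + 1
    if n == 0 or k < 2:
        return float("nan")
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, machine):
        observed[h - min_score][m - min_score] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2      # quadratic disagreement weight
            num += w * observed[i][j]            # observed disagreement
            den += w * row[i] * col[j] / n       # disagreement expected by chance
    return float("nan") if den == 0 else 1.0 - num / den


def bootstrap_interval(human, machine, min_score, max_score, n_boot=1000, seed=0):
    """Percentile bootstrap 95 percent interval, resampling essays with replacement."""
    rng = random.Random(seed)
    n = len(human)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        kappa = quadratic_weighted_kappa(
            [human[i] for i in idx], [machine[i] for i in idx], min_score, max_score
        )
        if not math.isnan(kappa):                # skip degenerate resamples
            stats.append(kappa)
    if not stats:
        return (float("nan"), float("nan"))
    stats.sort()
    return (stats[int(0.025 * (len(stats) - 1))], stats[int(0.975 * (len(stats) - 1))])


def kappa_table(records, min_score, max_score, n_boot=1000):
    """Build one table row per subgroup plus an 'overall' row (never averaged)."""
    groups = defaultdict(lambda: ([], []))
    for subgroup, h, m in records:
        for key in ("overall", subgroup):
            groups[key][0].append(h)
            groups[key][1].append(m)
    table = {}
    for key, (h, m) in groups.items():
        table[key] = {
            "n": len(h),
            "kappa": quadratic_weighted_kappa(h, m, min_score, max_score),
            "ci95": bootstrap_interval(h, m, min_score, max_score, n_boot),
        }
    return table


# Hypothetical records on a 1-4 rubric, invented purely to show the call shape.
records = [
    ("not EL", 1, 1), ("not EL", 2, 2), ("not EL", 3, 3), ("not EL", 4, 4),
    ("EL", 1, 2), ("EL", 2, 2), ("EL", 3, 2), ("EL", 4, 3),
]
for name, row in kappa_table(records, 1, 4, n_boot=200).items():
    print(name, row["n"], row["kappa"], row["ci95"])
```

The sketch assumes each record carries a single subgroup label, so a pilot would build one table per subgroup variable (IEP status, English learner status, response length quartile). Small subgroups produce wide intervals, which is a reason to publish the interval next to every point estimate.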

What passes and what fails

If any subgroup Kappa is more than 0.10 lower than the overall Kappa, that is a fairness flag and requires a written mitigation before the tool goes live. This is the rule adopted by the New York City Department of Education for its 2024 essay tools review.
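
As a rough sketch, the flag rule reduces to a single comparison per subgroup. The function name, the rounding step, and the Kappa values below are hypothetical and chosen only for illustration. The gap is rounded to four decimals before comparison so that floating-point noise does not flag a subgroup sitting exactly 0.10 below the overall value.

```python
def fairness_flags(overall_kappa, subgroup_kappas, max_gap=0.10):
    """Return {subgroup: gap} for subgroups more than max_gap below overall Kappa."""
    flags = {}
    for name, kappa in subgroup_kappas.items():
        gap = round(overall_kappa - kappa, 4)   # rounding guards against float noise
        if gap > max_gap:                       # strictly "more than", per the rule
            flags[name] = gap
    return flags


# Hypothetical values, not results from any real evaluation.
flags = fairness_flags(0.78, {"English learners": 0.54, "IEP": 0.74})
for name, gap in flags.items():
    print(f"{name}: {gap:.2f} below overall, written mitigation required")
```

This compares point estimates only, as the rule is stated. Whether the bootstrap interval should also inform the flag is a policy choice the rule does not settle.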
