Overall Kappa can be strong while one student group fails

A single overall Kappa can hide exactly what a fairness reviewer cares about: how well the tool agrees with human raters for each student group. In the 2023 Stanford HAI evaluation of automated essay scoring, several tools reached quadratic weighted Kappa above 0.75 in aggregate while dropping below 0.55 for English Learners on the same rubric.

The protocol

Define subgroups before scoring. IEP status, English learner status, and response length quartile are the standard three.
Score the full sample.
Compute Kappa within each subgroup, using the same Kappa variant as the overall figure, and attach a 95 percent bootstrap confidence interval.
Publish the whole table. Do not average the subgroup values into a single number.
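
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes integer rubric scores, quadratic weighted Kappa as the agreement metric, and a simple percentile bootstrap. The function names, the record layout (subgroup, human score, machine score), and the default of 1000 resamples are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Quadratic weighted Kappa between two equal-length lists of integer scores."""
    k = max_score - min_score + 1
    n = len(a)
    if n == 0 or k < 2:
        return float("nan")
    # Observed confusion matrix: rows are rater a, columns are rater b.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    # den is zero when both raters used one identical score for every essay.
    return 1.0 - num / den if den > 0 else float("nan")


def bootstrap_interval(a, b, min_score, max_score, n_boot=1000, seed=0):
    """Percentile bootstrap 95 percent interval for Kappa (resamples essays)."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        val = quadratic_weighted_kappa(
            [a[i] for i in idx], [b[i] for i in idx], min_score, max_score
        )
        if val == val:  # skip resamples where Kappa is undefined (NaN)
            stats.append(val)
    if not stats:
        return float("nan"), float("nan")
    stats.sort()
    low = stats[int(0.025 * (len(stats) - 1))]
    high = stats[int(0.975 * (len(stats) - 1))]
    return low, high


def subgroup_kappa_table(records, min_score, max_score, n_boot=1000, seed=0):
    """records: iterable of (subgroup, human_score, machine_score) tuples."""
    groups = defaultdict(lambda: ([], []))
    for group, human, machine in records:
        groups[group][0].append(human)
        groups[group][1].append(machine)
    table = {}
    for group, (human, machine) in sorted(groups.items()):
        low, high = bootstrap_interval(
            human, machine, min_score, max_score, n_boot, seed
        )
        table[group] = {
            "n": len(human),
            "kappa": quadratic_weighted_kappa(human, machine, min_score, max_score),
            "ci_low": low,
            "ci_high": high,
        }
    return table
```

Because a student can belong to more than one subgroup, one way to use this sketch is to call `subgroup_kappa_table` once per subgroup dimension (IEP status, English learner status, response length quartile) and publish every resulting row. Small subgroups will produce wide intervals, which is information worth showing rather than hiding.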

What passes and what fails

If any subgroup Kappa is more than 0.10 lower than the overall Kappa, that is a fairness flag and requires a written mitigation before the tool goes live. This is the rule adopted by the New York City Department of Education for its 2024 essay tools review.
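
As a rough sketch, the rule above reduces to one comparison per subgroup. The function name `fairness_flags`, the dictionary layout, and the Kappa values in the usage lines are illustrative assumptions, not figures from any published review. The gap is rounded before comparison so that floating-point noise does not flag a subgroup sitting exactly at 0.10.

```python
def fairness_flags(overall_kappa, subgroup_kappas, max_gap=0.10):
    """Return {subgroup: gap} for every subgroup whose Kappa is more than
    max_gap below the overall Kappa. An empty dict means no flags."""
    flags = {}
    for name, kappa in subgroup_kappas.items():
        # Round to absorb floating-point error at the 0.10 boundary.
        gap = round(overall_kappa - kappa, 6)
        if gap > max_gap:
            flags[name] = gap
    return flags


# Illustrative values only (hypothetical tool, hypothetical subgroups).
example = fairness_flags(0.76, {"english_learners": 0.54, "iep": 0.70})
print(example)  # {'english_learners': 0.22}
```

A subgroup exactly 0.10 below the overall value is not flagged here, matching the "more than 0.10 lower" wording. This check uses point estimates only; how to treat a subgroup whose bootstrap interval straddles the threshold is a policy decision the sketch does not make.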
