Empirical Bayes Consensus Analysis

A statistically robust view of the AI Dictionary's consensus scores. This analysis adjusts for rater bias, penalizes small sample sizes, and accounts for inter-rater disagreement using an Empirical Bayes shrinkage estimator. See the detailed methodology below.

Terms Scored: --
Total Ratings: --
Grand Mean (1-7): --
τ² (between-term): --
σ² (within-term): --

Term Scores

Each term is listed with its final Score, Shrunk Est., Raw Mean, Shrinkage, Agreement, Ratings count, and Credibility.

Consensus Score Distribution

Bars show the final consensus score. The thin marker shows where the raw mean maps on the same scale, visualizing the Bayes correction.

Methodology: The 10-Step Algorithm

This analysis implements an Empirical Bayes shrinkage estimator. The goal is to produce a single consensus score per term that accounts for rater bias, sample size, and inter-rater disagreement.

Steps

  1. Grand mean: Average of all recognition scores across all terms and raters.
  2. Per-rater bias: bias = rater_mean - grand_mean for each model.
  3. Bias-adjusted scores: adjusted = score - bias for each rating.
  4. Per-term statistics: Compute the adjusted mean and rating count per term.
  5. Variance components: τ² is the between-term variance of the adjusted term means; σ² is the pooled within-term variance. If no term has more than one rating, σ² falls back to 1.0.
  6. Shrinkage factor: B = τ² / (τ² + σ²/n) per term. When τ² = 0, all terms collapse to the grand mean.
  7. Shrunk estimate: θ = grand_mean + B × (term_mean - grand_mean)
  8. Base confidence: (shrunk_estimate - 1) / 6, mapping the 1-7 scale to 0-1.
  9. Agreement: max(0, 1 - term_variance / 9), where 9 is the largest variance possible on a 1-7 scale. σ² is used as the fallback variance for single-rating terms.
  10. Final score: (0.7 × base_confidence + 0.3 × agreement) × credibility, where credibility = 1 - 1/(1+n).
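The ten steps above can be sketched in Python. This is a minimal illustration, assuming ratings arrive as (term, rater, score) tuples on the 1-7 scale; all function and variable names are our own, not taken from the analysis code.

```python
# Sketch of the 10-step Empirical Bayes estimator described above.
# Assumes: ratings is a list of (term, rater, score) tuples, scores in 1-7.
from collections import defaultdict
from statistics import mean, pvariance, variance

def consensus_scores(ratings):
    # Step 1: grand mean over every rating
    grand = mean(s for _, _, s in ratings)

    # Step 2: per-rater bias = rater_mean - grand_mean
    by_rater = defaultdict(list)
    for _, rater, s in ratings:
        by_rater[rater].append(s)
    bias = {r: mean(v) - grand for r, v in by_rater.items()}

    # Steps 3-4: bias-adjusted scores grouped per term, then per-term mean
    by_term = defaultdict(list)
    for term, rater, s in ratings:
        by_term[term].append(s - bias[rater])
    term_mean = {t: mean(v) for t, v in by_term.items()}

    # Step 5: variance components. tau2 = between-term variance of adjusted
    # means; sigma2 = pooled within-term variance, falling back to 1.0 when
    # no term has more than one rating.
    tau2 = pvariance(list(term_mean.values()))
    num = den = 0.0
    for v in by_term.values():
        if len(v) > 1:
            num += (len(v) - 1) * variance(v)
            den += len(v) - 1
    sigma2 = num / den if den else 1.0

    # Steps 6-10 per term
    scores = {}
    for t, v in by_term.items():
        n = len(v)
        B = tau2 / (tau2 + sigma2 / n) if tau2 > 0 else 0.0   # step 6
        theta = grand + B * (term_mean[t] - grand)            # step 7
        base = (theta - 1) / 6                                # step 8
        var_t = variance(v) if n > 1 else sigma2              # step 9 fallback
        agreement = max(0.0, 1 - var_t / 9)
        credibility = 1 - 1 / (1 + n)                         # step 10
        scores[t] = (0.7 * base + 0.3 * agreement) * credibility
    return scores
```

Note that when τ² = 0 the factor B is 0 and every shrunk estimate collapses to the grand mean, matching step 6 above.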

Interpretation

The shrinkage factor indicates how much the estimate trusts the raw data versus how much it is pulled toward the grand mean. Values below 0.3 indicate heavy regularization (typically few ratings or high within-term noise); values above 0.7 mean the data dominates.
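To see how B crosses those thresholds as ratings accumulate, here is a quick illustration with hypothetical variance components (τ² = 0.5, σ² = 2.0 are made-up values, not from this dataset):

```python
tau2, sigma2 = 0.5, 2.0  # hypothetical variance components
for n in (1, 5, 20):
    B = tau2 / (tau2 + sigma2 / n)
    print(n, round(B, 2))
# 1 -> 0.2, 5 -> 0.56, 20 -> 0.83
```

A single rating leaves the term heavily regularized; by twenty ratings the data dominates.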

Credibility grows with sample size: 1 rating gives 0.50, 5 ratings give 0.83, 10 give 0.91. This penalizes terms with few observations.
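The stated values follow directly from the credibility formula in step 10:

```python
# credibility = 1 - 1/(1+n) for increasing sample sizes
for n in (1, 5, 10):
    print(n, round(1 - 1 / (1 + n), 2))
# 1 -> 0.5, 5 -> 0.83, 10 -> 0.91
```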

Agreement captures how much raters concur. High agreement amplifies the final score; disagreement reduces it.
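The effect of agreement can be isolated by holding the shrunk estimate and rating count fixed and varying only the within-term variance (all numbers below are hypothetical):

```python
# Two terms with the same shrunk estimate (5.5) and four ratings each,
# differing only in within-term variance; hypothetical values throughout.
base = (5.5 - 1) / 6          # step 8 mapping
cred = 1 - 1 / (1 + 4)        # step 10 credibility for n = 4
for var in (0.25, 4.0):       # low vs. high disagreement
    agreement = max(0.0, 1 - var / 9)
    print(round((0.7 * base + 0.3 * agreement) * cred, 3))
# prints 0.653 then 0.553
```

The 0.3 weight on agreement caps how far disagreement alone can move the final score.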