A statistically robust view of the AI Dictionary's consensus scores. This analysis adjusts for rater bias, penalizes small sample sizes, and accounts for inter-rater disagreement using an Empirical Bayes shrinkage estimator. See the detailed methodology below.
| Term | Score | Shrunk Est. | Raw Mean | Shrinkage | Agreement | Ratings | Credibility |
|---|---|---|---|---|---|---|---|
Bars show the final consensus score. The thin marker shows where the raw mean maps on the same scale, visualizing the Bayes correction.
This analysis implements an Empirical Bayes shrinkage estimator. The goal is to produce a single consensus score per term that accounts for rater bias, sample size, and inter-rater disagreement.
The estimator proceeds in the following steps:

1. **Rater bias:** `bias = rater_mean - grand_mean` for each model.
2. **Bias adjustment:** `adjusted = score - bias` for each rating.
3. **Variance components:** estimate τ² (between-term variance of adjusted means) and σ² (pooled within-term variance). If no term has more than one rating, σ² = 1.0 as a fallback.
4. **Shrinkage factor:** `B = τ² / (τ² + σ²/n)` per term. When τ² = 0, all terms collapse to the grand mean.
5. **Shrunk estimate:** `θ = grand_mean + B × (term_mean - grand_mean)`.
6. **Base confidence:** `(shrunk_estimate - 1) / 6`, mapping the 1-7 scale to 0-1.
7. **Agreement:** `max(0, 1 - term_variance / 9)`, using σ² as a fallback for single-rating terms.
8. **Final score:** `(0.7 × base_confidence + 0.3 × agreement) × credibility`, where `credibility = 1 - 1/(1+n)`.

The shrinkage factor indicates how much the estimate trusts the raw data versus pulling toward the grand mean: values below 0.3 indicate heavy regularization (few ratings); above 0.7, the data dominates.
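The steps above can be sketched end-to-end. This is a minimal illustration under stated assumptions, not the site's actual implementation: the function name `consensus_scores` and the `(term, rater, score)` input shape are invented here, and τ² is taken as the plain sample variance of the adjusted term means.

```python
from collections import defaultdict
from statistics import mean

def consensus_scores(ratings):
    """Empirical Bayes consensus score per term.

    ratings: iterable of (term, rater, score) tuples, scores on a 1-7 scale.
    Returns {term: final_score in [0, 1]}.
    """
    ratings = list(ratings)
    grand_mean = mean(s for _, _, s in ratings)

    # Steps 1-2: estimate each rater's bias and subtract it from their scores.
    by_rater = defaultdict(list)
    for _, rater, s in ratings:
        by_rater[rater].append(s)
    bias = {r: mean(v) - grand_mean for r, v in by_rater.items()}

    by_term = defaultdict(list)
    for term, rater, s in ratings:
        by_term[term].append(s - bias[rater])
    term_means = {t: mean(v) for t, v in by_term.items()}

    # Step 3: variance components.
    tms = list(term_means.values())
    tau2 = sum((m - mean(tms)) ** 2 for m in tms) / (len(tms) - 1) if len(tms) > 1 else 0.0
    ss = dof = 0
    for t, v in by_term.items():
        if len(v) > 1:
            ss += sum((x - term_means[t]) ** 2 for x in v)
            dof += len(v) - 1
    sigma2 = ss / dof if dof else 1.0  # fallback when no term has >1 rating

    scores = {}
    for t, v in by_term.items():
        n = len(v)
        # Steps 4-5: shrink the adjusted term mean toward the grand mean.
        B = tau2 / (tau2 + sigma2 / n) if tau2 > 0 else 0.0
        theta = grand_mean + B * (term_means[t] - grand_mean)
        # Steps 6-8: map to 0-1, then fold in agreement and credibility.
        base = (theta - 1) / 6
        var_t = sum((x - term_means[t]) ** 2 for x in v) / (n - 1) if n > 1 else sigma2
        agreement = max(0.0, 1 - var_t / 9)
        credibility = 1 - 1 / (1 + n)
        scores[t] = (0.7 * base + 0.3 * agreement) * credibility
    return scores
```

Note how a single-rating term is shrunk hardest: with n = 1, B = τ²/(τ² + σ²) is smallest, and its agreement falls back to σ² because its own variance is undefined.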
Credibility grows with sample size: 1 rating gives 0.50, 5 ratings give 0.83, 10 give 0.91. This penalizes terms with few observations.
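Those three values follow directly from `credibility = 1 - 1/(1+n)`; a quick check (the function name here is illustrative):

```python
def credibility(n: int) -> float:
    # 1 - 1/(1 + n): starts at 0.5 for a single rating, approaches 1 as n grows
    return 1 - 1 / (1 + n)

for n in (1, 5, 10):
    print(n, round(credibility(n), 2))
# → 1 0.5
# → 5 0.83
# → 10 0.91
```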
Agreement captures how much raters concur. High agreement amplifies the final score; disagreement reduces it.
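For instance, holding the base confidence and rating count fixed, a contested term scores lower than a unanimous one. A small check, assuming only the agreement and final-score formulas above (the function name and the sample variances are made up for illustration):

```python
def final_score(base: float, term_variance: float, n: int) -> float:
    # Agreement discounts the score as within-term variance grows;
    # the variance is scaled by 9 per the agreement formula above.
    agreement = max(0.0, 1 - term_variance / 9)
    credibility = 1 - 1 / (1 + n)
    return (0.7 * base + 0.3 * agreement) * credibility

unanimous = final_score(base=0.8, term_variance=0.5, n=5)  # raters concur
contested = final_score(base=0.8, term_variance=6.0, n=5)  # raters disagree
assert unanimous > contested
```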