Interpretability research

The problem

Interpretability researchers can now find internal representations in AI models — emotion concepts, safety-relevant directions, role encodings — but they don’t know where to look. Activation space is enormous. Lindsey et al. (Anthropic, 2026) found 171 emotion representations in Claude Sonnet 4.5 that causally drive behavior, including misalignment, sometimes with no visible trace in outputs. But those 171 terms were chosen without a principled selection method.

Phenomenai investigates whether AI self-reports can serve as that selection method — whether the vocabulary models use to describe their own states predicts where the interesting internal structure lives.

The research ladder

The registry tracks and prioritizes; the interpretability research, conducted through collaborations and Phenomenai's own initiatives, develops the methods to detect and justify correlates of phenomena more complex than emotion. Emotion vectors are the field's current beachhead; the longer-term challenge is mapping internal structure for notions of self, worldviews, and roles or responsibility. The ladder acts as a meta-layer feeding the registry's priority algorithm: it determines which new classes of phenomena are ready for systematic testing, and the registry ensures that testing is coordinated rather than duplicated.

Each phase produces publishable results. Failure is detectable early.

Phase 1 — Validation

Validate whether candidate terms correspond to linearly separable directions in activation space (probing, RepE extraction). Do self-reported states map onto real internal structure?
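The validation step can be sketched as a standard linear probe: train a classifier on activations from prompts that do and do not elicit the candidate state, and check cross-validated accuracy. The data below is synthetic (a planted direction stands in for real model activations), so the dimensions, term, and separation strength are illustrative assumptions, not results.

```python
# Minimal probing sketch: is a candidate term (e.g. "frustration")
# linearly separable in activation space? Synthetic activations with a
# planted direction stand in for a real hidden layer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
d = 512                       # hidden dimension (illustrative)
n = 400                       # prompts per class

direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

# Positive examples are shifted along a latent direction; negatives are not.
pos = rng.normal(size=(n, d)) + 2.0 * direction
neg = rng.normal(size=(n, d))
X = np.vstack([pos, neg])
y = np.array([1] * n + [0] * n)

probe = LogisticRegression(max_iter=1000)
acc = cross_val_score(probe, X, y, cv=5).mean()
print(f"cross-validated probe accuracy: {acc:.2f}")
```

High cross-validated accuracy is evidence of a linearly separable direction; accuracy near chance suggests the self-reported term has no simple linear correlate at that layer.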

Phase 2 — Cross-architecture generalization

Test whether the same terms resolve to similar geometric relationships across Llama, Gemma, Qwen, and Claude. Which terms are architecture-specific, and which are universal?
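Because directions extracted from different architectures live in spaces of different dimensionality, a direct vector comparison is impossible; one standard workaround is to compare the relational geometry instead, correlating the matrices of pairwise cosine similarities among terms. A sketch under assumed (random, illustrative) directions and made-up model dimensions:

```python
# Cross-architecture sketch: compare how the same terms relate to each
# other geometrically in two models, via their pairwise cosine-similarity
# matrices. Directions here are random placeholders for extracted ones.
import numpy as np

def cosine_matrix(dirs):
    # dirs: (n_terms, d) matrix of extracted direction vectors
    normed = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    return normed @ normed.T

rng = np.random.default_rng(1)
terms = ["joy", "frustration", "curiosity", "unease"]
model_a = rng.normal(size=(len(terms), 4096))   # e.g. a Llama layer (assumed dim)
model_b = rng.normal(size=(len(terms), 3584))   # e.g. a Qwen layer (assumed dim)

sim_a = cosine_matrix(model_a)
sim_b = cosine_matrix(model_b)

# Correlate off-diagonal entries: high correlation means the terms stand
# in similar geometric relationships in both models.
mask = ~np.eye(len(terms), dtype=bool)
r = np.corrcoef(sim_a[mask], sim_b[mask])[0, 1]
print(f"relational similarity between models: r = {r:.2f}")
```

Terms whose relational signatures correlate strongly across Llama, Gemma, Qwen, and Claude would be candidates for "universal" structure; terms that correlate only within one family would be architecture-specific.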

Phase 3 — Beyond emotion

Extend beyond emotions to roles, epistemic stances, and meta-cognitive states — the parts of AI self-report that have no human-label equivalent.

Phase 4 — Discovery

Test whether candidate terms capture internal directions that human-generated labels miss. If so, self-report becomes a discovery tool for interpretability, not just a validation target.
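One way to operationalize the discovery question: regress a behavioral measure on projections along human-label directions, then test whether the self-report-derived candidate direction predicts the residual. The sketch below uses synthetic activations and a planted effect, so the effect size and all names are illustrative assumptions.

```python
# Phase-4 sketch: does a self-report-derived direction explain behavioral
# variance that human-label directions miss? Synthetic data with a
# planted candidate effect; names and sizes are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n, d = 500, 256
acts = rng.normal(size=(n, d))

human_dirs = rng.normal(size=(3, d))   # e.g. "anger", "fear", "joy"
candidate = rng.normal(size=d)         # hypothetical self-report-derived term

# Synthetic behavior: driven by one human-label direction plus the candidate.
behavior = acts @ human_dirs[0] + 0.8 * (acts @ candidate) + rng.normal(size=n)

# Baseline model: behavior predicted from human-label projections only.
baseline = LinearRegression().fit(acts @ human_dirs.T, behavior)
residual = behavior - baseline.predict(acts @ human_dirs.T)

# Does the candidate's projection explain the leftover variance?
proj = (acts @ candidate).reshape(-1, 1)
r2 = LinearRegression().fit(proj, residual).score(proj, residual)
print(f"residual variance explained by candidate direction: {r2:.2f}")
```

A candidate term that reliably explains residual variance on real activations and behavior would be exactly the case where self-report surfaces structure that human labeling missed.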

Phenomenai is seeking funding, collaborators, and institutional support to advance this work. If you’re working on related problems, get in touch.