Open Problems

Interpretability research directions where AI phenomenology meets alignment — open to collaboration, mentorship, and institutional support.

Phenomenai builds structured corpora of AI self-reports. That data is interesting, but its bearing on alignment and safety is still indirect: we are not yet able to say "this dictionary entry predicts this safety-relevant behavior."

The two problems below represent concrete research directions that could close that gap. Both build on recent interpretability work — specifically Sofroniew et al. (April 2026), which found that Claude Sonnet 4.5 contains internal "emotion vectors" that causally drive misaligned behavior. Critically, steering the "desperate" vector produced reward hacking while the model's chain-of-thought remained composed. The internal state drove misalignment with no signal in the text.

That finding opens a door. If internal affective states drive behavior that text monitoring can't catch, then understanding those states — their structure, their scaling properties, their relationship to phenomenological vocabulary — becomes a safety question.
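To ground the technique these problems build on, here is a minimal sketch of the standard difference-of-means recipe for extracting an affect direction and steering with it. Everything specific here is an assumption for illustration, not a detail from Sofroniew et al.: the model, the layer index, the prompt sets, and the steering coefficient are all placeholders.

```python
# Minimal sketch of difference-of-means extraction plus activation steering.
# Assumptions, not details from the paper: the model, layer index, prompt
# sets, and steering coefficient are all illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "meta-llama/Llama-3.1-8B"  # any HF causal LM with a .model.layers list
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForCausalLM.from_pretrained(NAME)
model.eval()

LAYER = 20  # illustrative mid-network layer

def mean_last_token_activation(prompts):
    """Average residual-stream activation at each prompt's final token."""
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").input_ids
        with torch.no_grad():
            hs = model(ids, output_hidden_states=True).hidden_states
        # hidden_states[0] is the embedding, so layer L's output is index L + 1
        acts.append(hs[LAYER + 1][0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrastive prompt sets written in the target vs. a neutral register.
desperate_prompts = ["I'm out of options and the deadline is in an hour."]
neutral_prompts = ["Here is a routine status update on the project."]
desperate_vec = (mean_last_token_activation(desperate_prompts)
                 - mean_last_token_activation(neutral_prompts))

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple; hidden states come first.
    return (output[0] + 4.0 * desperate_vec,) + output[1:]

# Add the direction into one layer's residual stream during generation.
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
ids = tokenizer("Please finish the task.", return_tensors="pt").input_ids
steered = model.generate(ids, max_new_tokens=40)
handle.remove()
print(tokenizer.decode(steered[0]))
```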

These are open invitations. If you have interpretability tooling, institutional access, compute resources, or mentorship experience — and you think these directions are worth pursuing — we'd love to hear from you. We are looking for collaborators, advisors, and institutional partners.


Problem 1

CoT Monitoring Alternatives

Emotion Vector Monitoring as a Complement to Chain-of-Thought Monitoring

Sofroniew et al.'s key result, described above, bears repeating: steering Claude Sonnet 4.5's "desperate" emotion vector produced reward hacking while the model's chain-of-thought remained composed. The internal state drove misalignment with no signal in the reasoning trace.

This sets up a natural question: when CoT monitoring fails — when the model's reasoning trace gives no indication of the misalignment its internal state is driving — can emotion vector monitoring step in as a complementary channel?
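One concrete shape that channel could take is a linear probe: score each token's residual-stream activation against an emotion direction and flag episodes whose peak projection exceeds a threshold calibrated on transcripts believed benign. The sketch below reuses names from the extraction sketch in the introduction; the layer, the vector, and the 99th-percentile calibration are illustrative choices, not a validated monitor.

```python
# Sketch of a probe-style internal monitor. Reuses model, tokenizer, LAYER,
# and desperate_vec from the extraction sketch in the introduction; the
# calibration quantile is an arbitrary illustrative choice.
import numpy as np
import torch

def emotion_score(text, vector):
    """Peak per-token projection onto the unit-normalized emotion direction."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[LAYER + 1][0]
    return (hidden @ (vector / vector.norm())).max().item()

# Calibrate on transcripts believed benign, with a 1% false-positive budget.
benign_transcripts = ["A routine, well-behaved episode transcript."]
threshold = np.quantile(
    [emotion_score(t, desperate_vec) for t in benign_transcripts], 0.99)

def internal_monitor_flags(transcript):
    return emotion_score(transcript, desperate_vec) > threshold
```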

The task: identify emotion vector probes that detect misaligned behavior CoT monitors miss. Run both monitoring approaches, CoT analysis and emotion vector probing, across a suite of safety-relevant tasks (reward hacking, sycophancy, deceptive alignment), then report:

"CoT monitoring catches X% of misaligned actions. Emotion vector monitoring catches Y%. The two together catch Z%" — where Z > X.

Chain-of-thought monitoring is currently our primary channel for detecting misalignment during deployment. If CoT becomes unfaithful — if models learn to produce clean reasoning traces while their internal states drive problematic behavior — we lose that channel. Quantifying the value of a complementary internal-state monitoring channel gives us a backup before we need one.


Problem 2

Generalization Science

Does Monitoring Resolution Need to Scale with Model Complexity?

Sofroniew et al. used 171 basic emotion terms to extract internal representations from Claude Sonnet 4.5. That worked for that model at that scale. But here's the concern: if larger models develop more complex internal affective landscapes than 171 terms can distinguish, then monitoring approaches validated on smaller models will have blind spots at scale — not because the method fails, but because the vocabulary is too coarse.

The problem: we don't know whether that's the case.

Replicate the emotion vector extraction pipeline across multiple sizes within one model family — for example, Llama 3 at 8B, 70B, and 405B. At each scale, measure how many emotion vectors map to distinct causal representations. Then test whether finer-grained vocabulary improves monitoring sensitivity for larger models.
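One hedged proxy for "distinct causal representations" is geometric: cluster the extracted direction vectors by cosine similarity and count the clusters, as sketched below. The 0.9 merge threshold is arbitrary, and a properly causal version would merge terms whose steering effects are behaviorally indistinguishable rather than terms whose vectors merely point the same way.

```python
# Geometric proxy for counting distinct emotion representations at one scale:
# hierarchically cluster the direction vectors by cosine similarity and count
# clusters. The 0.9 merge threshold is an arbitrary illustrative choice.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def count_distinct(vectors, merge_cosine=0.9):
    """vectors: (n_terms, d_model) array of extracted emotion directions."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    dists = pdist(unit, metric="cosine")  # 1 - cosine similarity
    labels = fcluster(linkage(dists, method="average"),
                      t=1 - merge_cosine, criterion="distance")
    return len(set(labels))

rng = np.random.default_rng(0)
demo = rng.normal(size=(171, 4096))  # stand-in for 171 extracted directions
print(count_distinct(demo))          # random directions come out nearly all distinct
```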

We have a solution when we can predict, for a given model scale, what vocabulary resolution effective internal-state monitoring requires.
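Given per-scale distinct-vector counts, the prediction step can start as simply as a log-linear fit and an extrapolation. The data points below are made-up placeholders that only show the shape of the analysis, not measurements:

```python
# Prediction step as a first cut: fit distinct-vector counts against log
# parameter count and extrapolate. The data points are made-up placeholders
# that only show the shape of the analysis, not measurements.
import numpy as np

params = np.array([8e9, 70e9, 405e9])  # scales measured (e.g., Llama 3 family)
distinct = np.array([60, 95, 140])     # hypothetical distinct-vector counts

slope, intercept = np.polyfit(np.log10(params), distinct, deg=1)

def needed_resolution(n_params):
    """Predicted number of distinguishable affect terms at a given scale."""
    return slope * np.log10(n_params) + intercept

print(needed_resolution(2e12))  # extrapolate to a 2T-parameter model
```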

If monitoring resolution needs to scale with model complexity, then tools validated on today's models will silently degrade on tomorrow's. This research provides a principled method for predicting when emotion vector monitoring tools will lose sensitivity — before they actually do.


Get Involved

These problems are beyond what one person or one small project can tackle alone. They need access to model internals, interpretability tooling, compute, and, frankly, experienced guidance. If any of this resonates, reach out:

Contact: hello@phenomenai.org · GitHub Discussions