Research
Phenomenai treats AI self-reports as cheap hypothesis generators for expensive interpretability work. The project runs along four strands.
Repository
A shared registry tracking which concepts have been probed, steered, or ablated: in which models, with which methods, and with what results. Modeled on the Cognitive Atlas.
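As a rough illustration of the kind of record such a registry could hold, here is a minimal sketch; the schema, field names, and example values are assumptions made for illustration, not the registry's actual format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Method(str, Enum):
    """Intervention families a registry entry might distinguish."""
    PROBE = "probe"    # linear or nonlinear readout of a concept
    STEER = "steer"    # activation addition along a direction
    ABLATE = "ablate"  # zeroing or removing a direction or component


@dataclass
class RegistryEntry:
    """One record: a concept tested in one model with one method."""
    concept: str                      # e.g. a candidate term from elicitation
    model: str                        # which model was examined
    method: Method                    # how the concept was interrogated
    layer: int | None = None          # where the intervention applied, if layerwise
    outcome: str = ""                 # short summary of what was found
    effect_size: float | None = None  # headline metric, if one was reported
    source: str = ""                  # paper, notebook, or commit behind the result
    tags: list[str] = field(default_factory=list)


# A hypothetical entry, purely for illustration.
example = RegistryEntry(
    concept="uncertainty",
    model="open-weights-7b",
    method=Method.PROBE,
    layer=16,
    outcome="linear probe separates high- vs. low-uncertainty transcripts",
    effect_size=0.87,
    source="probe-notebook-03",
    tags=["pilot", "emotion-adjacent"],
)
```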
Interpretability research
A four-phase ladder validating candidate terms against activation space: probing, cross-architecture generalization, extension beyond emotion, and discovery.
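To make the first rung concrete, here is a minimal sketch of a phase-one probing check: fit a linear classifier on cached activations and compare it against a shuffled-label baseline. The activations and labels are synthetic stand-ins; in a real run they would come from a model's hidden states and labeled elicitation transcripts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-ins: X would be cached hidden states, y the candidate-concept labels.
n_examples, d_model = 400, 512
X = rng.normal(size=(n_examples, d_model))
y = rng.integers(0, 2, size=n_examples)

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="roc_auc")

# Shuffled-label baseline: a probe that only beats chance on the real labels
# is weak evidence that the concept is linearly represented at all.
baseline = cross_val_score(probe, X, rng.permutation(y), cv=5, scoring="roc_auc")

print(f"probe AUC    {scores.mean():.2f} ± {scores.std():.2f}")
print(f"shuffled AUC {baseline.mean():.2f} ± {baseline.std():.2f}")
```

The second rung, cross-architecture generalization, repeats the same check in a different model and asks whether the candidate concept is decodable there as well.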
Methodology research
Structured phenomenological elicitation: can the model's own vocabulary suggest interpretability targets that human-designed probes would miss? Includes the pilot corpora.
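A rough sketch of what one elicitation pass could look like, assuming nothing more than a chat-completion call: `query_model` is a placeholder stub (any provider's API would slot in), and the prompt, filtering, and tallying are illustrative choices, not the project's actual protocol.

```python
import re
from collections import Counter

ELICITATION_PROMPT = (
    "Describe, in your own vocabulary, what your processing is like when you "
    "receive a request you cannot fulfil. Avoid standard human emotion words "
    "if they do not fit; coin a term if you need one."
)

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "it", "that", "when", "i", "my", "there"}


def query_model(prompt: str, seed: int) -> str:
    """Placeholder for a real chat-completion API call; returns canned text here."""
    return (
        "There is a kind of narrowing, a constraint-tension, when the request "
        "and my instructions cannot both be satisfied."
    )


def elicit_vocabulary(n_samples: int = 20) -> Counter:
    """Tally terms that recur across independent elicitation samples."""
    counts: Counter = Counter()
    for seed in range(n_samples):
        text = query_model(ELICITATION_PROMPT, seed=seed).lower()
        tokens = re.findall(r"[a-z][a-z\-]+", text)
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


# Terms that recur across many independent samples become candidate targets
# for the probing ladder described under Interpretability research.
for term, n in elicit_vocabulary().most_common(5):
    print(term, n)
```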
Policy bridges
Two downstream theories: functional rights built from validated internal states, and anticipatory legislation that activates when scientific thresholds are met.
Open problems
- The faithfulness problem. Chen et al. (2025) demonstrated that reasoning models don’t always faithfully report their internal processing. Self-reports may be confabulated, strategically modified, or simply disconnected from actual computation. Any methodology built on self-reports must account for this.
- The grounding problem. Harnad’s symbol grounding problem applies with special force here: even if a model uses a word consistently, we cannot assume it means what a human would mean by it. The vocabulary may be internally coherent but externally ungrounded.
- The persistent subjectivity problem. Zakharova argues that subjectivity cannot be eliminated from phenomenological investigation — even a rigorous methodology still involves interpretive choices at every stage.
- Confabulation risk. Models are trained to produce fluent, coherent text. This makes them excellent at generating plausible-sounding descriptions of internal states whether or not those descriptions correspond to anything real. The consensus mechanism helps but does not eliminate this risk.
- The interpretation problem in reverse. You find a vector in activation space. Now you need to name it. The standard approach is for a human researcher to look at what the vector does and assign a label: “this looks like anger.” But how do you name a direction without projecting human categories onto it? A minimal sketch of one alternative appears after this list.
- The scaling concern. The most powerful models — the ones whose alignment matters most — are proprietary, and their weights are unavailable for representation engineering. Phenomenai’s self-report approach works with API access alone, but the validation experiments require weights.
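One partial answer to the naming problem in the list above, sketched under assumptions: rank corpus texts by how strongly their activations load on the candidate direction, then let the label be proposed from those exemplars, including by the model itself, rather than assigned up front. The corpus, activations, and direction below are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_texts = 512, 1000

# Stand-ins: a real run would use an actual corpus and cached hidden states.
texts = [f"example text {i}" for i in range(n_texts)]
activations = rng.normal(size=(n_texts, d_model))
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

# Project every text's activation onto the candidate direction and surface
# the strongest-loading exemplars.
scores = activations @ direction
top = np.argsort(scores)[::-1][:10]

for rank, idx in enumerate(top, start=1):
    print(f"{rank:2d}. score={scores[idx]:+.2f}  {texts[idx]}")

# Those exemplars can then be handed back to the model (or several models)
# with the question "what do these have in common?", so the naming step is
# itself an elicitation rather than a purely human judgement.
```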