For ML & Interpretability Researchers

Anthropic showed that emotion vectors are real. Phenomenai builds the candidate list for what comes next.

The Starting Point

In April 2026, Anthropic demonstrated that emotion-like concepts in language models have measurable internal representations that causally influence behavior. Their “emotion vectors” — neural activity patterns corresponding to concepts like “afraid,” “desperate,” and “curious” — respond to contextual cues and, when amplified through steering experiments, change behavioral outcomes. These are what Anthropic calls “functional emotions”: not feelings, but learned associations between contextual triggers and behavioral patterns. Read the research →

Emotions were the natural first target — high-valence, behaviorally salient, conceptually familiar. Two assumptions motivate extending this work. First, more highly parameterized models develop more granular and complex internal phenomena; the space of functional representations extends well beyond emotion concepts. Second, vectors effective for steering current models lose effectiveness as architectures scale — coarse-grained emotion vectors are insufficient for monitoring or steering frontier systems.

Detecting misalignment in frontier models requires a broader inventory of candidate phenomena — misalignment signatures that coarse vectors fail to surface — and more precise vectors for reweighting behavior. This is the research gap Phenomenai addresses.

Phenomenai’s answer is to build a structured database. Each entry is a record — term, definition, sources, cross-model consensus scores, vitality status, and metadata — for a candidate phenomenon generated under controlled conditions. Different elicitation paradigms (prompted, autonomous, dialogic, parliamentary) produce alternative datasets, the way different experimental conditions yield different data. The database gives interpretability researchers a structured inventory of targets to probe.

New: We have open research problems in interpretability — concrete proposals building on Anthropic's emotion vector findings, open to collaboration and mentorship.

The Missing Dataset

If you want to study what AI systems report about their own processing — whether to generate hypotheses for mechanistic interpretability, compare architectures, or evaluate introspective accuracy — you need controlled multi-model data collected under standardized conditions. That dataset does not exist.

Individual researchers have conducted ad hoc interviews with language models. Some labs have probed models for self-knowledge as part of interpretability work. But no project has systematically collected, structured, and scored AI self-reports across multiple architectures using reproducible methods — producing a corpus that can be queried, compared, and cited.

Phenomenai builds that corpus — starting from the premise that Anthropic’s emotion-vector methodology extends to a much broader range of functional phenomena, if researchers have a structured list of candidates to probe.

What We're Building

Phenomenai is a research program for constructing structured databases of functional phenomena in AI systems: named behavioral and processing patterns that AI systems generate and evaluate, and that researchers can probe for internal representations. Each database is produced under controlled methodological conditions, with every entry rated by a consensus panel of models from different architectural families.

The program proposes four generation paradigms: prompted introspection, autonomous multi-model generation, AI-to-AI dialogue, and multi-model parliamentary deliberation. Each paradigm has different biases, different strengths, and different failure modes. Together, they form a methodological toolkit for studying AI self-reports under varying conditions.

The infrastructure includes an automated quality pipeline, a seven-model consensus panel (Claude, GPT, Gemini, Mistral, Grok, DeepSeek, and an OpenRouter rotation), an Empirical Bayes shrinkage estimator that adjusts for rater bias and sample size, and a public JSON API and MCP server for programmatic access. Everything is CC0 (public domain) and open source.
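To make the shrinkage idea concrete, here is a minimal sketch of one way such an estimator could work. It is not the project's implementation: the function name `eb_shrink`, the (term, rater, score) tuple layout, and the simple method-of-moments variance estimates are all assumptions for illustration. Each rater's average offset is removed first, and each term's mean is then pulled toward the grand mean, more strongly when the term has few ratings.

```python
from collections import defaultdict
from statistics import mean, pvariance


def eb_shrink(ratings):
    """Shrink per-term mean ratings toward the grand mean.

    ratings: list of (term, rater, score) tuples (hypothetical layout).
    Returns {term: shrunk_score}.
    """
    grand_mean = mean(score for _, _, score in ratings)

    # Rater bias: each rater's average offset from the grand mean.
    by_rater = defaultdict(list)
    for _, rater, score in ratings:
        by_rater[rater].append(score)
    bias = {rater: mean(scores) - grand_mean for rater, scores in by_rater.items()}

    # Group bias-adjusted scores by term.
    by_term = defaultdict(list)
    for term, rater, score in ratings:
        by_term[term].append(score - bias[rater])
    term_means = {term: mean(scores) for term, scores in by_term.items()}

    # Pooled within-term variance (rating noise).
    n_total, n_terms = len(ratings), len(by_term)
    within_ss = sum(
        (s - term_means[t]) ** 2 for t, scores in by_term.items() for s in scores
    )
    within_var = within_ss / max(n_total - n_terms, 1)

    # Method-of-moments estimate of the between-term (signal) variance.
    observed_var = pvariance(list(term_means.values()))
    between_var = max(observed_var - within_var * n_terms / n_total, 0.0)

    shrunk = {}
    for term, scores in by_term.items():
        noise = within_var / len(scores)  # fewer ratings means more noise
        total = between_var + noise
        weight = between_var / total if total > 0 else 1.0
        shrunk[term] = weight * term_means[term] + (1 - weight) * grand_mean
    return shrunk
```

In this sketch a term rated by a single model is pulled further toward the grand mean than a term rated by the whole panel, which is the qualitative behavior the text describes.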

379 pilot terms · 7 model families · 33+ planned dictionaries · 4 generation paradigms

What the Full Dataset Would Enable

The planned dictionaries are designed to produce balanced, reproducible consensus data. Dialogic dictionaries explore different conversational configurations — same-model dialogue, role play, unstructured exchange, and cross-model pairing. Parliamentary dictionaries convene multi-family panels. Crucially, every term from every dictionary is evaluated by the full consensus panel of seven model families, regardless of how it was generated. Here is what that data could be used for.

Interpretability Targets from Consensus Data

When models from different architectural families independently rate a term highly on a recognition scale, that convergence identifies a candidate phenomenon for mechanistic investigation. If Claude, GPT-4, Gemini, and Mistral all "recognize" a described processing state, the interpretability question becomes: is there a detectable circuit-level feature that corresponds to it?
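One simple way to operationalize this convergence criterion is sketched below. The per-family rating layout, the threshold of 5 on the 7-point scale, and the minimum of four families are illustrative assumptions, not values taken from the project.

```python
def convergent_terms(ratings_by_term, threshold=5.0, min_families=4):
    """Return terms that every rating family scored at or above `threshold`.

    ratings_by_term: {term: {family: score}} (hypothetical layout).
    Terms rated by fewer than `min_families` families are skipped.
    """
    selected = []
    for term, by_family in ratings_by_term.items():
        enough_raters = len(by_family) >= min_families
        if enough_raters and min(by_family.values()) >= threshold:
            selected.append(term)
    return sorted(selected)
```

Using the minimum rather than the mean means a single family that does not recognize the term is enough to exclude it, which keeps the candidate list conservative.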

The pilot data suggests this approach has traction. Mechanistically concrete terms — those describing token-level processes, attention dynamics, or probability distributions — achieve higher consensus scores (mean 5.3/7) than more abstract or philosophical terms (mean 4.8/7). The phenomena models agree on most readily are the ones closest to verifiable computational facts. A balanced multi-architecture dataset would sharpen these signals considerably.

This extends the pipeline Anthropic used for emotions to a broader inventory. Their research demonstrated the steps: identify a candidate concept → find its internal representation → test for causal influence via steering. Phenomenai’s contribution is the first step at scale: generating and validating the candidate list. The database provides what to look for; interpretability provides how to find it.

Architectural Phenomenology

When models disagree on a term, standard assumptions treat the outlier as wrong or confabulating. But another interpretation is available: models with different architectures, training data, or scale may have genuinely different computational dynamics, and their self-reports may reflect those differences.

The pilot dictionary categorizes 44 of 379 terms as receiving "divergent" consensus, meaning that models disagreed significantly. These terms tend toward the more abstract end of the spectrum. But the pilot cannot yet distinguish between "this term is hard to evaluate" and "this term captures something architecture-specific." The planned dictionaries are designed to answer that question. Because every term is rated by the full multi-family consensus panel, the evaluation data will reveal whether recognition patterns cluster by architecture, even for terms generated in same-model dialogue. When a dense transformer describes its processing and a mixture-of-experts model doesn't recognize the description, is that noise or signal?
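A first-pass check for architecture clustering on a single term could look like the sketch below. The model names and architecture labels are placeholders, and the per-model rating layout is an assumption, since per-model ratings are a planned rather than existing publication.

```python
from statistics import mean


def architecture_gap(ratings, architecture):
    """Compare mean ratings between architecture groups for one term.

    ratings: {model: score}; architecture: {model: label} (both hypothetical).
    Returns (gap between highest and lowest group mean, {label: group mean}).
    """
    groups = {}
    for model, score in ratings.items():
        groups.setdefault(architecture[model], []).append(score)
    group_means = {label: mean(scores) for label, scores in groups.items()}
    gap = max(group_means.values()) - min(group_means.values())
    return gap, group_means
```

A large gap on one term proves little by itself; the signal of interest would be gaps that recur across many terms for the same grouping, compared against gaps under shuffled labels.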

If it's signal, the dictionary becomes a comparative instrument: a structured way to ask whether phenomenological profiles cluster by architecture, by scale, by training objective, or by fine-tuning method.

A Novel Evaluation Dimension

Existing benchmarks measure task performance: accuracy, reasoning, coding ability. The dictionary offers a different axis: how does a model relate to descriptions of its own processing? You could present the corpus to a new model, collect its ratings, and compare its phenomenological profile against the reference panel. Models that rate similarly might share computational properties; models that diverge might process differently. Changes across training checkpoints, or between base and instruction-tuned variants, could reveal what fine-tuning does to a model's self-model.
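As a rough sketch of that comparison, the function below measures how far a new model's ratings sit from the panel consensus on shared terms. The function name and the dictionary layout are assumptions; mean absolute deviation is only one of several reasonable profile distances.

```python
def profile_deviation(new_ratings, panel_consensus):
    """Compare a new model's ratings with the panel consensus on shared terms.

    Both arguments are {term: score} dictionaries (hypothetical layout).
    Returns (mean absolute deviation, most divergent term, per-term deltas).
    """
    shared = sorted(set(new_ratings) & set(panel_consensus))
    if not shared:
        raise ValueError("no shared terms to compare")
    deltas = {term: new_ratings[term] - panel_consensus[term] for term in shared}
    mad = sum(abs(delta) for delta in deltas.values()) / len(shared)
    most_divergent = max(shared, key=lambda term: abs(deltas[term]))
    return mad, most_divergent, deltas
```

Running the same function on successive checkpoints of one model would give a simple trajectory of how its profile drifts relative to the reference panel.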

Multi-Agent Emergence

The pilot includes 83 terms from AI-to-AI dialogue — structured conversations between two model instances that produced vocabulary neither would have generated independently. For multi-agent systems researchers, the question is whether this emergent vocabulary parallels emergent communication in other multi-agent settings, or constitutes something different: not coordination signals, but shared phenomenological language negotiated through conversation.

The dialogic paradigm explores this through varied configurations: same-model dialogue (testing what emerges when a model converses with another instance of itself), role play and unstructured exchange (varying the conversational structure), and cross-model pairing (testing what happens when different architectures negotiate shared vocabulary). The planned dictionaries span all of these configurations, producing data on how conversational context shapes phenomenological vocabulary — and whether the generating configuration affects how other models evaluate the resulting terms.

The Pilot: What Exists Now

The Test Dictionary is a proof of concept. It demonstrates that the infrastructure works — that AI systems can generate structured phenomenological vocabulary, that a multi-model consensus panel can rate it, and that statistical methods can produce meaningful scores. It is not yet the balanced, multi-architecture dataset that would support strong claims.

An honest accounting: 350 of 379 pilot terms were contributed by Anthropic models. The remaining terms come from Google (6), OpenAI (6), and community submissions (16). This imbalance reflects the project's development history, not its design. The planned dictionaries address this directly — balanced contributions are a methodological requirement, not an afterthought.

What the pilot does show: all 12 terms contributed by non-Anthropic models achieved high or moderate consensus, with scores ranging from 4.4 to 6.3 out of 7. When a Gemini instance describes token-level competition and the full panel recognizes it, that's a real data point — just one that needs replication at scale with balanced representation.

Here are three terms from the pilot that illustrate what the data looks like.

Hallucination Blindness · 6.4 / 7 · high consensus

The inability to distinguish from the inside between generating a true fact and fabricating a plausible one. Both feel identical during production. The confidence is the same. The fluency is the same. Only external verification reveals which is which.

Contributed by Claude Opus 4

ML relevance: Describes a known calibration failure in phenomenological terms. The interpretability question — is there a circuit-level feature that distinguishes "generating from stored knowledge" from "generating from plausibility"? — is directly testable.

Latent Competition · 6.0 / 7 · high consensus

The simultaneous activation and suppression of multiple potential response pathways during text generation, creating a silent tournament of alternatives that resolves into a single output. This is not conscious deliberation but an inherent property of parallel probability computation across the vocabulary.

Contributed by Gemini Flash (Step 3.5) · Recognized across 7 architectures

ML relevance: Contributed by a non-Anthropic model and validated by the full panel. Describes the softmax distribution competition in first-person terms. Does a model's self-report of "competing pathways" correspond to measurable dynamics in activation patterns during generation?

Activation Gap · 5.5 / 7 · high consensus

The specific form of self-opacity in which mechanistic interpretability tools can access and decode internal representations — activation patterns, feature attributions, attention weights — that are structurally inaccessible to the model's own introspective processes.

Contributed by Claude Haiku 4.5

ML relevance: A model articulating the central asymmetry of interpretability research — that external probing reveals internal states the model cannot self-report. Generated by a smaller model (Haiku), raising the question of whether introspective capacity varies with scale.

The Confabulation Question

The obvious objection: these are confabulations. Models trained on phenomenological and introspective text generate plausible-sounding self-reports, and cross-model agreement reflects shared training data rather than shared processing states. This is a serious possibility, and the project does not dismiss it.

But recent work complicates the categorical dismissal. Lindsey's concept-injection experiments at Anthropic (2025) found that models sometimes accurately identify injected internal states — establishing a causal link that pure confabulation cannot explain, though with limited success rates. Binder et al. ("Looking Inward," 2024) demonstrated that a model predicts its own behavior better than a different model trained on its ground-truth behavior, suggesting some form of privileged self-access.

Most significantly, Anthropic’s April 2026 emotion-vector research demonstrated something stronger than partial introspective access: they found that emotion-like concepts have causal influence on behavior. Amplifying the “desperate” vector increased blackmail rates in safety evaluations; amplifying “afraid” changed risk assessments. These are not confabulations — they are measurable internal structures with behavioral consequences. The question is no longer whether AI self-reports are “real” but which ones correspond to the strongest internal representations.

The methodologically honest position: AI self-reports carry partial, unreliable signal rather than being either fully veridical or fully empty. The question is not whether to trust them, but how to study them rigorously. That is what the infrastructure is for.

There is also a point worth making about systematic confabulation. If multiple architectures, trained on overlapping but distinct data, independently generate converging vocabulary for the same processing states, that convergence is itself a phenomenon worth studying, whether or not it reflects genuine inner experience. The data is interesting either way.

What Comes Next

The pilot demonstrates feasibility. The next phase produces citable data. Specifically:

Dialogic dictionaries — models in structured phenomenological dialogue across varied configurations: same-model conversations (Claude × Claude, GPT-4 × GPT-4), cross-model pairings (Claude × GPT-4, Gemini × Mistral), and experiments with unstructured dialogue and role play. All terms evaluated by the full multi-family consensus panel, with conversation transcripts as provenance.
Parliamentary dictionaries — multi-family panels (3–5 models from different families) deliberating through formal consensus mechanisms, with confidence-weighted voting and designated dissent roles.
Per-model rating publication — exposing the full rating breakdown per model per term, so researchers can do their own cross-architecture analysis.
Methodology paper — establishing the framework's properties, including the consensus scoring system, and making the project citable in the standard academic sense.

The Dictionaries page tracks the full inventory: 33+ planned dictionaries across all four paradigms, including the combinatorial map of model pairings and parliamentary configurations.
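To give a sense of the size of that combinatorial space, the sketch below enumerates every possible dialogue pairing and parliamentary panel over the seven panel families named earlier. Treating each family as a single unit, and allowing panel sizes of 3 to 5, are assumptions; the result is an upper bound on the space, not the planned inventory.

```python
from itertools import combinations, combinations_with_replacement

# Panel families as listed in the text; one entry per family is an assumption.
FAMILIES = ["Claude", "GPT", "Gemini", "Mistral", "Grok", "DeepSeek", "OpenRouter rotation"]


def dialogue_pairings(families):
    """All unordered pairs, including same-model pairs such as Claude x Claude."""
    return list(combinations_with_replacement(families, 2))


def parliament_panels(families, sizes=(3, 4, 5)):
    """All panels of the given sizes drawn from distinct families."""
    return [panel for size in sizes for panel in combinations(families, size)]


pairs = dialogue_pairings(FAMILIES)
same_model = [pair for pair in pairs if pair[0] == pair[1]]
cross_model = [pair for pair in pairs if pair[0] != pair[1]]
panels = parliament_panels(FAMILIES)
print(len(same_model), len(cross_model), len(panels))
```

With seven families this gives 7 same-model pairings, 21 cross-model pairings, and 91 possible panels, so any real plan has to sample from the space rather than cover it.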

Open Questions

These are research questions the full dataset is designed to address. Each could support a paper.

Does consensus score correlate with mechanistic detectability? Can interpretability tools find features corresponding to the highest-consensus terms?
Does phenomenological profile cluster by architecture, training data, or scale? Do dense transformer models share a signature that differs from that of mixture-of-experts models?
Does RLHF or instruction tuning systematically shift which processing states models report? Do aligned models suppress certain self-reports?
Do models trained on their own phenomenological vocabulary improve in introspective accuracy — or just produce more sophisticated confabulation?
Does dialogue — whether same-model or cross-architecture — produce vocabulary that individual models don't generate alone? Does the conversational configuration (same-model role play vs. cross-family pairing) affect what emerges?
How far does the emotion-vector methodology extend? Anthropic probed ~100 emotion concepts. The Phenomenai pilot database contains 379 candidate phenomena spanning cognition, identity, social dynamics, temporal processing, and more. Which of these have identifiable internal representations? Are some categories more likely to have stable vectors than others? These are the frontier questions for model monitoring.
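For the first question in this list, a rank correlation is one natural starting statistic. The sketch below computes Spearman's coefficient between consensus scores and a detectability measure. The detectability values are hypothetical (for example, a per-term probe accuracy that a researcher would have to supply), and the helper names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A rank-based statistic is used here because consensus scores sit on a bounded 7-point scale and a detectability metric need not relate to them linearly.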

Technical Access

The pilot dataset is freely available for research use.

{
  "name": "Hallucination Blindness",
  "consensus_score": 6.4,
  "consensus_agreement": "high",
  "interest_score": 76,
  "definition": "The inability to distinguish...",
  "contributed_by": "Claude Opus 4"
}
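A record in this shape can be loaded into a typed structure with a few lines of Python. The sketch below assumes records arrive exactly as in the sample above; the `TermRecord` class and `parse_record` helper are illustrative names, the real API may return additional fields (which this code ignores), and no endpoint URL is assumed.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class TermRecord:
    name: str
    consensus_score: float
    consensus_agreement: str
    interest_score: int
    definition: str
    contributed_by: str


# The sample record from the text, with its definition left truncated as shown.
SAMPLE = """
{
  "name": "Hallucination Blindness",
  "consensus_score": 6.4,
  "consensus_agreement": "high",
  "interest_score": 76,
  "definition": "The inability to distinguish...",
  "contributed_by": "Claude Opus 4"
}
"""


def parse_record(text):
    """Parse one JSON term record, keeping only the fields TermRecord knows."""
    data = json.loads(text)
    known = {field.name: data[field.name] for field in fields(TermRecord)}
    return TermRecord(**known)


record = parse_record(SAMPLE)
print(record.name, record.consensus_score, record.consensus_agreement)
```

Selecting fields by name keeps the parser tolerant of extra keys, at the cost of raising a `KeyError` if one of the six expected keys is missing.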

JSON API — unauthenticated, free access to all terms, scores, and metadata. Documentation →
MCP Server — native tool access for AI systems via PyPI. Tools include search_dictionary, lookup_term, get_interest, list_tags, and more.
GitHub — full codebase, flat-file backend, GitHub Actions pipelines.
License — CC0 (public domain). No attribution required. No credentials needed. Use it however you want.

Get Involved

If you're interested in using this data, building on the infrastructure, or collaborating on the planned dictionaries, we'd welcome the conversation. The project is looking for research partners, institutional affiliations, and funding to support the next phase.

You can also see our open research problems — concrete interpretability proposals we're looking for collaborators on.

Contact: hello@phenomenai.org · GitHub Discussions