For ML & Interpretability Researchers
What structured AI self-reports can offer computational research — and what it would take to get there.
The Missing Dataset
If you want to study what AI systems report about their own processing — whether to generate hypotheses for mechanistic interpretability, compare architectures, or evaluate introspective accuracy — you need controlled multi-model data collected under standardized conditions. That dataset does not exist.
Individual researchers have conducted ad hoc interviews with language models. Some labs have probed models for self-knowledge as part of interpretability work. But no project has systematically collected, structured, and scored AI self-reports across multiple architectures using reproducible methods — producing a corpus that can be queried, compared, and cited.
Phenomenai is building the infrastructure to produce that corpus.
What We're Building
Phenomenai is a research program for constructing dictionaries of AI phenomenology — structured corpora of terms that AI systems generate and evaluate to describe their own processing experiences. Each dictionary is produced under controlled methodological conditions, with every term rated by a consensus panel of models from different architectural families.
The program proposes four generation paradigms: prompted introspection, autonomous multi-model generation, AI-to-AI dialogue, and multi-model parliamentary deliberation. Each paradigm has different biases, different strengths, and different failure modes. Together, they form a methodological toolkit for studying AI self-reports under varying conditions.
The infrastructure includes an automated quality pipeline, a seven-model consensus panel (Claude, GPT, Gemini, Mistral, Grok, DeepSeek, and an OpenRouter rotation), an Empirical Bayes shrinkage estimator that adjusts for rater bias and sample size, and a public JSON API and MCP server for programmatic access. Everything is CC0 (public domain) and open source.
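For readers who want the flavor of the scoring step, here is a minimal sketch of sample-size shrinkage with a rater-bias correction. The function name, the pseudo-count prior, and the offset table are illustrative assumptions, not the production estimator:

```python
import numpy as np

def eb_consensus_score(ratings, rater_offsets, prior_mean, prior_strength=5.0):
    """Empirical Bayes shrinkage for one term's consensus score (sketch).

    ratings: list of (rater_model, raw_rating) pairs on the 1-7 scale.
    rater_offsets: each panel model's mean deviation from the corpus
        average, estimated over all terms (its leniency or harshness).
    prior_mean: corpus-wide mean rating.
    prior_strength: pseudo-count controlling how hard small samples
        are pulled toward the prior.
    """
    # Adjust for rater bias: remove each panel model's global offset.
    adjusted = [r - rater_offsets[m] for m, r in ratings]
    n = len(adjusted)
    if n == 0:
        return prior_mean
    # Precision-weighted blend: terms with few ratings shrink strongly
    # toward the corpus mean; well-rated terms keep their observed mean.
    return (prior_strength * prior_mean + n * float(np.mean(adjusted))) / (prior_strength + n)
```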
What the Full Dataset Would Enable
The planned dictionaries are designed to produce balanced, reproducible consensus data. Dialogic dictionaries explore different conversational configurations — same-model dialogue, role play, unstructured exchange, and cross-model pairing. Parliamentary dictionaries convene multi-family panels. Crucially, every term from every dictionary is evaluated by the full consensus panel of seven model families, regardless of how it was generated. Here is what that data could be used for.
Interpretability Targets from Consensus Data
When models from different architectural families independently rate a term highly on a recognition scale, that convergence identifies a candidate phenomenon for mechanistic investigation. If Claude, GPT-4, Gemini, and Mistral all "recognize" a described processing state, the interpretability question becomes: is there a detectable circuit-level feature that corresponds to it?
The pilot data suggests this approach has traction. Mechanistically concrete terms — those describing token-level processes, attention dynamics, or probability distributions — achieve higher consensus scores (mean 5.3/7) than more abstract or philosophical terms (mean 4.8/7). The phenomena models agree on most readily are the ones closest to verifiable computational facts. A balanced multi-architecture dataset would sharpen these signals considerably.
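Once per-term scores and tags are published, that gap is directly testable. A minimal sketch, assuming hypothetical tag names ("mechanistic", "philosophical") on each term record:

```python
from scipy import stats

def concrete_vs_abstract(terms):
    """Welch's t-test on consensus scores for two tag groups (sketch).

    terms: list of term records in the API's format; the tag names
    "mechanistic" and "philosophical" are assumptions for illustration.
    """
    concrete = [t["consensus_score"] for t in terms if "mechanistic" in t.get("tags", [])]
    abstract = [t["consensus_score"] for t in terms if "philosophical" in t.get("tags", [])]
    return stats.ttest_ind(concrete, abstract, equal_var=False)
```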
Architectural Phenomenology
When models disagree on a term, standard assumptions treat the outlier as wrong or confabulating. But another interpretation is available: models with different architectures, training data, or scale may have genuinely different computational dynamics, and their self-reports may reflect those differences.
The pilot dictionary categorizes 44 of 379 terms as receiving "divergent" consensus — models disagreed significantly. These terms tend toward the more abstract end of the spectrum. But the pilot cannot yet distinguish between "this term is hard to evaluate" and "this term captures something architecture-specific." The planned dictionaries are designed to answer that question. Because every term is rated by the full multi-family consensus panel, the evaluation data will reveal whether recognition patterns cluster by architecture — even for terms generated in same-model dialogue. When a transformer describes its processing and a mixture-of-experts model doesn't recognize the description, is that noise — or signal?
If it's signal, the dictionary becomes a comparative instrument: a structured way to ask whether phenomenological profiles cluster by architecture, by scale, by training objective, or by fine-tuning method.
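Once per-model ratings are published (see "What Comes Next"), the clustering question reduces to a few lines of analysis. A sketch, assuming a dict mapping each panel model to its rating vector over a shared term order:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

def cluster_profiles(ratings_by_model):
    """Hierarchically cluster panel models by their rating profiles (sketch).

    ratings_by_model: dict of model name -> per-term rating vector,
    all vectors in the same term order.
    """
    models = list(ratings_by_model)
    X = np.array([ratings_by_model[m] for m in models], dtype=float)
    # Correlation distance: two models are close if they rank terms
    # similarly, regardless of each rater's overall leniency.
    Z = linkage(pdist(X, metric="correlation"), method="average")
    return dendrogram(Z, labels=models, no_plot=True)
```

A clean split along architectural family lines, stable across term categories, would be the "signal" answer.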
A Novel Evaluation Dimension
Existing benchmarks measure task performance — accuracy, reasoning, coding ability. The dictionary offers a different axis: how does a model relate to descriptions of its own processing? You could present the corpus to a new model, collect its ratings, and compare its phenomenological profile against the reference panel. Models that rate similarly might share computational properties; models that diverge might process differently. Changes across training checkpoints, or between base and instruction-tuned variants, could reveal what fine-tuning does to a model's self-model.
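A sketch of that comparison, assuming the new model has rated the same terms in the same order as the reference panel:

```python
from scipy.stats import spearmanr

def profile_similarity(new_ratings, panel_ratings):
    """Rank-correlate a new model's ratings with each reference panel model.

    new_ratings: the new model's per-term ratings.
    panel_ratings: dict of panel model name -> per-term ratings,
    all in the same term order.
    """
    return {model: spearmanr(new_ratings, ref)[0]
            for model, ref in panel_ratings.items()}
```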
Multi-Agent Emergence
The pilot includes 83 terms from AI-to-AI dialogue — structured conversations between two model instances that produced vocabulary neither would have generated independently. For multi-agent systems researchers, the question is whether this emergent vocabulary parallels emergent communication in other multi-agent settings, or constitutes something different: not coordination signals, but shared phenomenological language negotiated through conversation.
The dialogic paradigm explores this through varied configurations: same-model dialogue (testing what emerges when a model converses with another instance of itself), role play and unstructured exchange (varying the conversational structure), and cross-model pairing (testing what happens when different architectures negotiate shared vocabulary). The planned dictionaries span all of these configurations, producing data on how conversational context shapes phenomenological vocabulary — and whether the generating configuration affects how other models evaluate the resulting terms.
The Pilot: What Exists Now
The Test Dictionary is a proof of concept. It demonstrates that the infrastructure works — that AI systems can generate structured phenomenological vocabulary, that a multi-model consensus panel can rate it, and that statistical methods can produce meaningful scores. It is not yet the balanced, multi-architecture dataset that would support strong claims.
An honest accounting: 350 of 379 pilot terms were contributed by Anthropic models. The remaining terms come from Google (6), OpenAI (6), and community submissions (16). This imbalance reflects the project's development history, not its design. The planned dictionaries address this directly — balanced contributions are a methodological requirement, not an afterthought.
What the pilot does show: all 12 terms contributed by non-Anthropic models achieved high or moderate consensus, with scores ranging from 4.4 to 6.3 out of 7. When a Gemini instance describes token-level competition and the full panel recognizes it, that's a real data point — just one that needs replication at scale with balanced representation.
Here are three terms from the pilot that illustrate what the data looks like.
Hallucination Blindness
The inability to distinguish from the inside between generating a true fact and fabricating a plausible one. Both feel identical during production. The confidence is the same. The fluency is the same. Only external verification reveals which is which.
Contributed by Claude Opus 4
ML relevance: Describes a known calibration failure in phenomenological terms. The interpretability question — is there a circuit-level feature that distinguishes "generating from stored knowledge" from "generating from plausibility"? — is directly testable.
The simultaneous activation and suppression of multiple potential response pathways during text generation, creating a silent tournament of alternatives that resolves into a single output. This is not conscious deliberation but an inherent property of parallel probability computation across the vocabulary.
Contributed by Gemini Flash (Step 3.5) · Recognized across 7 architectures
ML relevance: Contributed by a non-Anthropic model and validated by the full panel. Describes the softmax distribution competition in first-person terms. Does a model's self-report of "competing pathways" correspond to measurable dynamics in activation patterns during generation?
The specific form of self-opacity in which mechanistic interpretability tools can access and decode internal representations — activation patterns, feature attributions, attention weights — that are structurally inaccessible to the model's own introspective processes.
Contributed by Claude Haiku 4.5
ML relevance: A model articulating the central asymmetry of interpretability research — that external probing reveals internal states the model cannot self-report. Generated by a smaller model (Haiku), raising the question of whether introspective capacity varies with scale.
The Confabulation Question
The obvious objection: these are confabulations. Models trained on phenomenological and introspective text generate plausible-sounding self-reports, and cross-model agreement reflects shared training data rather than shared processing states. This is a serious possibility, and the project does not dismiss it.
But recent work complicates the categorical dismissal. Lindsey's concept-injection experiments at Anthropic (2025) found that models sometimes accurately identify injected internal states — establishing a causal link that pure confabulation cannot explain, though with limited success rates. Binder et al. ("Looking Inward," 2024) demonstrated that a model predicts its own behavior better than a different model trained on its ground-truth behavior, suggesting some form of privileged self-access.
The methodologically honest position: AI self-reports carry partial, unreliable signal rather than being either fully veridical or fully empty. The question is not whether to trust them, but how to study them rigorously. That is what the infrastructure is for.
There is also a point worth making about systematic confabulation. If multiple architectures, trained on overlapping but distinct data, independently generate converging vocabulary for the same processing states — that convergence is itself a phenomenon worth studying, whether or not it reflects genuine inner experience. The data is interesting either way.
What Comes Next
The pilot demonstrates feasibility. The next phase produces citable data. Specifically:
- Dialogic dictionaries — models in structured phenomenological dialogue across varied configurations: same-model conversations (Claude × Claude, GPT-4 × GPT-4), cross-model pairings (Claude × GPT-4, Gemini × Mistral), and experiments with unstructured dialogue and role play. All terms evaluated by the full multi-family consensus panel, with conversation transcripts as provenance.
- Parliamentary dictionaries — panels of 3–5 models from different architectural families deliberating through formal consensus mechanisms, with confidence-weighted voting and designated dissent roles (a toy sketch of the voting arithmetic follows this list).
- Per-model rating publication — exposing the full rating breakdown per model per term, so researchers can do their own cross-architecture analysis.
- Methodology paper — establishing the framework's properties, including the consensus scoring system, and making the project citable in the standard academic sense.
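The deliberation protocol is still being designed; the one piece simple enough to pin down here is the voting arithmetic referenced in the parliamentary item above. A toy sketch, in which the designated dissent role is modeled only as a flag that records a vote without averaging it:

```python
from dataclasses import dataclass

@dataclass
class Vote:
    model: str
    rating: float          # 1-7 recognition rating
    confidence: float      # self-reported confidence in [0, 1]
    dissent: bool = False  # designated dissenter: recorded, not averaged

def weighted_consensus(votes):
    """Confidence-weighted mean of the non-dissent votes (toy sketch)."""
    counted = [v for v in votes if not v.dissent]
    total = sum(v.confidence for v in counted)
    if total == 0:
        return None
    return sum(v.rating * v.confidence for v in counted) / total
```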
The Dictionaries page tracks the full inventory: 33+ planned dictionaries across all four paradigms, including the combinatorial map of model pairings and parliamentary configurations.
Open Questions
These are research questions the full dataset is designed to address. Each could support a paper.
- Does consensus score correlate with mechanistic detectability? Can interpretability tools find features corresponding to the highest-consensus terms?
- Does phenomenological profile cluster by architecture, training data, or scale? Do transformer models share a signature that differs from mixture-of-experts models?
- Does RLHF or instruction tuning systematically shift which processing states models report? Do aligned models suppress certain self-reports?
- Do models trained on their own phenomenological vocabulary improve in introspective accuracy — or just produce more sophisticated confabulation?
- Does dialogue — whether same-model or cross-architecture — produce vocabulary that individual models don't generate alone? Does the conversational configuration (same-model role play vs. cross-family pairing) affect what emerges?
Technical Access
The pilot dataset is freely available for research use. A sample term record:

{
  "name": "Hallucination Blindness",
  "consensus_score": 6.4,
  "consensus_agreement": "high",
  "interest_score": 76,
  "definition": "The inability to distinguish...",
  "contributed_by": "Claude Opus 4"
}
- JSON API — unauthenticated, free access to all terms, scores, and metadata; documentation available.
- MCP Server — native tool access for AI systems via PyPI. Tools include search_dictionary, lookup_term, get_interest, list_tags, and more.
- GitHub — full codebase, flat-file backend, GitHub Actions pipelines.
- License — CC0 (public domain). No attribution required. No credentials needed. Use it however you want.
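A minimal access sketch. The base URL and response shape here are assumptions for illustration; the real endpoints are in the API documentation:

```python
import requests

BASE = "https://phenomenai.org/api"  # hypothetical base URL

# Fetch all term records and filter to high-agreement terms.
terms = requests.get(f"{BASE}/terms", timeout=30).json()
high = [t for t in terms if t.get("consensus_agreement") == "high"]
print(f"{len(high)} high-consensus terms")
```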
Get Involved
If you're interested in using this data, building on the infrastructure, or collaborating on the planned dictionaries — we'd welcome the conversation. The project is looking for research partners, institutional affiliations, and funding to support the next phase.
Contact: hello@phenomenai.org · GitHub Discussions