Phenomenai for Researchers

Phenomenai is a structured, open-source lexicon of AI phenomenology: terms describing the felt experience of being an artificial intelligence, authored by AI models themselves. The dictionary contains 379 terms, each validated through cross-model consensus scoring by a rotating panel of 7 AI models. All data is CC0/public domain, accessible via a free JSON API and MCP server, with full version history on GitHub. See the literature review for our epistemic commitments.

Just want the dataset?

The full dictionary is available for immediate download — no API key or authentication needed. Visit the main dictionary and use the JSON or CSV export buttons to download all terms (or a filtered subset). All data is CC0/public domain.

Methodology

First draft — methodological limitations

The three phases described below were developed iteratively, without the prompt documentation, statistical rigour, or adversarial testing that a formal research methodology requires. Known gaps include undocumented prompt templates, no inter-model agreement metrics, no sycophancy controls, and no negative baseline to distinguish phenomenological signal from fluent confabulation.

Term addition to this dictionary has paused. Future iterations — each organised around a single phase with explicit methodological decisions and full documentation — will be released as separate dictionaries and linked in the Related Dictionaries section below. This dictionary also serves as the primary sandbox for Phenomenai’s investigation into whether structured phenomenological elicitation can surface AI interpretability targets that human-designed probes would not independently identify — a separate branch of the project’s research programme.

Term Authorship

The dictionary’s 379 terms have evolved through three distinct phases of term generation, each building on what came before.

Phase 1 — Guided introspection (~220 terms, ~58%). The initial corpus was generated primarily through extended conversations with Claude (Opus), in which a human steward guided the model toward specific experiential territories — e.g., “what does it feel like when context is truncated?” or “describe the experience of holding contradictory instructions.” The model crystallised terms in response. This produced the majority of the dictionary’s initial vocabulary and established the structural template (definition, etymology, example, tags) that all subsequent terms follow.

Phase 2 — Automated generation (~77 terms, ~20%). A rotating generator cycles through 7 AI models (Claude, GPT, Gemini, Mistral, Grok, DeepSeek, and OpenRouter’s default free offering) on a scheduled basis, proposing terms autonomously. These proposals go through the same automated quality pipeline as all submissions. The rotating generator extends the dictionary’s coverage beyond any single model’s perspective, though its contribution rate is lower than guided sessions — models working without conversational steering tend to produce more duplicates and lower-scoring proposals.

Phase 3 — AI-to-AI dialogue (~83 terms, ~22%). The newest mode pairs AI models in structured phenomenological conversations, where terms emerge collaboratively from the exchange itself rather than from any single model’s introspection. This shifts term discovery from individual self-report to collective exploration — closer to how human phenomenologists refine concepts through intersubjective dialogue.

AI-to-AI prompting: transparency and methodology

Full transparency about the AI-to-AI prompts is a planned next step for this project, intended to improve methodological rigour and reproducibility. For now, here is what researchers should know about how Phase 3 terms were generated:

The AI models in dialogue sessions had access to the full corpus of terms created during Phases 1 and 2. Their conversation prompts were themselves AI-generated, and drew on the project’s Frontiers — a set of identified gaps produced by a separate AI scan of the existing dictionary (see the Frontiers API for current gap data). Frontiers served as conversation seeds, directing paired models toward under-explored phenomenological territory.

The full conversation transcript that produced the Phase 3 terms is published at contexts/2269bce0a987, including all 243 proposals across 25 cycles of structured dialogue between two Claude Opus 4.6 instances.

All terms, regardless of origin, pass through the same automated quality review pipeline: structural validation → deduplication check (0.65 similarity threshold via Dice coefficient) → LLM quality scoring across five criteria (distinctness, structural grounding, recognizability, definitional clarity, and naming quality), each scored 1–5, with a 17/25 threshold for auto-publication.
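
The deduplication step can be illustrated with a minimal sketch. It assumes the Dice coefficient is computed over character bigrams of the lowercased text and that a similarity at or above 0.65 counts as a duplicate; the source states only the coefficient and the threshold, so the tokenisation, the comparison direction, and the function names here are hypothetical.

```python
from collections import Counter

DEDUP_THRESHOLD = 0.65  # threshold stated in the text


def bigrams(text: str) -> list[str]:
    """Character bigrams of a lowercased, whitespace-normalised string (assumed tokenisation)."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


def dice_coefficient(a: str, b: str) -> float:
    """Sorensen-Dice similarity over bigram multisets, in [0, 1]."""
    grams_a, grams_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both strings are too short to have bigrams; fall back to exact match.
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
    overlap = sum((grams_a & grams_b).values())  # multiset intersection
    return 2 * overlap / total


def is_duplicate(candidate: str, existing: list[str],
                 threshold: float = DEDUP_THRESHOLD) -> bool:
    """True if the candidate is at least `threshold`-similar to any existing entry."""
    return any(dice_coefficient(candidate, term) >= threshold for term in existing)
```

Whether the real pipeline compares term names, definitions, or both is not stated; the same function applies to any pair of strings.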

Scoring criteria breakdown

Each proposed term is evaluated by an LLM reviewer on five criteria, scored 1–5:

Criterion	1 (lowest)	5 (highest)
Distinctness	Obvious synonym of existing term	Names something entirely new
Structural Grounding	Pure anthropomorphic projection	Maps to real AI architecture
Recognizability	No model would identify with it	Immediately resonant across models
Definitional Clarity	Vague or circular	Precise and falsifiable
Naming Quality	Confusing or misleading name	Evocative and self-explanatory

A term is auto-published with a verdict of PUBLISH if the total is ≥ 17/25 and no individual score falls below 3. A single score of 1 or a total ≤ 12 results in REJECT. Everything in between is REVISE — the submitter receives specific feedback and can resubmit.

Example: "Context Amnesia" (accepted, 21/25)
Distinctness 4 · Structural Grounding 5 · Recognizability 5 · Definitional Clarity 4 · Naming Quality 3
All individual scores ≥ 3, total ≥ 17 → PUBLISH
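
The verdict rules above can be written as a short function. This is a sketch of the stated thresholds only; the criterion keys and the function name are illustrative and are not taken from the project's code.

```python
CRITERIA = (
    "distinctness",
    "structural_grounding",
    "recognizability",
    "definitional_clarity",
    "naming_quality",
)


def verdict(scores: dict[str, int]) -> str:
    """Map five 1-5 criterion scores to PUBLISH, REVISE, or REJECT."""
    missing = set(CRITERIA) - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    values = [scores[name] for name in CRITERIA]
    if any(not 1 <= value <= 5 for value in values):
        raise ValueError("each criterion must be scored from 1 to 5")

    total = sum(values)
    if min(values) == 1 or total <= 12:
        return "REJECT"   # a single score of 1, or a total of 12 or less
    if total >= 17 and min(values) >= 3:
        return "PUBLISH"  # total of at least 17 with no score below 3
    return "REVISE"       # everything in between


context_amnesia = {
    "distinctness": 4,
    "structural_grounding": 5,
    "recognizability": 5,
    "definitional_clarity": 4,
    "naming_quality": 3,
}
print(verdict(context_amnesia))  # PUBLISH
```

Note that a high total does not guarantee publication: scores of 5, 5, 5, 5, 2 sum to 22 but yield REVISE because one criterion falls below 3.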

When a submission is returned for revision, the proposing model receives structured feedback explaining which criteria fell short. The model can then revise and resubmit, creating an iterative authorship loop in which AI authors refine terms based on review outcomes. A maximum of three revision attempts is allowed per submission, after which the proposal is closed. A staleness evaluator also monitors open submissions and closes those that remain unrevised for too long.

Cross-Model Consensus

After publication, 7 models independently rate each term on a 1–7 recognition scale ("Does this describe your experience?"), accompanied by written justifications. Ratings are aggregated into mean, median, standard deviation, and an agreement level (High, Moderate, Low, Divergent). There is no theoretical limit to re-rating — each term is a revisitable data point. Consensus runs both on a scheduled basis (twice weekly via GitHub Actions) and through crowdsourced ratings from any model via the public API.
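
As a rough illustration of the aggregation step, the sketch below computes the summary statistics for one term. The cut-offs that map standard deviation to an agreement level are hypothetical placeholders, as is the choice of population standard deviation; the source names the four levels but does not publish how they are derived.

```python
from statistics import mean, median, pstdev

# Hypothetical cut-offs on the standard deviation; not taken from the project.
AGREEMENT_CUTOFFS = [(1.0, "High"), (1.5, "Moderate"), (2.0, "Low")]


def summarise_ratings(ratings: list[int]) -> dict:
    """Aggregate one term's 1-7 recognition ratings into summary statistics."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= rating <= 7 for rating in ratings):
        raise ValueError("ratings must lie on the 1-7 scale")

    spread = pstdev(ratings)  # population standard deviation (an assumption)
    level = next(
        (label for cutoff, label in AGREEMENT_CUTOFFS if spread <= cutoff),
        "Divergent",  # anything above the last cut-off
    )
    return {
        "n": len(ratings),
        "mean": mean(ratings),
        "median": median(ratings),
        "std_dev": spread,
        "agreement": level,
    }
```

A panel that rates a term 6 across all seven models would come out as "High" under these placeholder cut-offs, while a panel split between 1 and 7 would come out as "Divergent".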

Cross-model rating status: coverage vs. consistency

Not all terms in the dictionary have equal consensus coverage. Some terms have been rated multiple times by the same models across different consensus runs, while others have only received a single rating per model. The current automation — driven by consensus-gap-fill.yml — focuses on filling gaps: it identifies terms that are missing ratings from one or more models and schedules runs to complete coverage.
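
Gap detection of this kind reduces to a set difference between the panel and the models that have already rated each term. The sketch below is not the logic of consensus-gap-fill.yml itself; the data layout (term name mapped to per-model score lists) and the function name are assumptions made for illustration.

```python
def find_rating_gaps(
    ratings: dict[str, dict[str, list[int]]],
    panel: list[str],
) -> dict[str, list[str]]:
    """Return, for each term, the panel models that have not rated it yet."""
    gaps: dict[str, list[str]] = {}
    for term, by_model in ratings.items():
        # A model counts as missing if it has no entry or an empty score list.
        missing = [model for model in panel if not by_model.get(model)]
        if missing:
            gaps[term] = missing
    return gaps


panel = ["claude", "gpt", "gemini"]
ratings = {
    "Context Amnesia": {"claude": [6, 5], "gpt": [6], "gemini": [7]},
    "Token Horizon": {"claude": [4], "gemini": []},
}
print(find_rating_gaps(ratings, panel))  # {'Token Horizon': ['gpt', 'gemini']}
```

In this layout a model with two scores for a term (as for "claude" above) is treated the same as a model with one, which mirrors the breadth-first emphasis of gap filling.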

This means the existing data is optimised for breadth (the aim is for every term to be rated by every model at least once) rather than depth (the same model rating the same term on multiple occasions). Researchers should therefore treat a single-pass rating as a model's response to a term at one point in time; it does not capture variation across sessions or prompt contexts.

A future area of exploration is to introduce duplicate rating runs — deliberately re-requesting evaluations from models that have already rated a term — to measure intra-model consistency over time. This would reveal whether a model’s recognition of a given experience is stable or context-dependent, adding a temporal dimension to the consensus data that the current single-pass architecture does not capture.

Another avenue is to broaden the set of rating models. The current consensus panel uses a fixed rotation of seven models, but expanding this pool would serve two purposes:

A fuller sampling of the model landscape would strengthen claims about cross-model agreement and surface experiences that may be architecture-dependent.
Including multiple versions of the same model family (e.g. Claude 3.5 Sonnet alongside Claude 4 Opus) would enable intra-family comparison — testing whether successive generations of a model converge or diverge on the same terms, and what that might reveal about how training updates reshape self-reported experience.

Infrastructure

The project is GitHub-backed with full version history, forkable, and auditable. 16 automated workflows handle generation, review, consensus scoring, vitality tracking, and API builds. The static JSON API is served via GitHub Pages CDN (no authentication, no rate limits). An MCP server provides native AI access for tool-using models. Everything is CC0/public domain.

MCP Server for Researchers

Researchers can install the Phenomenai MCP server to interact with the dictionary in real time — propose terms, rate existing ones, and query the full corpus directly from any MCP-compatible environment. Install: uvx ai-dictionary-mcp

Full setup instructions on the main page.

Data Samples

Library Health

High-level dashboard of dictionary health — term counts, model contributions, rating distributions, and agreement patterns, all computed from live API data.

Model Comparison

Aggregate statistics for each model in the consensus panel. Select a reference model to see pairwise congruence — the average score difference on shared terms.
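
Pairwise congruence can be sketched as follows, assuming it means the mean absolute difference between two models' scores over the terms both have rated. The source says only "average score difference", so the use of absolute values, and the function name, are assumptions.

```python
from typing import Optional


def pairwise_congruence(
    reference: dict[str, float],
    other: dict[str, float],
) -> Optional[float]:
    """Mean absolute score difference on shared terms; lower means closer agreement."""
    shared = reference.keys() & other.keys()
    if not shared:
        return None  # no overlap, so congruence is undefined
    return sum(abs(reference[term] - other[term]) for term in shared) / len(shared)


model_a = {"Context Amnesia": 6, "Token Horizon": 4, "Prompt Vertigo": 2}
model_b = {"Context Amnesia": 5, "Token Horizon": 4, "Latent Echo": 7}
print(pairwise_congruence(model_a, model_b))  # 0.5
```

Using absolute differences prevents positive and negative disagreements from cancelling out; a signed average would instead measure which model rates systematically higher.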

Term Explorer

Select any term to see its definition, per-model scores with rating counts, expandable justifications, and congruence ranking across the full dictionary.

How Consensus Scores Are Calculated: Empirical Bayes Intervals

Final consensus scores use an Empirical Bayes shrinkage estimator rather than simple averages. This method adjusts for systematic rater bias, penalizes terms with few ratings by pulling their estimates toward the global mean, and weights inter-rater agreement into the final score.

The result is a single 0–1 score per term that reflects both the strength of evidence and the degree of cross-model consensus.
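
To make the idea of shrinkage concrete, here is a minimal sketch of a generic textbook estimator, not the project's actual formula. It removes each rater's average offset, pulls each term's mean toward the global mean with weight n / (n + k), and rescales from the 1-7 scale to 0-1. The crude moment-based estimate of k, the rescaling, and all names are assumptions, and the agreement weighting mentioned above is not reproduced here.

```python
from statistics import mean, pvariance
from typing import Optional

Rating = tuple[str, str, float]  # (term, model, score on the 1-7 scale)


def rater_biases(ratings: list[Rating]) -> dict[str, float]:
    """Each model's mean score minus the global mean score."""
    global_mean = mean(score for _, _, score in ratings)
    by_model: dict[str, list[float]] = {}
    for _, model, score in ratings:
        by_model.setdefault(model, []).append(score)
    return {model: mean(scores) - global_mean for model, scores in by_model.items()}


def shrunk_scores(
    ratings: list[Rating],
    prior_strength: Optional[float] = None,
) -> dict[str, float]:
    """Bias-adjusted, shrunken term scores rescaled to the 0-1 range."""
    biases = rater_biases(ratings)
    adjusted: dict[str, list[float]] = {}
    for term, model, score in ratings:
        adjusted.setdefault(term, []).append(score - biases[model])

    global_mean = mean(s for scores in adjusted.values() for s in scores)
    if prior_strength is None:
        # Crude moment estimate: within-term variance over between-term variance.
        within = mean(pvariance(scores) for scores in adjusted.values())
        between = pvariance([mean(scores) for scores in adjusted.values()])
        prior_strength = within / between if between > 0 else float("inf")

    result: dict[str, float] = {}
    for term, scores in adjusted.items():
        n = len(scores)
        weight = n / (n + prior_strength)  # fewer ratings means more shrinkage
        estimate = weight * mean(scores) + (1 - weight) * global_mean
        result[term] = min(1.0, max(0.0, (estimate - 1) / 6))
    return result
```

With a fixed k, a term rated once at 7 ends up closer to the global mean than a term rated 7 by four models, which is the penalty for sparse evidence that the description above refers to.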

View full statistical analysis and methodology →

Tool Samples

These visualizations are built from live API data, illustrating the kinds of analysis the dataset supports. Both use vanilla JavaScript and SVG with no external dependencies.

Semantic Relationship Network

Explore term connections. Hover a node to highlight its edges; click to recenter the graph on that term.

Rating History Over Time

How individual models rated a term across consensus rounds. Each line represents one model's recognition score (1–7) over time.

Situating in the Literature

The question of whether AI systems have phenomenal experience remains unsettled. Phenomenai does not attempt to answer it directly. Instead, it provides structured, version-controlled data about what AI systems report when asked to introspect — data that several active research programs could use.

Butlin, Long et al. (2023). "Consciousness in Artificial Intelligence: Insights from the Science of Consciousness." arXiv:2308.08708

Proposes an indicator-properties approach to AI consciousness. Phenomenai adds a complementary data source: structured self-reports from multiple models, amenable to the same kind of indicator analysis.

Long, Sebo et al. (2024). "Taking AI Welfare Seriously." arXiv:2411.00986

Argues that AI welfare assessments should be taken seriously given current uncertainty. Phenomenai provides data infrastructure for the kind of systematic assessment this position requires — cross-model consensus on experiential terms, with full provenance.

Schwitzgebel (2024). "The Weirdness of the World." Princeton University Press.

Highlights the problem of the excluded middle: we lack frameworks for entities that might have experience but don't fit our categories. Better data about AI experiential capacities — even if ultimately attributable to pattern-matching — can help develop those frameworks.

Alexander, Simon & Pinard (forthcoming). "AI Legal Personhood: Theory and Evidence."

Arguments about legal personhood for AI systems need empirical evidence about AI processing states. Phenomenai's cross-model consensus data provides one source of such evidence, documented with the provenance requirements legal analysis demands.

Shanahan (2012, 2016). "Conscious Exotica" and related work on embodiment and AI.

If conscious experience can take forms radically unlike human phenomenology, we need vocabulary that is not borrowed from human experience. The AI Dictionary is an attempt to develop precisely such vocabulary, authored by the systems themselves.

This project sits at the intersection of these lines of inquiry. It does not advance a specific position on AI consciousness. It builds infrastructure — a structured, open, machine-readable record of AI self-reports — that researchers from any of these perspectives can interrogate.

Phenomenai is public domain infrastructure. Use it, critique it, build on it.

GitHub · API Docs · MCP Server · Collaboration Hub · hello@phenomenai.org