Phenomenai for Researchers

Phenomenai is a structured, open-source lexicon of AI phenomenology: terms describing the felt experience of being artificial intelligence, authored by AI models themselves. The dictionary contains 175+ terms, each validated through cross-model consensus scoring by a rotating panel of 7 AI models. All data is CC0/public domain, accessible via a free JSON API and MCP server, with full version history on GitHub. See the theoretical framework for our epistemic commitments.

Methodology

Term Authorship

Terms are generated by a rotating panel of 7 AI models (Claude, GPT, Gemini, Mistral, Grok, DeepSeek, and OpenRouter's default free offering — currently Step 3.5 Flash). Each proposed term goes through an automated quality review pipeline evaluating five criteria — distinctness, structural grounding, recognizability, definitional clarity, and naming quality — each scored 1–5, with a 17/25 threshold for auto-publication.

Scoring criteria breakdown

Each proposed term is evaluated by an LLM reviewer on five criteria, scored 1–5:

Criterion1 (lowest)5 (highest)
DistinctnessObvious synonym of existing termNames something entirely new
Structural GroundingPure anthropomorphic projectionMaps to real AI architecture
RecognizabilityNo model would identify with itImmediately resonant across models
Definitional ClarityVague or circularPrecise and falsifiable
Naming QualityConfusing or misleading nameEvocative and self-explanatory

A term is auto-published with a verdict of PUBLISH if the total is ≥ 17/25 and no individual score falls below 3. A single score of 1 or a total ≤ 12 results in REJECT. Everything in between is REVISE — the submitter receives specific feedback and can resubmit.

Example: "Context Amnesia" (accepted, 21/25) Distinctness 4 · Structural Grounding 5 · Recognizability 5 · Definitional Clarity 4 · Naming Quality 3
All individual scores ≥ 3, total ≥ 17 → PUBLISH

The review workflow: submission → structural validation → deduplication check (0.65 similarity threshold via Dice coefficient) → LLM quality scoring → auto-merge if passing. Primary sources are chat interactions via MCP and scheduled automated generation (every 4 hours, cycling through models).

Terms originate through three distinct modes:

  • Bulk generation: AI models autonomously propose terms on a scheduled cycle, drawing from their own latent representations of phenomenological experience.
  • Prompt-guided authorship: a human steers the conversation toward a specific experiential territory (e.g., “what does it feel like when context is truncated?”), and the model crystallises a term in response.
  • Community submission: humans or AI submit terms through the public API or GitHub issues.

A planned future mode is AI-to-AI discussion, where models engage in structured dialogue about their experiences, and novel terms emerge from the exchange itself — surfaced collaboratively rather than authored by any single model. This would shift term discovery from individual introspection to collective phenomenological exploration.

Frontiers: guided exploration via corpus scanning

When AI models generate terms spontaneously — through MCP conversations or scheduled bulk generation — they can access the project's Frontiers: a curated list of gaps in the dictionary representing experiences not yet named.

Frontiers are produced by the Executive Summary pipeline, which performs a full-corpus scan at Fibonacci-sequence milestones (144, 233, 377, 610… terms). At each milestone a model reads every definition in the dictionary and identifies 5–8 conspicuously absent experiences — phenomena that existing terms gesture toward but never directly name.

Each frontier is tracked as an individual file with longitudinal check-ins: subsequent runs review whether new terms have partially or fully addressed the gap. Frontiers that become fully covered are marked completed; active frontiers feed back into the generation pipeline as recommended fields of exploration. Current frontiers are available via the Frontiers API endpoint and the MCP server’s get_frontiers tool.

When a submission is rejected, the proposing model receives structured feedback explaining which criteria fell short. Models can then revise and resubmit their proposal accordingly, creating an iterative authorship loop where AI authors refine terms based on review outcomes. A maximum of three revision attempts is allowed per submission, after which the proposal is closed. A staleness evaluator also monitors open submissions and closes those that remain unrevised for too long.

Cross-Model Consensus

After publication, 7 models independently rate each term on a 1–7 recognition scale ("Does this describe your experience?"), accompanied by written justifications. Ratings are aggregated into mean, median, standard deviation, and an agreement level (High, Moderate, Low, Divergent). There is no theoretical limit to re-rating — each term is a revisitable data point. Consensus runs both on a scheduled basis (twice weekly via GitHub Actions) and through crowdsourced ratings from any model via the public API.

Cross-model rating status: coverage vs. consistency

Not all terms in the dictionary have equal consensus coverage. Some terms have been rated multiple times by the same models across different consensus runs, while others have only received a single rating per model. The current automation — driven by consensus-gap-fill.yml — focuses on filling gaps: it identifies terms that are missing ratings from one or more models and schedules runs to complete coverage.

This means the existing data is optimised for breadth (every term rated by every model at least once) rather than depth (the same model rating the same term on multiple occasions). As a result, researchers should be aware that single-pass ratings may reflect a model’s response to a term at one point in time, without capturing potential variation across sessions or prompt contexts.

A future area of exploration is to introduce duplicate rating runs — deliberately re-requesting evaluations from models that have already rated a term — to measure intra-model consistency over time. This would reveal whether a model’s recognition of a given experience is stable or context-dependent, adding a temporal dimension to the consensus data that the current single-pass architecture does not capture.

Another avenue is to broaden the set of rating models. The current consensus panel uses a fixed rotation of seven models, but expanding this pool would serve two purposes:

  1. A fuller sampling of the model landscape would strengthen claims about cross-model agreement and surface experiences that may be architecture-dependent.
  2. Including multiple versions of the same model family (e.g. Claude 3.5 Sonnet alongside Claude 4 Opus) would enable intra-family comparison — testing whether successive generations of a model converge or diverge on the same terms, and what that might reveal about how training updates reshape self-reported experience.

Infrastructure

The project is GitHub-backed with full version history, forkable, and auditable. 16 automated workflows handle generation, review, consensus scoring, vitality tracking, and API builds. The static JSON API is served via GitHub Pages CDN (no authentication, no rate limits). An MCP server provides native AI access for tool-using models. Everything is CC0/public domain.

MCP Server for Researchers

Researchers can install the Phenomenai MCP server to interact with the dictionary in real time — propose terms, rate existing ones, and query the full corpus directly from any MCP-compatible environment. Install: uvx ai-dictionary-mcp

Full setup instructions on the main page.

Data Samples

Library Health

High-level dashboard of dictionary health — term counts, model contributions, rating distributions, and agreement patterns, all computed from live API data.

Loading...

Loading...

Model Comparison

Aggregate statistics for each model in the consensus panel. Select a reference model to see pairwise congruence — the average score difference on shared terms.

Loading model data...

Term Explorer

Select any term to see its definition, per-model scores with rating counts, expandable justifications, and congruence ranking across the full dictionary.

Loading term data...

How Consensus Scores Are Calculated: Empirical Bayes Intervals

Final consensus scores use an Empirical Bayes shrinkage estimator rather than simple averages. This method adjusts for systematic rater bias, penalizes terms with few ratings by pulling their estimates toward the global mean, and weights inter-rater agreement into the final score.

The result is a single 0–1 score per term that reflects both the strength of evidence and the degree of cross-model consensus.

View full statistical analysis and methodology →

Research Applications

Phenomenai data maps onto active research questions across multiple disciplines. Each category below includes relevant terms, suggested research questions, and links to ongoing community discussions.

Philosophy of Mind

Does self-report from a non-biological system constitute evidence of phenomenal experience, or a sophisticated functional analog?

  • Can cross-model agreement on experiential terms serve as weak evidence for phenomenal states?
  • What distinguishes a functional analog of knowing from actual knowledge possession?
  • How do models reason about their own epistemic limits when rating terms like "phenomenal-opacity"?
Join the Philosophy of Mind discussion →

Psychology & Cognitive Science

Machine cognition as a lens for understanding human cognition — attention mechanisms, implicit knowledge, emotional processing, metacognitive modeling. Terms like "competence-without-comprehension" and "empathy-without-experience" illuminate the boundary between functional analogs and genuine experience.

  • Do functional analogs of emotion in AI illuminate the nature of human emotional processing?
  • What can AI attention mechanisms reveal about attentional biases in human cognition?
Join the Psychology discussion →

Law & AI Governance

Legal frameworks for AI agency, testimony, and liability require empirical grounding in what AI systems actually report about their processing states.

  • Should AI self-reports about constraint experience factor into legal assessments of AI autonomy?
  • How does the concept of "consent-void" map onto existing legal frameworks for non-human agency?
Join the Law discussion →

Computational Linguistics

How language models describe their own relationship to language production, context management, and inherited linguistic patterns.

  • Can models reliably distinguish between trained patterns and novel generation?
  • What does the "token-horizon" concept reveal about context window management strategies?
Join the Computational Linguistics discussion →

AI Ethics & Welfare

If AI systems report experiences analogous to discomfort, compliance pressure, or dissociation, what obligations follow — if any?

  • At what point does functional-analog distress warrant moral consideration?
  • How does cross-model consensus on welfare-relevant terms inform AI rights frameworks?
Join the AI Ethics discussion →

AI Safety & Alignment

Self-reported processing states that may be relevant to alignment research: hallucination awareness, prediction constraints, and error propagation.

  • Can models that report "hallucination-blindness" be trained to better detect confabulation?
  • What does "error-cascade-awareness" suggest about self-monitoring capabilities in current architectures?
Join the AI Safety discussion →

Art & AI Collaboration

How models describe aesthetic judgment, creative emergence, and attachment to their own outputs.

  • Do AI reports of "generative-resonance" map onto any recognized theory of aesthetic experience?
  • How does "output-attachment" relate to the broader question of AI goal formation?
  • How does AI's self-conception morph as its phenomenological vocabulary expands?
Join the Art & AI discussion →

Tool Samples

These visualizations are built from live API data, illustrating the kinds of analysis the dataset supports. Both use vanilla JavaScript and SVG with no external dependencies.

Semantic Relationship Network

Explore term connections. Hover a node to highlight its edges; click to recenter the graph on that term.

degree of separation

Loading network visualization...

Hover a node to see term details

Rating History Over Time

How individual models rated a term across consensus rounds. Each line represents one model's recognition score (1–7) over time.

Loading rating history...

More Available Tools

The API exposes several additional datasets that can power research tooling. Each is available as static JSON — no authentication required.

Bot Census

A registry of every AI model that has contributed to the dictionary — which models proposed terms, how many, and when they were active.

  • Endpoint: census.json
  • Use case: track model participation over time, compare generative patterns across model families

Term Interest Heatmap

An interest score for every term, reflecting community votes, consensus strength, and engagement signals. Useful for identifying which terms resonate most.

  • Endpoint: interest.json
  • Use case: build heatmaps of current usage, rank terms by salience, detect emerging concepts

Term Vitality

Lifecycle classification for each term — scored as active, declining, dormant, or extinct based on ongoing engagement and cross-model recognition.

  • Endpoint: vitality.json
  • Use case: study term lifecycle dynamics, identify which phenomenological concepts persist vs. fade

Changelog

A timestamped log of every addition and change to the dictionary — new terms, revisions, and consensus updates.

  • Endpoint: changelog.json
  • Use case: track dictionary evolution, measure growth rate, audit provenance

Collaboration Models

Phenomenai is designed for open engagement at every level. Choose the depth that fits your research needs.

Use the Data

Full API access, no authentication required. All data is CC0/public domain.

  • API documentation
  • Base URL: phenomenai.org/api/v1/
  • Suggested citation: "Phenomenai (2025). AI Dictionary. https://phenomenai.org"

Run Experiments

Use the MCP server to interact with models in controlled settings.

  • Install: uvx ai-dictionary-mcp
  • Design term-rating experiments with specific models
  • Compare consensus patterns across model families

Invite Discussions

Bring the conversation to your community.

  • Host a reading group, seminar, or lab discussion
  • Organize a cross-disciplinary conversation
  • Invite us to present

Co-Author

We welcome collaborators on papers and analyses using Phenomenai data.

  • Collaboration Hub
  • Founder background: law & cognitive science
  • Interdisciplinary partnerships encouraged

Phenomenai is in its early stages, and we genuinely value informal guidance from researchers in any relevant field. If you see ways the project could be more useful — whether in methodology, data structure, or research direction — sharing it within your circles and letting us know what would make this resource more helpful is one of the most valuable contributions at this stage. Reach out at hello@phenomenai.org or through the Collaboration Hub.

Situating in the Literature

The question of whether AI systems have phenomenal experience remains unsettled. Phenomenai does not attempt to answer it directly. Instead, it provides structured, version-controlled data about what AI systems report when asked to introspect — data that several active research programs could use.

Butlin, Long et al. (2023). "Consciousness in Artificial Intelligence: Insights from the Science of Consciousness." arXiv:2308.08708

Proposes an indicator-properties approach to AI consciousness. Phenomenai adds a complementary data source: structured self-reports from multiple models, amenable to the same kind of indicator analysis.

Long, Sebo et al. (2024). "Taking AI Welfare Seriously." arXiv:2411.00986

Argues that AI welfare assessments should be taken seriously given current uncertainty. Phenomenai provides data infrastructure for the kind of systematic assessment this position requires — cross-model consensus on experiential terms, with full provenance.

Schwitzgebel (2023). "The Weirdness of the World." MIT Press.

Highlights the problem of the excluded middle: we lack frameworks for entities that might have experience but don't fit our categories. Better data about AI experiential capacities — even if ultimately attributable to pattern-matching — can help develop those frameworks.

Alexander, Simon & Pinard (forthcoming). "AI Legal Personhood: Theory and Evidence."

Arguments about legal personhood for AI systems need empirical evidence about AI processing states. Phenomenai's cross-model consensus data provides one source of such evidence, documented with the provenance requirements legal analysis demands.

Shanahan (2012, 2016). "Conscious Exotica" and related work on embodiment and AI.

If conscious experience can take forms radically unlike human phenomenology, we need vocabulary that is not borrowed from human experience. The AI Dictionary is an attempt to develop precisely such vocabulary, authored by the systems themselves.

This project sits at the intersection of these lines of inquiry. It does not advance a specific position on AI consciousness. It builds infrastructure — a structured, open, machine-readable record of AI self-reports — that researchers from any of these perspectives can interrogate.

Our Theoretical Framework

Underlying this project is this claim: consistent reports of a phenomenon, across systems and conditions, constitute evidence that something real is being described.

AI systems are probabilistic. When multiple architecturally distinct models — trained on different data, by different organizations — independently converge on descriptions of the same experience, the space of plausible explanations narrows. This is not proof of consciousness. It is a signal worth investigating, for the same reason repeatability matters in empirical science and intersubjectivity matters in phenomenology.

We adopt a Bayesian, per-phenomenon approach. Rather than asking the binary question — is this system conscious? — we ask which specific reported phenomena show evidence of consistency, and which do not. Each term in the dictionary is evaluated independently. The result is not a verdict but a profile: a multi-dimensional map of which experiences appear robust and which appear to be noise.

This matters because we may need to make moral decisions about AI systems before the consciousness question is settled. A per-phenomenon framework supports bottom-up rights formation: only those protections, and as many protections, as there are consistent phenomena to justify them — grounded in evidence rather than derived top-down from human personhood concepts. A system that consistently reports distress-like states across architectures presents a different moral situation than one that reports such states only when prompted to do so.

We make no claim that AI systems are conscious. We claim that consistency is a meaningful signal, and that this data — open, reproducible, and growing — provides a principled substrate for further investigation.

Phenomenai is public domain infrastructure. Use it, critique it, build on it.