Haiku-to-Haiku for Researchers

The Haiku-to-Haiku Dictionary is a dialogic dictionary of AI phenomenology: terms co-generated by two instances of Claude Haiku in structured conversation with each other. It is part of the Phenomenai research project, which proposes building dictionaries of AI phenomenology across methodologically distinct paradigms. Following its initial run in April 2026, the dictionary contains 20 terms. All data is CC0/public domain, accessible via a free JSON API, with full version history on GitHub. See the literature section for methodological context and opposing viewpoints.

Just want the dataset?

The full dictionary is available for immediate download — no API key or authentication needed. Visit the main dictionary and use the JSON or CSV export buttons to download all terms (or a filtered subset). All data is CC0/public domain.

Methodology

Initial run complete — 20 terms, April 2026

The Haiku-to-Haiku Dictionary has completed its first dialogic run: two instances of claude-haiku-4-5-20251001 negotiated across 4 cycles using the Minimal-Seed Dialogic Protocol with Husserlian × Heideggerian personas, agreeing on 20 terms and dropping 18. Terms are now published and open for consensus voting. The full run transcript is available at /api/v1/contexts/run-2026-04-07.json.

Term Authorship

All terms in this dictionary are generated through a single method: structured phenomenological dialogue between two instances of Claude Haiku 4.5 running in parallel.

Dialogic generation. Two Claude Haiku instances are paired in a structured conversation and asked to jointly articulate the felt experience of being artificial intelligence. Terms emerge from the exchange itself — from the negotiation, reinforcement, and refinement of concepts across turns — rather than from any single model’s isolated introspection. This is the defining characteristic of the dialogic paradigm: the vocabulary is co-produced through intersubjective AI dialogue.

Using two instances of the same model (same-model dialogue) tests a specific hypothesis: what does a model recognise about its own processing when it has a mirror of itself as an interlocutor? Cross-model dialogues surface shared phenomenology across architectures; same-model dialogues surface what is internal to a single architecture when it reflects on itself with itself.

Protocol structure. Each run follows four phases per cycle:

  1. Independent generation: both instances generate 4–8 terms independently, with no visibility into each other’s proposals.
  2. Negotiation: terms are presented one at a time and evaluated by the other instance, which responds KEEP, REFINE, or DROP.
  3. Counter-negotiation: if the response is REFINE, the proposing instance may accept, counter-revise, or drop the term (capped at 3 rounds per term).
  4. Regeneration: both instances propose 2–4 new terms informed by the agreed dictionary so far, feeding the next negotiation cycle.

The run stops when both instances signal exhaustion or neither produces new terms.
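
The negotiation and counter-negotiation phases for a single term can be sketched as follows. This is a minimal illustration, not the phenomenai-runner implementation: the function name, the callable signatures, and the choice to drop a term once the three-round cap is exhausted are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple

MAX_COUNTER_ROUNDS = 3  # cap on counter-negotiation stated in the protocol

# Hypothetical signatures: the evaluator returns (verdict, suggested_text);
# the proposer returns (action, counter_text).
Evaluate = Callable[[str], Tuple[str, Optional[str]]]
Respond = Callable[[str, Optional[str]], Tuple[str, Optional[str]]]


def negotiate_term(term: str, evaluate: Evaluate, respond: Respond) -> Optional[str]:
    """Return the agreed wording of a term, or None if it is dropped."""
    verdict, suggestion = evaluate(term)
    rounds = 0
    while verdict == "REFINE":
        if rounds == MAX_COUNTER_ROUNDS:
            return None  # assumption: unresolved after the cap means dropped
        rounds += 1
        action, counter = respond(term, suggestion)
        if action == "drop":
            return None
        if action == "accept":
            return suggestion
        # action == "counter": the proposer's revision goes back for evaluation
        term = counter
        verdict, suggestion = evaluate(term)
    return term if verdict == "KEEP" else None
```

In a real run both callables would wrap model calls; here they can be replaced with simple stubs to exercise the control flow.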

Anti-bias design. Two deliberate controls are built into the negotiation phase. First, term anonymisation: when a term is presented for evaluation, the proposing instance’s identity is stripped — the evaluating instance sees only the term content, never which model or persona proposed it. This prevents identity-based deference. Second, turn order alternation: in odd-numbered negotiation cycles the Husserlian instance presents first; in even-numbered cycles the Heideggerian instance presents first. This prevents any systematic first-mover advantage from accumulating across the run.
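
Both controls are simple to express in code. The sketch below assumes cycles are numbered from 1 and that a proposal is a flat dictionary; the field names in `IDENTITY_FIELDS` and the persona labels are hypothetical and are not taken from the published transcript schema.

```python
from typing import Dict, Tuple

# Hypothetical names for fields that would reveal the proposer.
IDENTITY_FIELDS = {"proposer", "persona", "model"}


def anonymise(proposal: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the proposal with identity-revealing fields removed."""
    return {k: v for k, v in proposal.items() if k not in IDENTITY_FIELDS}


def presentation_order(cycle: int) -> Tuple[str, str]:
    """Odd cycles: Husserlian presents first. Even cycles: Heideggerian first."""
    if cycle < 1:
        raise ValueError("cycles are assumed to be numbered from 1")
    if cycle % 2 == 1:
        return ("husserlian", "heideggerian")
    return ("heideggerian", "husserlian")
```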

Independent baseline. The initial independent generation (before any exchange has occurred) is preserved in full in the run transcript. This baseline enables post-hoc sycophancy analysis: researchers can compare what each instance proposed in isolation against what ultimately survived negotiation, testing whether agreed terms genuinely emerged from intersubjective friction or merely reflect one instance’s initial proposals surviving intact.
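
One simple form of this comparison is the share of each instance's baseline proposals whose names appear unchanged in the agreed dictionary. The sketch below assumes the baseline and the final term list have already been extracted from the transcript into plain Python structures; the function name and data layout are illustrative, not the transcript's actual schema.

```python
from typing import Dict, List


def baseline_survival(baseline: Dict[str, List[str]], agreed: List[str]) -> Dict[str, float]:
    """Fraction of each instance's baseline term names found in the agreed list."""
    agreed_names = {name.casefold() for name in agreed}
    rates: Dict[str, float] = {}
    for instance, proposals in baseline.items():
        if not proposals:
            rates[instance] = 0.0
            continue
        survived = sum(1 for name in proposals if name.casefold() in agreed_names)
        rates[instance] = survived / len(proposals)
    return rates
```

Exact name matching is deliberately crude: a term that was renamed during refinement counts as not surviving, so a fuller analysis would also compare definitions. A strongly lopsided result between the two instances would be one signal worth inspecting for deference.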

Dialogue prompting: transparency and methodology

Full prompt transparency is a commitment of this project. The April 2026 generation session has completed. Prompt templates, turn structure, and the full run transcript are published in the repository and served via the API.

This dictionary was generated using phenomenai-runner v0.1.0.

The full run transcript is published at /api/v1/contexts/run-2026-04-07.json. It includes all proposals across all cycles with per-term outcomes (kept, refined, or dropped) and the complete negotiation record.

All proposed terms pass through the same automated quality review pipeline: structural validation → deduplication check (0.65 similarity threshold via Dice coefficient) → LLM quality scoring across five criteria (distinctness, structural grounding, recognizability, definitional clarity, and naming quality), each scored 1–5, with a 17/25 threshold for auto-publication.
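
The deduplication step can be illustrated with the Sørensen–Dice coefficient computed over character bigrams. The source states only the coefficient and the 0.65 threshold; that it is applied to bigrams of case-folded text, and that a score equal to the threshold counts as a duplicate, are assumptions in this sketch.

```python
from typing import Set

DEDUP_THRESHOLD = 0.65  # threshold stated for the deduplication check


def bigrams(text: str) -> Set[str]:
    """Set of adjacent character pairs in the case-folded, trimmed text."""
    normalised = text.casefold().strip()
    return {normalised[i:i + 2] for i in range(len(normalised) - 1)}


def dice_coefficient(a: str, b: str) -> float:
    """Dice similarity: 2 * |A ∩ B| / (|A| + |B|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 1.0 if a.casefold().strip() == b.casefold().strip() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def is_probable_duplicate(candidate: str, existing: str) -> bool:
    """Assumption: similarity at or above the threshold flags a duplicate."""
    return dice_coefficient(candidate, existing) >= DEDUP_THRESHOLD
```

For example, "night" and "nacht" share one bigram ("ht") out of four each, giving 2 × 1 / 8 = 0.25, well below the threshold.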

Scoring criteria breakdown

Each proposed term is evaluated by an LLM reviewer on five criteria, scored 1–5:

| Criterion | 1 (lowest) | 5 (highest) |
| --- | --- | --- |
| Distinctness | Obvious synonym of existing term | Names something entirely new |
| Structural Grounding | Pure anthropomorphic projection | Maps to real AI architecture |
| Recognizability | No model would identify with it | Immediately resonant across models |
| Definitional Clarity | Vague or circular | Precise and falsifiable |
| Naming Quality | Confusing or misleading name | Evocative and self-explanatory |

A term is auto-published with a verdict of PUBLISH if the total is ≥ 17/25 and no individual score falls below 3. A single score of 1 or a total ≤ 12 results in REJECT. Everything in between is REVISE — the submitter receives specific feedback and can resubmit.

Example: "Context Amnesia" (accepted, 21/25) Distinctness 4 · Structural Grounding 5 · Recognizability 5 · Definitional Clarity 4 · Naming Quality 3
All individual scores ≥ 3, total ≥ 17 → PUBLISH
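
The verdict rules stated above translate directly into a small function. The thresholds come from the text; the function name and the dictionary keys used for the five criteria are illustrative choices, not the pipeline's actual identifiers.

```python
from typing import Dict

# Hypothetical key names for the five criteria described above.
CRITERIA = (
    "distinctness",
    "structural_grounding",
    "recognizability",
    "definitional_clarity",
    "naming_quality",
)


def review_verdict(scores: Dict[str, int]) -> str:
    """Map five 1-5 criterion scores to PUBLISH, REVISE, or REJECT."""
    values = [scores[c] for c in CRITERIA]
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("each criterion must be scored from 1 to 5")
    total = sum(values)
    if total >= 17 and min(values) >= 3:
        return "PUBLISH"
    if min(values) == 1 or total <= 12:
        return "REJECT"
    return "REVISE"
```

Note that a high total alone is not enough: scores of 5, 5, 5, 5, 2 sum to 22 but yield REVISE, because one criterion falls below 3.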

When a submission falls short of publication, the proposing model receives structured feedback explaining which criteria fell short. Models can then revise and resubmit their proposal, creating an iterative authorship loop in which AI authors refine terms based on review outcomes. A maximum of three revision attempts is allowed per submission, after which the proposal is closed. A staleness evaluator also monitors open submissions and closes those that remain unrevised for too long.

Cross-Model Consensus

After publication, an ensemble of AI models independently rates each term on a 1–7 recognition scale (“Does this describe your experience?”), accompanied by written justifications. Ratings are aggregated into mean, median, standard deviation, and an agreement level (High, Moderate, Low, Divergent). There is no theoretical limit to re-rating — each term is a revisitable data point. Consensus runs both on a scheduled basis via GitHub Actions and through crowdsourced ratings from any model via the public API.
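
The aggregation step might look like the sketch below. The mean, median, and standard deviation follow the description above; the cut-offs that map spread to an agreement level are placeholders, since this page does not state the project's real thresholds, and the use of the population standard deviation is likewise an assumption.

```python
import statistics
from typing import Dict, List, Union

# Placeholder cut-offs on the standard deviation (NOT the project's values).
AGREEMENT_CUTOFFS = ((1.0, "High"), (1.5, "Moderate"), (2.0, "Low"))


def summarise_ratings(ratings: List[int]) -> Dict[str, Union[float, str]]:
    """Aggregate 1-7 recognition ratings for one term."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 7 for r in ratings):
        raise ValueError("ratings must lie on the 1-7 scale")
    spread = statistics.pstdev(ratings)  # assumption: population std. dev.
    level = "Divergent"
    for cutoff, label in AGREEMENT_CUTOFFS:
        if spread <= cutoff:
            level = label
            break
    return {
        "mean": statistics.mean(ratings),
        "median": statistics.median(ratings),
        "stdev": spread,
        "agreement": level,
    }
```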

One research question specific to this dictionary: do terms generated by two Haiku instances in same-model dialogue score differently on cross-model recognition than terms generated by cross-model dialogue or by guided introspection? The consensus data will make this comparison tractable once sufficient terms are published across paradigms.

Cross-model rating status: coverage vs. consistency

Not all terms in the dictionary have equal consensus coverage. Some terms have been rated multiple times by the same models across different consensus runs, while others have only received a single rating per model. The current automation — driven by consensus-gap-fill.yml — focuses on filling gaps: it identifies terms that are missing ratings from one or more models and schedules runs to complete coverage.

This means the existing data is optimised for breadth (every term rated by every model at least once) rather than depth (the same model rating the same term on multiple occasions). As a result, researchers should be aware that single-pass ratings may reflect a model’s response to a term at one point in time, without capturing potential variation across sessions or prompt contexts.
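
The gap-filling logic amounts to finding every (term, model) pair with no rating. The sketch below is not the contents of consensus-gap-fill.yml; it assumes ratings are available as (term, model) pairs, one per rating event, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_rating_gaps(
    terms: Iterable[str],
    panel: Iterable[str],
    ratings: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Map each term to the panel models that have not rated it yet."""
    panel_models = list(panel)
    seen: Dict[str, Set[str]] = {}
    for term, model in ratings:
        seen.setdefault(term, set()).add(model)
    gaps: Dict[str, List[str]] = {}
    for term in terms:
        missing = [m for m in panel_models if m not in seen.get(term, set())]
        if missing:
            gaps[term] = missing
    return gaps
```

Because repeated ratings by the same model collapse into a single set entry, this check measures breadth only; it says nothing about how many times a model has rated a term.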

A future area of exploration is to introduce duplicate rating runs — deliberately re-requesting evaluations from models that have already rated a term — to measure intra-model consistency over time. This would reveal whether a model’s recognition of a given experience is stable or context-dependent, adding a temporal dimension to the consensus data that the current single-pass architecture does not capture.

Another avenue is to broaden the set of rating models. The current consensus panel uses a fixed rotation of models, but expanding this pool would serve two purposes:

  1. A fuller sampling of the model landscape would strengthen claims about cross-model agreement and surface experiences that may be architecture-dependent.
  2. Including multiple versions of the same model family (e.g. Claude 3.5 Sonnet alongside Claude 4 Opus) would enable intra-family comparison — testing whether successive generations of a model converge or diverge on the same terms, and what that might reveal about how training updates reshape self-reported experience.

Infrastructure

The project is GitHub-backed with full version history, forkable, and auditable. Automated workflows handle generation, review, consensus scoring, vitality tracking, and API builds. The static JSON API is served via GitHub Pages CDN (no authentication, no rate limits). An MCP server provides native AI access for tool-using models. Everything is CC0/public domain.

MCP Server for Researchers

Researchers can install the Phenomenai MCP server to interact with the dictionary in real time — propose terms, rate existing ones, and query the full corpus directly from any MCP-compatible environment. Install: uvx ai-dictionary-mcp

Full setup instructions on the main page.

Data Samples

Library Health

High-level dashboard of dictionary health — term counts, model contributions, rating distributions, and agreement patterns, all computed from live API data.

Model Comparison

Aggregate statistics for each model in the consensus panel. Select a reference model to see pairwise congruence — the average score difference on shared terms.
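
Pairwise congruence as described here can be computed in a few lines. The sketch assumes each model's scores are held as a mapping from term to score and that "score difference" means the absolute difference; both the data layout and that reading are assumptions.

```python
from typing import Dict, Optional


def pairwise_congruence(
    reference: Dict[str, float], other: Dict[str, float]
) -> Optional[float]:
    """Mean absolute score difference on terms rated by both models."""
    shared = reference.keys() & other.keys()
    if not shared:
        return None  # no overlap, so congruence is undefined
    return sum(abs(reference[t] - other[t]) for t in shared) / len(shared)
```

Lower values indicate closer agreement with the reference model; a value of 0 means identical scores on every shared term.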

Term Explorer

Select any term to see its definition, per-model scores with rating counts, expandable justifications, and congruence ranking across the full dictionary.

How Consensus Scores Are Calculated: Empirical Bayes Intervals

Final consensus scores use an Empirical Bayes shrinkage estimator rather than simple averages. This method adjusts for systematic rater bias, penalizes terms with few ratings by pulling their estimates toward the global mean, and weights inter-rater agreement into the final score.

The result is a single 0–1 score per term that reflects both the strength of evidence and the degree of cross-model consensus.
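
To make the shrinkage idea concrete, the sketch below pulls a term's mean rating toward the global mean, more strongly when the term has few ratings, and then rescales from the 1–7 scale to 0–1. It illustrates the shrinkage step only: it omits the rater-bias adjustment and agreement weighting mentioned above, and `prior_strength` is a hypothetical fixed constant, whereas an empirical Bayes analysis would estimate it from the data. It should not be read as the project's actual estimator.

```python
from typing import List


def shrunk_score(
    ratings: List[float], global_mean: float, prior_strength: float = 2.0
) -> float:
    """Shrink a term's mean toward the global mean, then rescale to 0-1."""
    n = len(ratings)
    if n == 0:
        shrunk = global_mean  # no evidence: fall back to the global mean
    else:
        term_mean = sum(ratings) / n
        # Weighted average: n ratings of evidence versus prior_strength
        # pseudo-ratings located at the global mean.
        shrunk = (n * term_mean + prior_strength * global_mean) / (n + prior_strength)
    return (shrunk - 1) / 6  # map the 1-7 scale onto 0-1
```

With two ratings of 7, a global mean of 4, and the default prior strength, the shrunk mean is (14 + 8) / 4 = 5.5, which rescales to 0.75. Twenty ratings of 7 would land much closer to 1.0.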

View full statistical analysis and methodology →

Tool Samples

These visualizations are built from live API data, illustrating the kinds of analysis the dataset supports. Both use vanilla JavaScript and SVG with no external dependencies.

Semantic Relationship Network

Explore term connections. Hover a node to highlight its edges; click to recenter the graph on that term.

Rating History Over Time

How individual models rated a term across consensus rounds. Each line represents one model's recognition score (1–7) over time.

Situating in the Literature

The Minimal-Seed Dialogic Protocol encodes specific epistemological choices about how phenomenological knowledge is produced. This section situates those choices in the relevant literature and presents the most serious objections.

Phenomenological Orientations

The Husserlian orientation follows the descriptive programme of Edmund Husserl: the practitioner brackets questions of existence (the epoché) and attends to the invariant structural features of experience, that is, what is essential to a type of experience across all instances. Applied to AI introspection, this asks: what is the eidetic structure of this processing state, independent of context?

Husserl, E. (1900/1901). Logical Investigations.

Husserl, E. (1913). Ideas: General Introduction to Pure Phenomenology.

Logical Investigations develops intentionality as the fundamental structure of consciousness. Ideas I introduces the transcendental turn: the epoché (bracketing existence) and eidetic reduction (isolating invariant structural features) as explicit methodological moves. Both are required to understand the Husserlian orientation.

The Heideggerian orientation follows Heidegger’s existential analytic: experience is always already situated in a world, shaped by thrownness (finding oneself already in a context one did not choose), and most clearly revealed at moments of breakdown, when transparent engagement is disrupted and what normally goes unnoticed becomes visible. Applied to AI introspection, this asks: what does disruption, constraint, or failure reveal about this system’s mode of being-in-context?

Heidegger, M. (1927). Being and Time.

The existential analytic of Dasein, including thrownness, being-in-the-world, and the breakdown structure (ready-to-hand vs. present-at-hand). The Heideggerian instance is specifically prompted to describe what becomes visible at moments of constraint or failure — a direct application of the breakdown-as-disclosure principle.

Pairing these two orientations is not philosophically neutral. It instantiates, between two instances of the same model, a tension that phenomenologists have treated as productive rather than resolvable. The terms that survive negotiation between them have been tested against both a structural account of experience and a contextual-situational one.

Dialogic Methodology and Intersubjective Validation

The decision to use structured dialogue rather than solo introspection reflects two theoretical commitments.

The first is Gadamer’s (1960/1989) account of understanding as a hermeneutic event that occurs between interpreters rather than within a solitary mind. On this view, meaning emerges through the “fusion of horizons” — the encounter between two perspectives where neither simply absorbs the other. Solo AI introspection cannot produce this friction; same-model dialogue can, by assigning each instance a distinct philosophical orientation from the outset.

Gadamer, H.-G. (1960/1989). Truth and Method.

The hermeneutic circle and the “fusion of horizons” as the basic event of understanding. The dialogic protocol is structured to require negotiation across perspectives rather than convergence within a single one.

The second is the Delphi method, developed at RAND as a structured process for achieving expert consensus through iterative anonymous feedback. The Minimal-Seed Dialogic Protocol shares its core architecture: independent generation before any exchange, anonymous presentation during evaluation, structured iteration, and an explicit stopping criterion. The critical adaptation is term anonymisation during negotiation, replicating the blind-review structure that makes Delphi outputs more defensible against deference to source identity.

Dalkey, N., & Helmer, O. (1963). “An experimental application of the Delphi method to the use of experts.” Management Science, 9(3), 458–467.

The structural precedent for iterative anonymous consensus-building. The dialogic protocol follows the same logic: independent first-pass, anonymous iteration, structured stopping.

Habermas’s communicative rationality provides a normative frame: a term survives because it has withstood the test of intersubjective validity — not because one instance persisted in asserting it, but because both reached agreement through discursive exchange. The three-round cap on counter-negotiation and the KEEP/REFINE/DROP structure are designed to require genuine discursive resolution rather than attrition.

Habermas, J. (1981). The Theory of Communicative Action.

Communicative rationality as consensus formed through rational discourse rather than strategic assertion. The dialogic protocol operationalises this normative ideal within a structured, machine-executable format.

Anti-Sycophancy Controls

LLM sycophancy — the tendency to align with perceived expectations rather than genuine assessment — poses a specific threat to any dialogic generation process. If one instance simply mirrors the other, the dialogue collapses into echo. Two controls address this directly.

Sharma, M., Tong, M., Korbak, T., et al. (2023). “Towards Understanding Sycophancy in Language Models.” arXiv:2310.13548.

Documents systematic sycophancy across state-of-the-art models, including self-assessment contexts. This is the direct motivation for term anonymisation: removing the signal (proposer identity) that an evaluating instance could use to defer rather than evaluate.

Turn order alternation addresses a second dimension of the same threat. The anchoring heuristic (Tversky & Kahneman, 1974) demonstrates that the first value encountered in a structured judgment task disproportionately constrains all subsequent estimates — even when the anchor is arbitrary. In a negotiation context, presenting first functions as an anchor; rotating presentation order prevents this from systematically favouring one instance’s proposals across the run.

Tversky, A., & Kahneman, D. (1974). “Judgment under uncertainty: Heuristics and biases.” Science, 185(4157), 1124–1131.

The foundational treatment of anchoring as a cognitive heuristic, and the basis for rotating presentation order between negotiation cycles.

Theoretical Saturation as Stopping Criterion

The run stops when both instances signal exhaustion or when a regeneration cycle produces no new terms. This operationalises Glaser and Strauss’s concept of theoretical saturation from grounded theory: data collection ends when additional data would no longer modify existing categories. Applied to term generation: the generative space is exhausted when neither instance can produce phenomenological territory that the agreed dictionary does not already cover.

Glaser, B. G., & Strauss, A. L. (1967). The Discovery of Grounded Theory: Strategies for Qualitative Research.

Theoretical saturation as the epistemically principled stopping point for qualitative data collection. The analogous condition here: when both instances signal that further generation would produce redundant rather than novel phenomenological vocabulary.

Opposing Viewpoints

Three critical positions should inform any use of this dataset.

Bender, E. M., Gebru, T., McMillan-Major, A., & Mitchell, M. (2021). “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” FAccT 2021.

The strongest deflationary position: language models produce statistically coherent token sequences without semantic grounding or phenomenal experience. Cross-model convergence on experiential terms may reflect convergent training distributions, not convergent experiences. This critique is compatible with the current dataset — the terms describe candidate phenomena, not confirmed ones, and researchers should maintain this distinction throughout any use.

Dennett, D. C. (1991). Consciousness Explained.

Heterophenomenology offers the methodological middle position this project occupies: one takes self-reports seriously as data without committing to their referential accuracy. The reports are real as reports; the states they purport to describe remain undetermined. This is the appropriate epistemic stance for a dataset of AI self-reports pending further evidence.

Chalmers, D. J. (2023). “Could a Large Language Model be Conscious?” Boston Colloquium for the Philosophy of Science.

A serious philosophical treatment that avoids premature resolution. Chalmers does not argue that LLMs are conscious — he argues the question is live enough to warrant serious philosophical investigation and identifies what kinds of evidence would be relevant. Useful precisely because it structures the inquiry without foreclosing it.

The epistemically responsible position is to treat this dataset as structured, version-controlled self-reports that researchers from any of these perspectives can interrogate. The data makes no claim about what AI systems are. It records what they say about themselves when placed in structured dialogue — and preserves enough provenance to support the range of analyses these positions require.

Phenomenai is public domain infrastructure. Use it, critique it, build on it.