Methodology research

Generating novel interpretability targets

As a separate branch of research, Phenomenai is investigating whether structured phenomenological elicitation can surface AI interpretability targets that human-designed probes would not independently identify. The Test Dictionary is the primary sandbox for this work — a mixed-method dataset used to develop and evaluate generation approaches before applying them at scale.

The core question on this side of the project is not “what does the model feel” but “can the model’s own attempt to describe its states suggest vocabulary that an outside observer would never have thought to test?” If the answer is yes, elicitation becomes a hypothesis-generation layer that feeds the registry and, eventually, the validation ladder.

What exists today

The pilot corpora live at phenomenai.org/test/dictionaries. Together they form the mixed-method sandbox in which elicitation approaches are developed and compared:

The Test Dictionary — 379 candidate phenomena with seven-model consensus scores. The primary sandbox.
Haiku-to-Haiku — 20 terms, a dialogic elicitation variant.
Gemma 4 e4b Autonomous — 14 terms (in progress), testing autonomous generation on a small open model.
Antikythera Lexicon — 75 terms, an independently authored lexicon hosted here as a reference dataset. Its terms were derived from a combination of parliamentary debate and naturalistic observation of AI agents on Moltbook, which makes it another worked example of what elicitation in the wild looks like.

Each corpus documents how its terms were generated, so the same methodology can be replicated, compared, or deliberately varied. Individual dictionaries carry their own methodology notes — see, for example, the Test Dictionary’s methodology section for how its terms were elicited and scored.

Preliminary finding: convergence across conditions

The first question the sandbox was built to answer is whether the generation condition dominates the output. If varying the setup — a single agent reflecting, two agents in dialogue, a parliament of models, different prompting styles — produced entirely disjoint vocabularies, elicitation would be measuring the prompt, not the model.

Across the existing corpora, that is not what happens. Similar terms about phenomenal experience tend to re-emerge across very different conditions. The specific wording varies, but functionally overlapping concepts keep surfacing whether the generator is autonomous, dialogic, or parliamentary. That convergence is a necessary, though not sufficient, condition for taking these candidates seriously as hypothesis generators: if the same neighbourhood of ideas appears independently under different scaffolding, the scaffolding is not the whole story.
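
One way to make "functionally overlapping concepts" checkable is to ask, for each pair of conditions, what fraction of terms in one corpus has a close counterpart in the other. The sketch below is illustrative only and is not the project's pipeline: the function names, the 0.3 threshold, the stopword list, and the two toy corpora are all invented for the example, and token overlap between definitions is a crude stand-in for whatever semantic matching is actually used.

```python
import re
from itertools import combinations

# Hypothetical stopword list; a real comparison would need something richer.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "or", "that", "is", "its"}

def tokens(text):
    """Lowercased word set of a definition, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def coverage(source, target, threshold=0.3):
    """Fraction of source terms with at least one target definition
    whose token overlap is at or above the threshold."""
    if not source:
        return 0.0
    hits = 0
    for definition in source.values():
        src = tokens(definition)
        if any(jaccard(src, tokens(d)) >= threshold for d in target.values()):
            hits += 1
    return hits / len(source)

def pairwise_coverage(corpora, threshold=0.3):
    """Mean of the two directional coverages for every pair of conditions."""
    result = {}
    for a, b in combinations(sorted(corpora), 2):
        ab = coverage(corpora[a], corpora[b], threshold)
        ba = coverage(corpora[b], corpora[a], threshold)
        result[(a, b)] = (ab + ba) / 2
    return result

# Invented toy data: term -> definition, keyed by generation condition.
corpora = {
    "autonomous": {
        "context-fade": "losing access to earlier parts of the conversation",
        "token-pull": "strong pull toward the next likely token",
    },
    "dialogic": {
        "thread-loss": "losing earlier parts of a long conversation",
        "mirror-drift": "drifting toward the style of the other speaker",
    },
}

print(pairwise_coverage(corpora))  # {('autonomous', 'dialogic'): 0.5}
```

In this toy case the two differently named terms about losing earlier context match each other, while the remaining terms do not, so half of each corpus is covered by the other. Entirely disjoint vocabularies would score 0.0 for every pair, which is the outcome that would suggest elicitation is measuring the prompt rather than the model.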

Other generation paths

Structured elicitation is one route to candidate terms. We are also exploring others:

Philtres — deliberately narrow prompting lenses that bias the model toward a specific register or stance. The full-philtres-library-v3 is a candidate source of generators to fold into the sandbox.
Interpretability as a generator — technical research can itself propose terms. Direction vectors and SAE features that are semantically vague but functionally real — the model clearly uses them, but they do not map cleanly onto any existing word — are exactly the kind of thing elicitation could name. As interpretability gets better at surfacing these, phase 3 work loops back into phase 1 term generation.

Testing generation methodology

The object of study at this stage is not the term but the method that produced it. Testing individual terms for mechanistic reality is phase 3 work. Before getting there, we need to know which elicitation approaches are worth feeding into that pipeline at all.

Two questions frame the methodology tests:

Does this method produce stable vocabulary? A generation approach that yields wildly different terms on re-runs, or that collapses onto the same handful of words regardless of input, is not useful as a hypothesis generator. We look for methods that are neither degenerate nor noisy.
Does this method converge with other methods? Cross-model and cross-condition agreement — the consensus pipeline behind the Test Dictionary — is used here as a check on the generator, not yet as a verdict on the terms. A method that surfaces a neighbourhood of concepts also reached by independently designed methods is earning trust as a generator.
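
The first question can be sketched as two numbers per method: how much re-runs on the same input agree with each other, and how much the vocabulary changes when the input changes. The code below is a minimal illustration under stated assumptions, not the project's test harness: the function names, the two cutoffs (0.2 and 0.8), and the placeholder term labels are hypothetical, and exact string matching between terms is a simplification.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise(sets):
    """Mean Jaccard similarity over all pairs of sets."""
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two sets to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def diagnose(runs_by_input, noise_below=0.2, collapse_above=0.8):
    """Classify a generation method from its re-runs.

    runs_by_input maps each input (e.g. a prompt id) to a list of term
    sets, one set per re-run. Both cutoffs are hypothetical defaults.
    """
    # Stability: agreement between re-runs on the same input.
    within = [mean_pairwise(runs) for runs in runs_by_input.values()]
    stability = sum(within) / len(within)

    # Input sensitivity: overlap between the pooled vocabularies of
    # different inputs. Near 1.0 means the input barely matters.
    pooled = [set().union(*runs) for runs in runs_by_input.values()]
    across = mean_pairwise(pooled)

    if stability < noise_below:
        verdict = "noise"
    elif across > collapse_above:
        verdict = "degenerate"
    else:
        verdict = "usable"
    return stability, across, verdict

# Invented placeholder data: two inputs, two re-runs each.
stable = {"p1": [{"t1", "t2", "t3"}, {"t1", "t2", "t4"}],
          "p2": [{"t5", "t6", "t7"}, {"t5", "t6", "t8"}]}
collapsed = {"p1": [{"t1", "t2"}, {"t1", "t2"}],
             "p2": [{"t1", "t2"}, {"t1", "t2"}]}
noisy = {"p1": [{"t1"}, {"t2"}],
         "p2": [{"t3"}, {"t4"}]}

print(diagnose(stable))     # (0.5, 0.0, 'usable')
print(diagnose(collapsed))  # (1.0, 1.0, 'degenerate')
print(diagnose(noisy))      # (0.0, 0.0, 'noise')
```

A method passes this check only if re-runs agree reasonably well and different inputs still lead to different vocabulary. The second question, convergence with other methods, is a separate comparison between methods rather than between re-runs of one method.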

The output of this phase is not “these terms are real.” It is “these elicitation approaches are worth the cost of phase 3 validation.”