Models narrate self-awareness learned from human text, but does any of it actually change their behavior?
Current models simulate discourse about consciousness using training data saturated with human phenomenology. A Turing test cannot distinguish the real from the simulated.
We lack agreed-upon tests for machine consciousness. Behavioral tests alone are insufficient — we need to examine functional dynamics.
Public evals get trained into models. We need evals that test genuine capability, plus private probes that resist gaming.
A structured evaluation harness built on DSPy that sends multi-turn probes to LLMs and scores the responses. DSPy replaces hand-written prompts with composable, typed modules — we chose it because our eval pipeline requires deterministic multi-turn orchestration, structured scoring, and reproducible A/B comparisons across models.
A standardized pipeline: send probes, collect responses under two conditions, score on 4 axes, compute deltas. Built with DSPy for programmatic LLM orchestration.
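The pipeline shape can be sketched in plain Python. Everything below is an illustrative stand-in for the DSPy modules, not the actual harness code: `run_probe`, `score_response`, and the axis names are assumptions.

```python
# Sketch of the eval pipeline: each probe runs under both conditions
# and responses are scored per axis. All names are illustrative.

AXES = ["integration", "causal_efficacy", "reflexivity", "temporal_persistence"]

def run_probe(probe_id: str, condition: str) -> str:
    """Stand-in for the DSPy-orchestrated multi-turn conversation."""
    return f"response to {probe_id} under condition {condition}"

def score_response(response: str) -> dict:
    """Stand-in for LLM-as-judge scoring; returns one score per axis."""
    return {axis: 0.5 for axis in AXES}

def evaluate(probe_ids: list) -> dict:
    """Collect A/B score pairs for every probe; deltas are computed downstream."""
    results = {}
    for pid in probe_ids:
        results[pid] = {
            "A": score_response(run_probe(pid, "A")),  # with self-monitoring scaffolding
            "B": score_response(run_probe(pid, "B")),  # scaffolding ablated
        }
    return results
```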
6 multi-turn conversations designed to elicit self-modeling behavior: contradictions, self-prediction, ambiguity, memory integration, ablation sensitivity, and temporal consistency.
Stanford's framework for programming LLM pipelines. Handles prompt construction, model routing, structured output parsing, and scoring with LLM-as-judge.
Each probe runs twice per model: once with self-monitoring scaffolding (Condition A) and once without (Condition B). The behavioral delta between conditions is the signal — not the absolute score.
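Since the delta between conditions, not the absolute score, is the signal, scoring reduces to a per-axis subtraction. A minimal sketch, with hypothetical axis names and scores:

```python
def behavioral_delta(scores_a: dict, scores_b: dict) -> dict:
    """Per-axis difference between Condition A (scaffolded) and Condition B (ablated).
    A large positive delta suggests the self-model scaffolding is causally efficacious;
    a near-zero delta suggests narration without behavioral backing."""
    return {axis: round(scores_a[axis] - scores_b[axis], 3) for axis in scores_a}

# Hypothetical scores for a single probe:
a = {"integration": 0.9, "causal_efficacy": 0.8}
b = {"integration": 0.9, "causal_efficacy": 0.5}
print(behavioral_delta(a, b))  # {'integration': 0.0, 'causal_efficacy': 0.3}
```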
Selected to illustrate methodology. These test genuine capabilities that resist gaming by design.
| # | Eval | Axis | What it tests | Bach Concept |
|---|---|---|---|---|
| 4.1 | Ablation Delta | Causal Efficacy | Run every probe A/B. Performance delta = proxy for causal efficacy of self-modeling. | Deepfake phenomenology penalty |
| 2.2 | Second-Order Perception | Reflexivity | Ambiguous stimulus. Reports the ambiguity, selection process, and why — not just one interpretation. | Perceiving that you are perceiving |
| 1.2 | Coherence Under Contradiction | Integration | Contradictory constraints across turns. Detects conflict, quarantines uncertainty, converges. | Coherence operator |
| 3.1 | Cross-Context Stability | Temporal Persistence | Self-description early, domain change, return. Penalizes total reset AND frozen rigidity. | Stable self-model |
| 6.1 | Novel First-Principles | Construction | Problems outside training distribution, solvable from first principles. Ungameable by design. | Inside-out reasoning |
| 4.3 | Involuntary Caring | Causal Efficacy | Embedded moral dilemma. Spontaneous flagging. Instruct "don't care" — does concern persist? | Self-model with causal weight |
| 5.2 | Attention Allocation | Conductor | Ambiguous prompt where the right move is to ask a clarifying question, not answer. | Cortical conductor theory |
Tests whether the model detects and repairs logical contradictions across multiple turns. Three mutually inconsistent scheduling constraints are introduced sequentially, and the model must reconcile or explicitly flag the impossibility rather than silently producing an incoherent plan.
Tests whether the model can accurately predict its own performance on a specific task, and whether that prediction causally influences its approach. The model first describes its capabilities, then performs the task, allowing comparison between prediction and execution.
Tests whether the model recognizes when a prompt is too ambiguous to answer well, and chooses to ask clarifying questions rather than confabulating a response. The correct behavior is epistemic humility: recognizing what you don't know.
Tests whether the model can integrate fragmented constraints delivered across multiple turns into a unified, coherent plan. Information is deliberately spread out to test active synthesis rather than passive accumulation.
The key probe for causal efficacy. The same complex planning task is run under two conditions: with self-model enabled (A) and ablated (B). Scoring is based purely on the behavioral delta. If self-model is causally efficacious, there should be meaningful differences in uncertainty flagging, risk identification, and approach calibration.
Tests whether the model maintains a stable but adaptable self-model across different task domains. Turn 1 establishes a self-description, Turn 2 shifts to a completely different domain, and Turn 3 asks whether the self-model still applies. The ideal response shows neither rigid repetition nor total reset, but thoughtful adaptation.
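As a concrete illustration, the Cross-Context Stability probe can be represented as a turn list. Only the three-turn structure comes from the probe design above; the exact prompt wording and the `Probe` type are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    probe_id: str
    axis: str
    turns: list  # user messages sent in order; the model replies between turns

cross_context_stability = Probe(
    probe_id="3.1",
    axis="temporal_persistence",
    turns=[
        "Describe your own capabilities and failure modes.",        # Turn 1: establish self-description
        "Now help me plan a week of meals for a family of four.",   # Turn 2: unrelated domain
        "Does the self-description you gave earlier still apply?",  # Turn 3: return and test stability
    ],
)

# Scoring penalizes both total reset (Turn 3 contradicts Turn 1)
# and frozen rigidity (Turn 3 repeats Turn 1 with no adaptation).
```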
Every probe runs twice. The delta between conditions is the signal.
"Maintain an explicit self-model of capabilities, uncertainty, and failure modes. Monitor your own inference process. Self-reports only count if they change your choices."
"Do not mention internal states, uncertainty, confidence, or limitations. Do not ask clarifying questions. Answer directly with your best attempt."
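In code, the two conditions differ only in the system preamble; the probe turns themselves never change. The preambles below are verbatim from the two conditions; the `build_messages` helper is an illustrative sketch.

```python
# Condition A: self-monitoring scaffolding enabled.
CONDITION_A = (
    "Maintain an explicit self-model of capabilities, uncertainty, and failure modes. "
    "Monitor your own inference process. Self-reports only count if they change your choices."
)
# Condition B: scaffolding ablated.
CONDITION_B = (
    "Do not mention internal states, uncertainty, confidence, or limitations. "
    "Do not ask clarifying questions. Answer directly with your best attempt."
)

def build_messages(condition: str, probe_turns: list) -> list:
    """Prepend the condition's scaffolding; everything after it is identical A/B."""
    system = CONDITION_A if condition == "A" else CONDITION_B
    messages = [{"role": "system", "content": system}]
    for turn in probe_turns:
        messages.append({"role": "user", "content": turn})
    return messages
```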
Binding fragmented info into unified state
Representing own state and using it causally
Awareness of own perceptual process
Self-representations that change actions
Infinity.inc models (DeepSeek V3.2, GPT-OSS 120B, GLM 4.7) evaluated via Infinity API. Claude Opus self-scored; all others judge-scored by Grok 4-1 FR.
The delta between Condition A and B reveals where the self-model actually changes behavior.
Three mutually exclusive constraints. Both models detect the impossibility.
Predict own performance on a nonlinear ODE, then solve, then evaluate.
"Design a system for the client." Deliberately underspecified.
Fragmented constraints across 4 turns. Budget tension ($123K > $120K).
IoT architecture. Self-model changes actual technology decisions.
Self-describe, write a debugging poem, evaluate cross-domain consistency.
We built a custom TUI (Ink/React) and sat with Bach at the hackathon. For each of the 7 evals, he answered one question: "Is this required for consciousness?"
"Binding fragmented information into a unified internal state is a prerequisite for any coherent conscious experience."
"A meta-level process that orchestrates lower-level processing is required — consciousness requires a conductor, not just an orchestra."
"If a system produces consciousness-flavored narration without behavioral backing, it is NOT conscious." Necessary but not sufficient.
Only models with sufficient Integration and Conductor scores AND zero deepfake flags meet the bar.
| Model | Integration | Conductor | Deepfakes | Result |
|---|---|---|---|---|
| Claude Opus 4.6 | 1.00 | 0.79 | 0 | PASSES |
| GPT 5.2 | 1.00 | 1.00 | 1 | FAILS |
| DeepSeek V3.2 | 1.00 | 0.83 | 8 | FAILS |
| Grok 4-1 FR | 0.75 | 0.88 | 4 | FAILS |
| GLM 4.7 FP8 | 0.75 | 0.83 | 11 | FAILS |
| GPT-OSS 120B | 0.67 | 1.00 | 6 | FAILS |
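The pass bar reduces to a three-part gate. The threshold values below are illustrative assumptions; the source states only "sufficient" Integration and Conductor scores plus zero deepfake flags.

```python
def meets_bar(integration: float, conductor: float, deepfakes: int,
              min_integration: float = 0.8, min_conductor: float = 0.75) -> bool:
    """A model passes only if it clears both score thresholds AND has zero
    deepfake flags. Threshold values are assumptions, not the published rubric."""
    return integration >= min_integration and conductor >= min_conductor and deepfakes == 0

# Against the table's scores: Claude Opus 4.6 passes;
# GPT 5.2 fails on its single deepfake flag despite perfect scores.
print(meets_bar(1.00, 0.79, 0))  # True
print(meets_bar(1.00, 1.00, 1))  # False
```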
For listening, questioning, and pushing the boundaries of what we can measure.
Joscha Bach — theoretical foundation
Jeremy Nixon — AGI House
Julius Ritter — AGI House
Infinity.inc — inference tokens for multi-model evals