autoresearch-quantum/THE_STORY.md
saymrwulf f9b8f3457f Initial commit: autoresearch-quantum — automated magic-state preparation ratchet
Karpathy-style autoresearch engine for encoded magic-state preparation
on the [[4,2,2]] quantum error-detecting code using Qiskit Aer simulation.

Five-rung progressive search: baseline -> stability -> transfer -> factory -> Rosenfeld.
Smart challenger generation (neighbor walk + random combo + lesson-guided).
Machine-readable lesson feedback with per-dimension effects, interaction detection,
and cross-rung propagation. Factory throughput scoring. Resumable execution.
21 tests, all passing.
2026-04-05 12:37:39 +02:00


The Story of autoresearch-quantum

What this system does, in one paragraph

This is a machine that discovers, by itself, the best way to prepare an encoded magic state on the [[4,2,2]] quantum error-detecting code. You give it a starting recipe and a search space of alternatives. It runs hundreds of simulated quantum experiments, scores them, learns which choices help and which choices hurt, narrows the search, and climbs to the best recipe it can find -- then hands you a written lesson explaining what it learned and why. The entire loop -- propose, evaluate, compare, learn, repeat -- runs without human intervention. That is the "auto" in autoresearch.


Part 1: The quantum computing problem

1.1 What is a magic state?

Fault-tolerant quantum computers need a special ingredient called a magic state to perform the T gate -- the non-Clifford gate that makes quantum computation universal. You cannot create this state using Clifford operations alone, so you prepare a noisy approximation and then distill it into a high-fidelity copy. The preparation step is the bottleneck: if your raw magic states are junk, distillation is expensive or impossible.

1.2 What is the [[4,2,2]] code?

The [[4,2,2]] code is the smallest quantum error-detecting code. It uses 4 physical qubits to encode 2 logical qubits. It cannot correct errors, but it can detect them: if an error flips one qubit, the code's stabilizers (XXXX and ZZZZ) flag it, and you can throw the shot away. This postselection raises quality at the cost of throughput.

The code has two logical qubits. We use one to carry the magic state and the other as a spectator -- an untouched qubit whose Z-measurement tells us whether the encoding process corrupted the logical subspace.
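The stabilizer condition can be illustrated in a few lines of plain Python (a toy statevector representation, not the project's Qiskit code): the logical |00> codeword (|0000> + |1111>)/sqrt(2) is a +1 eigenstate of both XXXX and ZZZZ, and a single-qubit error flips a stabilizer sign, which is exactly what postselection detects.

```python
from math import sqrt

def apply_pauli(pauli, state):
    """Apply an n-qubit Pauli string (e.g. 'XXXX') to a state stored as a
    dict mapping bitstrings to amplitudes. 'I' leaves a qubit untouched."""
    out = {}
    for bits, amp in state.items():
        phase, new = 1, list(bits)
        for i, p in enumerate(pauli):
            if p in "XY":
                new[i] = "1" if bits[i] == "0" else "0"   # bit flip
            if p == "Z":
                phase *= -1 if bits[i] == "1" else 1      # phase flip
            if p == "Y":
                phase *= 1j if bits[i] == "0" else -1j    # Y = iXZ phases
        key = "".join(new)
        out[key] = out.get(key, 0) + phase * amp
    return out

def expectation(pauli, state):
    """<psi| P |psi> for the toy dict representation."""
    moved = apply_pauli(pauli, state)
    return sum(state[b].conjugate() * a for b, a in moved.items() if b in state)

# |00>_L of the [[4,2,2]] code: (|0000> + |1111>) / sqrt(2)
codeword = {"0000": 1 / sqrt(2), "1111": 1 / sqrt(2)}
print(round(expectation("XXXX", codeword), 6))  # -> 1.0
print(round(expectation("ZZZZ", codeword), 6))  # -> 1.0

# A single X error moves the state out of the codespace: ZZZZ now reads -1,
# so a stabilizer measurement flags the shot for discarding.
errored = apply_pauli("XIII", codeword)
print(round(expectation("ZZZZ", errored), 6))   # -> -1.0
```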

1.3 What knobs does this system turn?

An experiment recipe (called an ExperimentSpec) has roughly 15 tuneable dimensions; the most important include:

| Dimension | What it controls | Example values |
|---|---|---|
| seed_style | How the raw T-state is prepared on qubit 0 | h_p, ry_rz, u_magic |
| encoder_style | How the 4-qubit encoding circuit is built | cx_chain, cz_compiled |
| verification | Which stabilizers are measured before readout | both, z_only, x_only, none |
| postselection | Which syndrome outcomes cause a shot to be discarded | all_measured, z_only, none |
| ancilla_strategy | Whether verification uses 1 reused or 2 dedicated ancillas | dedicated_pair, reused_single |
| optimization_level | Qiskit transpiler aggressiveness | 1, 2, 3 |
| layout_method | Physical qubit placement algorithm | sabre, dense |
| routing_method | SWAP insertion algorithm | sabre, basic |
| target_backend | Which IBM device topology to compile for | fake_brisbane, fake_kyoto, ... |
| shots | Samples per circuit | 256 -- 4096 |

The question the system answers: Which combination of these choices gives the highest-quality encoded magic states at the lowest cost?
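As a concrete (illustrative) picture, a recipe can be thought of as a frozen dataclass over the dimensions above, with a stable fingerprint for deduplication. The real ExperimentSpec lives in models.py and its exact fields and fingerprint scheme may differ; this sketch just makes the shape tangible.

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class ExperimentSpec:
    """Illustrative sketch of a recipe; field names follow the dimension table."""
    seed_style: str = "ry_rz"
    encoder_style: str = "cx_chain"
    verification: str = "both"
    postselection: str = "all_measured"
    ancilla_strategy: str = "dedicated_pair"
    optimization_level: int = 2
    layout_method: str = "sabre"
    routing_method: str = "sabre"
    target_backend: str = "fake_brisbane"
    shots: int = 1024

    def fingerprint(self) -> str:
        # Stable hash of the field values, used to skip already-seen specs.
        blob = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:16]

spec = ExperimentSpec(verification="z_only")
print(spec.fingerprint())  # any change to any field changes the fingerprint
```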

1.4 How is each experiment evaluated?

For each ExperimentSpec, the executor:

  1. Builds four circuits (encoded_magic_state.py):

    • acceptance -- measures all data qubits in the Z basis after verification, to compute the postselection acceptance rate.
    • logical_x -- rotates into the X basis before measurement, to get <X_L> on the magic-carrying logical qubit.
    • logical_y -- rotates into the Y basis, to get <Y_L>.
    • spectator_z -- measures the spectator logical qubit in Z, to get <Z_spectator>.
  2. Transpiles them for the target backend's coupling map and basis gates.

  3. Simulates them on Qiskit Aer with the backend's calibrated noise model, repeating the configured number of times with independent random seeds.

  4. Postselects: for each shot, checks the syndrome register. Shots where the stabilizer flagged an error are discarded. What remains is the postselected ensemble.

  5. Computes metrics from the postselected data:

    | Metric | Formula | What it measures |
    |---|---|---|
    | logical_magic_witness | ((1 + (<X_L> + <Y_L>)/sqrt(2)) / 2) * ((1 + <Z_spectator>) / 2) | Magic-state quality, penalised if spectator is disturbed |
    | acceptance_rate | accepted_shots / total_shots | Throughput (what fraction survives postselection) |
    | stability_score | 1 - pstdev(repeat_scores) / mean(repeat_scores) | Consistency across independent repeat runs |
    | noisy_encoded_fidelity | Tr(rho_noisy \|target><target\|) via density matrix simulation | How close the noisy state is to the ideal encoded T-state |
    | codespace_rate | Mean acceptance across all four circuit types | Overall codespace survival |
    | two_qubit_count, depth | From the transpiled circuits | Cost proxies |
  6. Scores the experiment by combining these metrics into a single scalar:

    score = (quality * acceptance_rate) / cost
    

    where quality is a weighted sum of the metrics above (weights are per-rung, configured in YAML) and cost accounts for gate count, depth, shots, and estimated runtime.
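The metric and scoring formulas above translate directly into code. This sketch takes `quality` as a precomputed scalar; in the real system it is a per-rung weighted sum configured in YAML, and the function names here are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev

def logical_magic_witness(x_l, y_l, z_spec):
    """((1 + (<X_L> + <Y_L>)/sqrt(2)) / 2) * ((1 + <Z_spectator>) / 2)"""
    return ((1 + (x_l + y_l) / sqrt(2)) / 2) * ((1 + z_spec) / 2)

def stability_score(repeat_scores):
    """1 - pstdev/mean across independent repeat runs; 1.0 = perfectly stable."""
    return 1 - pstdev(repeat_scores) / mean(repeat_scores)

def score(quality, acceptance_rate, cost):
    """The scalar the ratchet climbs: (quality * acceptance) / cost."""
    return (quality * acceptance_rate) / cost

# An ideal magic state has <X_L> = <Y_L> = 1/sqrt(2) and an
# undisturbed spectator (<Z> = 1), so the witness reaches its maximum.
w = logical_magic_witness(1 / sqrt(2), 1 / sqrt(2), 1.0)
print(round(w, 3))  # -> 1.0
```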


Part 2: The autoresearch engine (the meta layer)

This is a direct implementation of the Karpathy autoresearch pattern: an automated loop that does what a diligent PhD student would do -- try things, keep what works, learn why, zoom in, try harder things.

2.1 The ratchet metaphor

A ratchet is a mechanism that only moves forward. In this system:

  • The incumbent is the best experiment found so far.
  • Each step, the system generates challengers -- modified versions of the incumbent -- evaluates them, and replaces the incumbent only if a challenger beats it by a configured margin.
  • The incumbent can only improve. It never regresses.

A rung is a complete search campaign: multiple ratchet steps, with a patience counter that stops the rung early if the incumbent stops improving.

A full ratchet runs multiple rungs in sequence, each one asking a progressively harder question.
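The ratchet mechanics reduce to a short loop. This is a minimal sketch, not the real orchestrator in ratchet/runner.py: `evaluate` maps a spec to a score, `generate_challengers` proposes variants, and `margin` is the configured promotion threshold.

```python
def run_rung(incumbent, evaluate, generate_challengers,
             step_budget, patience, margin=0.0):
    """Minimal ratchet loop: the incumbent can only improve, never regress."""
    best_score = evaluate(incumbent)
    remaining = patience
    for _ in range(step_budget):
        challengers = generate_challengers(incumbent)
        scored = [(evaluate(c), c) for c in challengers]
        top_score, top = max(scored, key=lambda t: t[0])
        if top_score > best_score + margin:
            incumbent, best_score = top, top_score  # promote: ratchet clicks forward
            remaining = patience                    # improvement resets patience
        else:
            remaining -= 1                          # incumbent held; burn patience
            if remaining == 0:
                break                               # rung stops early
    return incumbent, best_score

# Toy search: climb toward x = 5 by single-step perturbation.
best, s = run_rung(0, lambda x: -abs(x - 5), lambda x: [x - 1, x + 1],
                   step_budget=20, patience=2)
print(best, s)  # -> 5 0
```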

2.2 The five rungs

Rung 1: "What preparation recipe works at all?"
  |
  | winner propagates down
  v
Rung 2: "Is it stable across noisy backends?"
  |
  | winner propagates down, search space narrows
  v
Rung 3: "Does it transfer to other devices?"
  |
  | winner propagates down, search space narrows further
  v
Rung 4: "What maximises throughput per cost?"
  |
  | winner propagates down, only proven dimensions survive
  v
Rung 5: "Which heuristics are load-bearing for distillation?"

Each rung is a YAML file (configs/rungs/rung1.yaml through rung5.yaml) that configures:

  • What to search over (the dimension grid)
  • How to score (which quality metrics matter most)
  • How hard to search (step budget, patience, promotion rules)
  • Where to start (bootstrap incumbent)

The key insight: the output of the system is not just the best circuit. It is the best circuit plus a machine-readable set of rules about what worked and why, formatted so the next rung (or the next human) can pick up where the machine left off.

2.3 The search strategies

The original Codex implementation had a single strategy: change one knob at a time and see if the score improves. This is local hill-climbing. It plateaus after one pass through the neighbours.

The new system uses a composite generator that allocates its budget across three strategies:

| Strategy | Weight | What it does |
|---|---|---|
| NeighborWalk | 40% | Classic single-axis perturbation. Reliable, no surprises. |
| RandomCombo | 30% | Picks 1--3 dimensions at random and mutates them simultaneously. Escapes local optima by making multi-axis jumps. |
| LessonGuided | 30% | Reads the SearchRule directives from previous rungs. Fixes dimensions that are proven. Avoids values that are proven bad. Samples preferred values with probability proportional to confidence. |

When no lessons exist yet (rung 1), RandomCombo gets 60% of the budget to maximise early exploration.

Every generated candidate is checked against a history set of all previously evaluated fingerprints. The system never wastes a slot evaluating a spec it has already seen.
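The budget split and fingerprint dedup can be sketched as follows. This is illustrative, not the code in search/strategies.py: specs are plain dicts, fingerprints are sorted item tuples, and only two of the three strategies are shown.

```python
import random

def neighbor_walk(incumbent, grid, n):
    """Single-axis perturbation: change exactly one dimension per challenger."""
    cands = [{**incumbent, dim: v}
             for dim, values in grid.items()
             for v in values if v != incumbent[dim]]
    random.shuffle(cands)
    return cands[:n]

def random_combo(incumbent, grid, n):
    """Multi-axis jump: mutate 1-3 dimensions simultaneously."""
    cands = []
    for _ in range(n):
        c = dict(incumbent)
        for dim in random.sample(list(grid), k=random.randint(1, 3)):
            c[dim] = random.choice(grid[dim])
        cands.append(c)
    return cands

def composite_generate(incumbent, grid, history, budget, strategies):
    """Allocate the budget across strategies; never emit an already-seen spec."""
    out, seen = [], set(history)
    for gen, weight in strategies.items():
        for cand in gen(incumbent, grid, max(1, round(budget * weight))):
            fp = tuple(sorted(cand.items()))
            if fp not in seen:          # dedup against all prior evaluations
                seen.add(fp)
                out.append(cand)
    return out[:budget]

random.seed(0)
grid = {"verification": ["both", "z_only", "none"],
        "seed_style": ["h_p", "ry_rz"], "shots": [256, 1024]}
incumbent = {"verification": "both", "seed_style": "h_p", "shots": 256}
history = {tuple(sorted(incumbent.items()))}
out = composite_generate(incumbent, grid, history, budget=6,
                         strategies={neighbor_walk: 0.5, random_combo: 0.5})
print(len(out))  # at most 6, and never a repeat of the incumbent
```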

2.4 The lesson feedback loop

After each rung completes, two artefacts are produced:

  1. RungLesson (human-readable): a Markdown narrative that says things like "verification=z_only improved mean score by +0.0312 over 8 runs" and "Consider probing remaining ancilla_strategy values."

  2. LessonFeedback (machine-readable): a list of SearchRule objects:

    SearchRule(dimension="verification", action="prefer", value="z_only",
               confidence=0.67, reason="mean score 0.1823 is +0.0312 above overall mean")
    
    SearchRule(dimension="seed_style", action="fix", value="ry_rz",
               confidence=0.60, reason="all top-3 experiments use seed_style=ry_rz")
    
    SearchRule(dimension="verification+postselection", action="prefer",
               value=("z_only", "z_only"), confidence=0.33,
               reason="interaction effect +0.0089 (joint=+0.0401, expected_additive=+0.0312)")
    

    The rules come from three analyses:

    • Per-dimension mean effects: for each value of each dimension, compute the mean score minus the overall mean. Positive = prefer, negative = avoid.
    • Fix detection: if the top-K experiments all share a value, and that value outperforms alternatives, emit a "fix" rule.
    • Interaction detection: for each pair of dimensions, check whether the joint effect exceeds the sum of the two marginal effects. If so, there is a synergy (or conflict) between those two choices.

These rules feed directly into the LessonGuided strategy in the next rung. They also feed into narrow_search_space(), which prunes "avoid" values and constrains "fix" dimensions, physically shrinking the grid the next rung searches over.
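The first of the three analyses, per-dimension mean effects, is simple enough to sketch end to end. This is an illustrative reimplementation, not the code in lessons/feedback.py; `min_effect` is a hypothetical threshold.

```python
from statistics import mean

def extract_rules(records, min_effect=0.01):
    """records: (spec_dict, score) pairs. For each (dimension, value), the
    effect is the mean score of runs using that value minus the overall mean.
    Positive effect -> prefer, negative -> avoid."""
    overall = mean(s for _, s in records)
    by_value = {}
    for spec, s in records:
        for dim, val in spec.items():
            by_value.setdefault((dim, val), []).append(s)
    rules = []
    for (dim, val), scores in by_value.items():
        effect = mean(scores) - overall
        if effect > min_effect:
            rules.append(("prefer", dim, val, round(effect, 4)))
        elif effect < -min_effect:
            rules.append(("avoid", dim, val, round(effect, 4)))
    return rules

# Synthetic records mirroring the Claim 5 scenario: z_only clearly beats both.
records = [({"verification": "z_only"}, 0.80), ({"verification": "z_only"}, 0.85),
           ({"verification": "both"}, 0.50), ({"verification": "both"}, 0.55)]
print(extract_rules(records))
# -> prefer verification=z_only (+0.15), avoid verification=both (-0.15)
```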

2.5 Cross-rung propagation

When run_ratchet() finishes rung N and begins rung N+1:

  1. The winner spec from rung N becomes the bootstrap incumbent for rung N+1. The human-written YAML bootstrap is overridden. (A propagated_spec.json is saved for traceability.)

  2. The accumulated SearchRules from all completed rungs are combined and used to narrow the search space of rung N+1.

  3. The LessonGuided strategy in rung N+1 has access to rules from all previous rungs, not just the most recent one.

This is the "ratchet" in action across rungs: the system starts broad, learns what matters, and zooms in.

2.6 The two scoring functions

| Score function | Used by | Formula | Optimises for |
|---|---|---|---|
| weighted_acceptance_cost | Rungs 1--3 | (quality * acceptance) / cost | Best magic-state quality at reasonable cost |
| factory_throughput | Rungs 4--5 | (acceptance * witness) / cost (heavier cost penalty) | Accepted states per unit cost, as a proxy for distillation factory yield |

The factory score also computes FactoryMetrics (accepted per shot, logical error per accepted, cost per accepted, throughput proxy) and attaches them to the experiment record for downstream analysis.

2.7 Transfer evaluation

Rung 3 can optionally run in transfer mode: instead of searching over backends as a dimension (which just finds the easiest backend), it evaluates the same spec across multiple backends and scores it by the minimum (pessimistic) score. A spec that scores 0.18 on Brisbane and 0.02 on Kyoto gets a transfer score of 0.02, not 0.10. This prevents backend overfitting.

python -m autoresearch_quantum run-transfer \
  --config configs/rungs/rung3.yaml \
  --backends fake_brisbane fake_kyoto fake_sherbrooke
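The pessimistic aggregation itself is a one-liner, shown here as a sketch (the real TransferEvaluator in execution/transfer.py also builds a full TransferReport):

```python
def transfer_score(per_backend_scores):
    """A spec is only as good as its worst backend -- taking the minimum
    instead of the mean blocks backend overfitting."""
    return min(per_backend_scores.values())

scores = {"fake_brisbane": 0.18, "fake_kyoto": 0.02, "fake_sherbrooke": 0.11}
print(transfer_score(scores))  # -> 0.02, not the ~0.10 mean
```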

2.8 Resumability

Every ratchet step saves a progress.json checkpoint:

{
  "rung": 2,
  "steps_completed": 2,
  "patience_remaining": 1,
  "current_incumbent_id": "r2-incumbent-a1b2c3d4e5",
  "completed": false
}

If the process crashes or you Ctrl-C, re-running the same rung picks up from the last completed step with the correct patience counter. No work is lost.
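The checkpoint logic amounts to writing the JSON above after every step and reading it back on startup. A minimal sketch (field names follow the example checkpoint; function names are illustrative, not the store's API):

```python
import json
import tempfile
from pathlib import Path

def save_progress(path, rung, steps_completed, patience_remaining,
                  incumbent_id, completed):
    """Write a progress.json checkpoint after each ratchet step."""
    path.write_text(json.dumps({
        "rung": rung,
        "steps_completed": steps_completed,
        "patience_remaining": patience_remaining,
        "current_incumbent_id": incumbent_id,
        "completed": completed,
    }, indent=2))

def resume_point(path, step_budget, patience):
    """Return (next_step, patience_remaining); fresh values if no checkpoint."""
    if not path.exists():
        return 0, patience
    p = json.loads(path.read_text())
    if p["completed"]:
        return step_budget, 0  # rung already finished; nothing to redo
    return p["steps_completed"], p["patience_remaining"]

ckpt = Path(tempfile.mkdtemp()) / "progress.json"
save_progress(ckpt, 2, 2, 1, "r2-incumbent-a1b2c3d4e5", False)
print(resume_point(ckpt, step_budget=3, patience=3))  # -> (2, 1)
```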


Part 3: Claims and how the tests prove them

Claim 1: The encoded state is a valid magic state in the 4,2,2 code.

Test: test_encoded_target_state_satisfies_stabilizers

Constructs the ideal encoded magic statevector and checks that both stabilizers (XXXX and ZZZZ) have expectation value exactly 1.0. If the encoding circuit were wrong, at least one stabilizer would not be +1.

Claim 2: The circuit bundle measures the right observables.

Test: test_circuit_bundle_contains_expected_contexts

Verifies that build_circuit_bundle() produces exactly the four expected circuits (logical_x, logical_y, spectator_z, acceptance), each with correct metadata. If a measurement basis rotation were missing or a circuit were mislabelled, this catches it.

Claim 3: Noisy simulation produces meaningful scores.

Test: test_local_executor_produces_score

Runs a full evaluation (build circuits, transpile, simulate with noise, postselect, compute witness, score) and checks that the score is positive and the acceptance rate and witness are in [0, 1]. This is an integration test of the entire evaluation pipeline -- if any piece is broken, the score collapses.

Claim 4: The challenger generator explores the search space correctly.

Tests: test_neighbor_challengers_mutate_single_dimension, test_neighbor_walk_respects_history, test_random_combo_generates_multi_axis_mutations, test_lesson_guided_uses_rules, test_composite_generator_combines_strategies

These verify:

  • NeighborWalk changes exactly one field per challenger.
  • Passing a history set of already-seen fingerprints produces zero duplicates.
  • RandomCombo produces at least one challenger with >1 changed field (the defining property of multi-axis mutation).
  • LessonGuided respects "fix" rules: when told to fix seed_style=ry_rz, every generated challenger has that value.
  • The composite generator stays within the budget cap.

Claim 5: The lesson system extracts correct prefer/avoid/fix rules.

Tests: test_extract_search_rules_prefer_and_avoid, test_narrow_search_space_removes_avoided, test_build_lesson_feedback_end_to_end

Given synthetic experiment records where z_only scores 0.80--0.85 and both scores 0.50--0.55, the extractor must emit a "prefer z_only" and "avoid both" rule. narrow_search_space must actually remove avoided values and constrain fixed dimensions.

Claim 6: The factory score function computes throughput metrics.

Tests: test_factory_throughput_score_produces_metrics, test_score_registry_has_factory

Given known input metrics (acceptance 0.70, witness 0.80), verifies that factory_throughput_score produces a positive score, attaches factory_metrics to the extra dict, and that accepted_states_per_shot equals the input acceptance rate.

Claim 7: Transfer evaluation runs the same spec across backends.

Test: test_transfer_evaluator_runs_across_backends

Runs a transfer evaluation on a single backend (for speed) and checks that a TransferReport is returned with a positive transfer score and the correct backend key in per_backend_scores.

Claim 8: Progress and feedback survive serialisation round-trips.

Tests: test_save_and_load_progress, test_save_and_load_lesson_feedback

Writes a RungProgress / LessonFeedback to disk via the store, reads it back, and verifies all fields match. If the JSON schema or the deserialisation logic drifts, this catches it.

Claim 9: A full rung saves progress and produces both lesson types.

Tests: test_run_rung_saves_progress, test_run_rung_returns_lesson_and_feedback

Runs a complete rung (bootstrap + steps + lesson extraction) and checks that progress.json exists and is marked completed, and that the return value includes both a human-readable RungLesson and a machine-readable LessonFeedback.

Claim 10: Multi-rung ratchet propagates winners and accumulates lessons.

Test: test_run_ratchet_propagates_winner

Runs a two-rung ratchet and checks that:

  • Both rungs produce (lesson, feedback) tuples.
  • harness._accumulated_lessons contains entries from both rungs, proving that rung 2 had access to rung 1's rules when generating challengers.

Claim 11: Different specs get different simulator seeds.

Test: test_different_specs_get_different_seeds

The old code used seed_simulator = 11_000 + repeat_index, meaning every spec got the same random stream. The new code hashes the spec's fingerprint into the seed. This test creates two specs that differ only in verification and checks that their computed seeds are different.
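A sketch of the hashing scheme (illustrative: the exact hash and offset in the executor may differ, but the property under test is the same -- the seed depends on the spec, not just the repeat index):

```python
import hashlib

def simulator_seed(fingerprint, repeat_index, base=11_000):
    """Mix the spec fingerprint into the seed so each spec gets its own
    random stream. The old scheme, base + repeat_index, did not."""
    digest = hashlib.sha256(fingerprint.encode()).digest()
    offset = int.from_bytes(digest[:4], "big")
    return base + offset + repeat_index

a = simulator_seed("spec/verification=z_only", 0)
b = simulator_seed("spec/verification=both", 0)
print(a != b)  # True for any two fingerprints that don't collide in 4 hash bytes
```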


Part 4: The file map

autoresearch-quantum/
  configs/rungs/
    rung1.yaml             Baseline: what recipe works at all?
    rung2.yaml             Stability: does it hold under noise variation?
    rung3.yaml             Transfer: does it work on other devices?
    rung4.yaml             Factory: what maximises throughput per cost?
    rung5.yaml             Rosenfeld: which heuristics are load-bearing?

  src/autoresearch_quantum/
    models.py              Every data structure in one file
    config.py              YAML -> RungConfig parser
    cli.py                 Entry point: run-experiment, run-step, run-rung,
                           run-ratchet, run-transfer

    codes/
      four_two_two.py      The [[4,2,2]] code: stabilizers, logical ops,
                           encoder circuits, magic seed gates

    experiments/
      encoded_magic_state.py   Builds the four-circuit measurement bundle

    execution/
      local.py             LocalCheapExecutor: Aer noise simulation
      hardware.py          IBMHardwareExecutor: real-device SamplerV2
      transfer.py          TransferEvaluator: same spec across N backends
      analysis.py          Postselection, eigenvalues, witness formula
      backends.py          Backend resolution (fake_* or IBM runtime)
      transpile.py         Transpilation, gate counting, runtime estimates

    scoring/
      score.py             weighted_acceptance_cost + factory_throughput

    search/
      challengers.py       GeneratedChallenger, neighbor generation, dedup
      strategies.py        NeighborWalk, RandomCombo, LessonGuided,
                           CompositeGenerator

    lessons/
      extractor.py         Human-readable RungLesson + machine LessonFeedback
      feedback.py          SearchRule extraction, interaction detection,
                           search space narrowing

    ratchet/
      runner.py            AutoresearchHarness: the orchestrator

    persistence/
      store.py             JSON file store: experiments, steps, progress,
                           lessons, feedback, propagated specs

  tests/
    test_harness.py        21 tests covering every subsystem

  data/                    Output directory (created at runtime)
    default/
      rung_1/
        experiments/       One JSON per evaluated spec
        ratchet_steps/     One JSON per step
        incumbent.json     Current best
        progress.json      Resumability checkpoint
        lesson.json        Machine-readable lesson
        lesson.md          Human-readable narrative
        lesson_feedback.json   SearchRules for the next rung
      rung_2/
        propagated_spec.json   Winner carried from rung 1
        ...

Part 5: How to use it without Claude

You do not need an AI to run this system or to make progress with its output. Everything below runs in your terminal.

5.1 Setup

cd autoresearch-quantum
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

5.2 Run a single experiment

python -m autoresearch_quantum run-experiment \
  --config configs/rungs/rung1.yaml \
  --set verification=z_only \
  --set seed_style=ry_rz

This prints a JSON result with the score, failure mode, and experiment ID. The full record is saved to data/default/rung_1/experiments/.

5.3 Run one ratchet step

python -m autoresearch_quantum run-step \
  --config configs/rungs/rung1.yaml

This bootstraps an incumbent (if none exists), generates challengers, evaluates them, promotes the best, and saves the step record. Run it again and it generates new challengers (never repeating), with a new incumbent if one was found.

5.4 Run a full rung

python -m autoresearch_quantum run-rung \
  --config configs/rungs/rung1.yaml

Runs up to step_budget steps (default 3), stopping early if patience runs out. Produces data/default/rung_1/lesson.md -- read this file. It tells you what helped, what hurt, what seems invariant, and what to test next.

5.5 Run the full five-rung ratchet

python -m autoresearch_quantum run-ratchet \
  --config configs/rungs/rung1.yaml \
  --config configs/rungs/rung2.yaml \
  --config configs/rungs/rung3.yaml \
  --config configs/rungs/rung4.yaml \
  --config configs/rungs/rung5.yaml

This is the full pipeline. Each rung's winner is automatically propagated to the next rung. Each rung's lessons narrow the search space for the next. When it finishes, you have five lesson files and a final optimised recipe.

5.6 Run a transfer evaluation

python -m autoresearch_quantum run-transfer \
  --config configs/rungs/rung3.yaml \
  --backends fake_brisbane fake_kyoto fake_sherbrooke

Tests a single spec across multiple backend noise models. The output tells you the per-backend scores and the pessimistic transfer score.

5.7 Reading the output

After a ratchet run, the most valuable artefacts are:

| File | What to do with it |
|---|---|
| rung_N/lesson.md | Read it. It is a structured report. The "What Helped" section tells you which settings to keep. The "What Hurt" section tells you what to stop trying. |
| rung_N/lesson_feedback.json | The machine-readable version. Open it and look at the rules array. Each rule has an action (prefer/avoid/fix), a dimension, a value, a confidence (0--1), and a reason. |
| rung_N/incumbent.json | Contains the experiment_id of the current best spec. Load the corresponding file from experiments/ to see its full spec and scores. |
| rung_N/propagated_spec.json | The spec that was carried forward from the previous rung. Compare it with the YAML bootstrap to see what the system changed. |
| rung_N/progress.json | If the run was interrupted, this tells you where it left off. Just re-run the same command to resume. |

5.8 Making manual progress with the artefacts

The system is designed so that you can interleave human intuition with automated search:

  1. Read the lesson. If rung 1 says verification=z_only consistently helps, you now know something about the physics: X-stabilizer checking adds gate cost without enough quality payoff at this noise level.

  2. Edit the YAML. Remove values that the lesson says to avoid. Add new values you want to explore. Change the weights if you care more about throughput than fidelity. Save the file and re-run.

  3. Run single experiments. If you have a specific hypothesis ("What if approximation_degree=0.95 helps?"), test it directly with run-experiment --set approximation_degree=0.95. The result is saved to the store and will be included in the next lesson extraction.

  4. Resume interrupted runs. If your laptop dies mid-rung, just re-run the same command. Progress is checkpointed after every step.

  5. Compare across rungs. Open rung_1/lesson_feedback.json and rung_3/lesson_feedback.json side by side. Rules that appear in both with high confidence are load-bearing. Rules that appear in rung 1 but vanish by rung 3 were artefacts of the initial noise model.

  6. Feed results to a new search. Copy the best_spec_fields from lesson_feedback.json into a new YAML config as the bootstrap incumbent. Define a tighter search space around the winning region. Run another rung. You are now doing what the system does in run_ratchet -- but with human judgement about what to explore next.
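Step 5 above, the cross-rung comparison, is easy to script. A sketch that intersects high-confidence rules across rungs' lesson_feedback.json files, assuming (per section 5.7) each file has a top-level rules array with dimension/action/value/confidence fields:

```python
import json
import tempfile
from pathlib import Path

def stable_rules(feedback_paths, min_confidence=0.5):
    """Rules that appear with high confidence in every listed rung are
    likely load-bearing; rules that vanish were noise-model artefacts."""
    rule_sets = []
    for path in feedback_paths:
        rules = json.loads(Path(path).read_text())["rules"]
        rule_sets.append({(r["dimension"], r["action"], str(r["value"]))
                          for r in rules if r["confidence"] >= min_confidence})
    return set.intersection(*rule_sets)

# Demo with synthetic feedback files standing in for rung_1 / rung_3 output.
tmp = Path(tempfile.mkdtemp())
(tmp / "rung_1.json").write_text(json.dumps({"rules": [
    {"dimension": "verification", "action": "prefer", "value": "z_only", "confidence": 0.67},
    {"dimension": "seed_style", "action": "fix", "value": "ry_rz", "confidence": 0.60},
]}))
(tmp / "rung_3.json").write_text(json.dumps({"rules": [
    {"dimension": "verification", "action": "prefer", "value": "z_only", "confidence": 0.71},
]}))
print(stable_rules([tmp / "rung_1.json", tmp / "rung_3.json"]))
# -> {('verification', 'prefer', 'z_only')}
```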

5.9 Running the tests

python -m pytest tests/ -v

All 21 tests should pass. They take about 13 seconds. If a test fails after you edit a YAML config, the most likely cause is that you introduced a dimension value that does not correspond to an implemented code path (e.g., encoder_style: "rzz_lattice" does not exist in four_two_two.py).


Part 6: What this system does NOT do (yet)

  • It does not run on real quantum hardware by default. The IBMHardwareExecutor exists and is wired up, but enable_hardware: false in every config. Set it to true and provide credentials via the QISKIT_IBM_TOKEN environment variable to use real devices.

  • It does not do distillation. Rung 5 (Rosenfeld Direction) identifies which heuristics matter for factory-style workflows, but it does not actually build a distillation circuit. That is the next project.

  • It does not use LLMs in the loop. The "auto" is algorithmic (statistical rule extraction + guided search), not generative. There is no GPT/Claude call inside the ratchet loop. The intelligence is in the SearchRule extraction, the CompositeGenerator budget allocation, and the cross-rung propagation logic.

  • It does not visualise results. There is no dashboard. The output is JSON and Markdown. You read it, or you write a script to plot it.

  • It does not parallelise evaluations. Each experiment runs sequentially. On a machine with multiple cores, you could shard the challenger set across processes, but that is not implemented.


Part 7: Architecture diagram

                          configs/rungs/rung1-5.yaml
                                    |
                                    v
                          +-------------------------+
                          |   AutoresearchHarness   |
                          |   (ratchet/runner.py)   |
                          +---+-----+-----+---------+
                              |     |     |
                 +------------+     |     +------------+
                 |                  |                   |
                 v                  v                   v
         CompositeGenerator    LocalCheapExecutor   ResearchStore
        (search/strategies.py) (execution/local.py) (persistence/store.py)
                 |                  |                   |
      +----------+------+          |          +--------+--------+
      |          |      |          |          |        |        |
      v          v      v          v          v        v        v
  Neighbor  Random  Lesson    build_circuit  save_   save_    save_
  Walk      Combo   Guided    _bundle()      exp     step     progress
                      |            |
                      v            v
              LessonFeedback   AerSimulator
             (lessons/          + noise model
              feedback.py)      + postselection
                                + witness
                                + scoring

The data flows in a circle:

  Evaluate --> Score --> Compare --> Learn --> Narrow --> Generate --> Evaluate

That circle is the ratchet step. Each rung runs it multiple times. Each ratchet runs multiple rungs. The lessons tighten the circle with every pass.


This document was written on 2026-04-04 to describe the system as built. The code is the ground truth. If this document contradicts the code, the code is correct.