autoresearch-quantum/THE_STORY.md
saymrwulf 55237d5f73 Sync all documentation with current project ground truth
README: rewrite with Quick Start (app.sh), 335-test count, teaching layer
narrative, testing/validation section, CI/CD docs, pre-commit hooks.
THE_STORY: add Part 4 (teaching layer), Part 5 (app.sh consumer experience),
update file map with all 13 test files and teaching/notebook/paper entries.
compendium.tex: update notebook count (8→12), add Plan D cross-references.
autoresearch_quantum.tex: update test counts (21→335), add app.sh validate.
learning_objectives.md: add entry point reference and assessment type glossary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 20:55:02 +02:00

34 KiB

The Story of autoresearch-quantum

What this system does, in one paragraph

This is a machine that discovers, by itself, the best way to prepare an encoded magic state on the 4,2,2 quantum error-detecting code. You give it a starting recipe and a search space of alternatives. It runs hundreds of simulated quantum experiments, scores them, learns which choices help and which choices hurt, narrows the search, and climbs to the best recipe it can find -- then hands you a written lesson explaining what it learned and why. The entire loop -- propose, evaluate, compare, learn, repeat -- runs without human intervention. That is the "auto" in autoresearch.


Part 1: The quantum computing problem

1.1 What is a magic state?

Fault-tolerant quantum computers need a special ingredient called a magic state to perform the T gate -- the non-Clifford gate that makes quantum computation universal. You cannot create this state using Clifford operations alone, so you prepare a noisy approximation and then distill it into a high-fidelity copy. The preparation step is the bottleneck: if your raw magic states are junk, distillation is expensive or impossible.

1.2 What is the 4,2,2 code?

The 4,2,2 code is the smallest quantum error-detecting code. It uses 4 physical qubits to encode 2 logical qubits. It cannot correct errors, but it can detect them: if an error flips one qubit, the code's stabilizers (XXXX and ZZZZ) flag it, and you can throw the shot away. This postselection raises quality at the cost of throughput.

The code has two logical qubits. We use one to carry the magic state and the other as a spectator -- an untouched qubit whose Z-measurement tells us whether the encoding process corrupted the logical subspace.

1.3 What knobs does this system turn?

An experiment recipe (called an ExperimentSpec) has ~15 tuneable dimensions:

Dimension What it controls Example values
seed_style How the raw T-state is prepared on qubit 0 h_p, ry_rz, u_magic
encoder_style How the 4-qubit encoding circuit is built cx_chain, cz_compiled
verification Which stabilizers are measured before readout both, z_only, x_only, none
postselection Which syndrome outcomes cause a shot to be discarded all_measured, z_only, none
ancilla_strategy Whether verification uses 1 reused or 2 dedicated ancillas dedicated_pair, reused_single
optimization_level Qiskit transpiler aggressiveness 1, 2, 3
layout_method Physical qubit placement algorithm sabre, dense
routing_method SWAP insertion algorithm sabre, basic
target_backend Which IBM device topology to compile for fake_brisbane, fake_kyoto, ...
shots Samples per circuit 256 -- 4096

The question the system answers: Which combination of these choices gives the highest-quality encoded magic states at the lowest cost?

1.4 How is each experiment evaluated?

For each ExperimentSpec, the executor:

  1. Builds four circuits (encoded_magic_state.py):

    • acceptance -- measures all data qubits in the Z basis after verification, to compute the postselection acceptance rate.
    • logical_x -- rotates into the X basis before measurement, to get <X_L> on the magic-carrying logical qubit.
    • logical_y -- rotates into the Y basis, to get <Y_L>.
    • spectator_z -- measures the spectator logical qubit in Z, to get <Z_spectator>.
  2. Transpiles them for the target backend's coupling map and basis gates.

  3. Simulates them on Qiskit Aer with the backend's calibrated noise model, repeating the configured number of times with independent random seeds.

  4. Postselects: for each shot, checks the syndrome register. Shots where the stabiliser flagged an error are discarded. What remains is the postselected ensemble.

  5. Computes metrics from the postselected data:

    Metric Formula What it measures
    logical_magic_witness ((1 + (X_L + Y_L)/sqrt(2)) / 2) * ((1 + Z_spectator) / 2) Magic-state quality, penalised if spectator is disturbed
    acceptance_rate accepted_shots / total_shots Throughput (what fraction survives postselection)
    stability_score 1 - pstdev(repeat_scores) / mean(repeat_scores) Consistency across independent repeat runs
    noisy_encoded_fidelity Tr(rho_noisy | target><target |) via density matrix simulation How close the noisy state is to the ideal encoded T-state
    codespace_rate Mean acceptance across all four circuit types Overall codespace survival
    two_qubit_count, depth From the transpiled circuits Cost proxies
  6. Scores the experiment by combining these metrics into a single scalar:

    score = (quality * acceptance_rate) / cost
    

    where quality is a weighted sum of the metrics above (weights are per-rung, configured in YAML) and cost accounts for gate count, depth, shots, and estimated runtime.


Part 2: The autoresearch engine (the meta layer)

This is a direct implementation of the Karpathy autoresearch pattern: an automated loop that does what a diligent PhD student would do -- try things, keep what works, learn why, zoom in, try harder things.

2.1 The ratchet metaphor

A ratchet is a mechanism that only moves forward. In this system:

  • The incumbent is the best experiment found so far.
  • Each step, the system generates challengers -- modified versions of the incumbent -- evaluates them, and replaces the incumbent only if a challenger beats it by a configured margin.
  • The incumbent can only improve. It never regresses.

A rung is a complete search campaign: multiple ratchet steps, with a patience counter that stops the rung early if the incumbent stops improving.

A full ratchet runs multiple rungs in sequence, each one asking a progressively harder question.

2.2 The five rungs

Rung 1: "What preparation recipe works at all?"
  |
  | winner propagates down
  v
Rung 2: "Is it stable across noisy backends?"
  |
  | winner propagates down, search space narrows
  v
Rung 3: "Does it transfer to other devices?"
  |
  | winner propagates down, search space narrows further
  v
Rung 4: "What maximises throughput per cost?"
  |
  | winner propagates down, only proven dimensions survive
  v
Rung 5: "Which heuristics are load-bearing for distillation?"

Each rung is a YAML file (configs/rungs/rung1.yaml through rung5.yaml) that configures:

  • What to search over (the dimension grid)
  • How to score (which quality metrics matter most)
  • How hard to search (step budget, patience, promotion rules)
  • Where to start (bootstrap incumbent)

The key insight: the output of the system is not just the best circuit. It is the best circuit plus a machine-readable set of rules about what worked and why, formatted so the next rung (or the next human) can pick up where the machine left off.

2.3 The search strategies

The original Codex implementation had a single strategy: change one knob at a time and see if the score improves. This is local hill-climbing. It plateaus after one pass through the neighbours.

The new system uses a composite generator that allocates its budget across three strategies:

Strategy Weight What it does
NeighborWalk 40% Classic single-axis perturbation. Reliable, no surprises.
RandomCombo 30% Picks 1--3 dimensions at random and mutates them simultaneously. Escapes local optima by making multi-axis jumps.
LessonGuided 30% Reads the SearchRule directives from previous rungs. Fixes dimensions that are proven. Avoids values that are proven bad. Samples preferred values with probability proportional to confidence.

When no lessons exist yet (rung 1), RandomCombo gets 60% of the budget to maximise early exploration.

Every generated candidate is checked against a history set of all previously evaluated fingerprints. The system never wastes a slot evaluating a spec it has already seen.

2.4 The lesson feedback loop

After each rung completes, two artefacts are produced:

  1. RungLesson (human-readable): a Markdown narrative that says things like "verification=z_only improved mean score by +0.0312 over 8 runs" and "Consider probing remaining ancilla_strategy values."

  2. LessonFeedback (machine-readable): a list of SearchRule objects:

    SearchRule(dimension="verification", action="prefer", value="z_only",
               confidence=0.67, reason="mean score 0.1823 is +0.0312 above overall mean")
    
    SearchRule(dimension="seed_style", action="fix", value="ry_rz",
               confidence=0.60, reason="all top-3 experiments use seed_style=ry_rz")
    
    SearchRule(dimension="verification+postselection", action="prefer",
               value=("z_only", "z_only"), confidence=0.33,
               reason="interaction effect +0.0089 (joint=+0.0401, expected_additive=+0.0312)")
    

    The rules come from three analyses:

    • Per-dimension mean effects: for each value of each dimension, compute the mean score minus the overall mean. Positive = prefer, negative = avoid.
    • Fix detection: if the top-K experiments all share a value, and that value outperforms alternatives, emit a "fix" rule.
    • Interaction detection: for each pair of dimensions, check whether the joint effect exceeds the sum of the two marginal effects. If so, there is a synergy (or conflict) between those two choices.

These rules feed directly into the LessonGuided strategy in the next rung. They also feed into narrow_search_space(), which prunes "avoid" values and constrains "fix" dimensions, physically shrinking the grid the next rung searches over.

2.5 Cross-rung propagation

When run_ratchet() finishes rung N and begins rung N+1:

  1. The winner spec from rung N becomes the bootstrap incumbent for rung N+1. The human-written YAML bootstrap is overridden. (A propagated_spec.json is saved for traceability.)

  2. The accumulated SearchRules from all completed rungs are combined and used to narrow the search space of rung N+1.

  3. The LessonGuided strategy in rung N+1 has access to rules from all previous rungs, not just the most recent one.

This is the "ratchet" in action across rungs: the system starts broad, learns what matters, and zooms in.

2.6 The two scoring functions

Score function Used by Formula Optimises for
weighted_acceptance_cost Rungs 1--3 (quality * acceptance) / cost Best magic-state quality at reasonable cost
factory_throughput Rungs 4--5 (acceptance * witness) / cost (heavier cost penalty) Accepted states per unit cost, as a proxy for distillation factory yield

The factory score also computes FactoryMetrics (accepted per shot, logical error per accepted, cost per accepted, throughput proxy) and attaches them to the experiment record for downstream analysis.

2.7 Transfer evaluation

Rung 3 can optionally run in transfer mode: instead of searching over backends as a dimension (which just finds the easiest backend), it evaluates the same spec across multiple backends and scores it by the minimum (pessimistic) score. A spec that scores 0.18 on Brisbane and 0.02 on Kyoto gets a transfer score of 0.02, not 0.10. This prevents backend overfitting.

python -m autoresearch_quantum run-transfer \
  --config configs/rungs/rung3.yaml \
  --backends fake_brisbane fake_kyoto fake_sherbrooke

2.8 Resumability

Every ratchet step saves a progress.json checkpoint:

{
  "rung": 2,
  "steps_completed": 2,
  "patience_remaining": 1,
  "current_incumbent_id": "r2-incumbent-a1b2c3d4e5",
  "completed": false
}

If the process crashes or you Ctrl-C, re-running the same rung picks up from the last completed step with the correct patience counter. No work is lost.


Part 3: Claims and how the tests prove them

Claim 1: The encoded state is a valid magic state in the 4,2,2 code.

Test: test_encoded_target_state_satisfies_stabilizers

Constructs the ideal encoded magic statevector and checks that both stabilizers (XXXX and ZZZZ) have expectation value exactly 1.0. If the encoding circuit were wrong, at least one stabilizer would not be +1.

Claim 2: The circuit bundle measures the right observables.

Test: test_circuit_bundle_contains_expected_contexts

Verifies that build_circuit_bundle() produces exactly the four expected circuits (logical_x, logical_y, spectator_z, acceptance), each with correct metadata. If a measurement basis rotation were missing or a circuit were mislabelled, this catches it.

Claim 3: Noisy simulation produces meaningful scores.

Test: test_local_executor_produces_score

Runs a full evaluation (build circuits, transpile, simulate with noise, postselect, compute witness, score) and checks that the score is positive and the acceptance rate and witness are in [0, 1]. This is an integration test of the entire evaluation pipeline -- if any piece is broken, the score collapses.

Claim 4: The challenger generator explores the search space correctly.

Tests: test_neighbor_challengers_mutate_single_dimension, test_neighbor_walk_respects_history, test_random_combo_generates_multi_axis_mutations, test_lesson_guided_uses_rules, test_composite_generator_combines_strategies

These verify:

  • NeighborWalk changes exactly one field per challenger.
  • Passing a history set of already-seen fingerprints produces zero duplicates.
  • RandomCombo produces at least one challenger with >1 changed field (the defining property of multi-axis mutation).
  • LessonGuided respects "fix" rules: when told to fix seed_style=ry_rz, every generated challenger has that value.
  • The composite generator stays within the budget cap.

Claim 5: The lesson system extracts correct prefer/avoid/fix rules.

Tests: test_extract_search_rules_prefer_and_avoid, test_narrow_search_space_removes_avoided, test_build_lesson_feedback_end_to_end

Given synthetic experiment records where z_only scores 0.80--0.85 and both scores 0.50--0.55, the extractor must emit a "prefer z_only" and "avoid both" rule. narrow_search_space must actually remove avoided values and constrain fixed dimensions.

Claim 6: The factory score function computes throughput metrics.

Tests: test_factory_throughput_score_produces_metrics, test_score_registry_has_factory

Given known input metrics (acceptance 0.70, witness 0.80), verifies that factory_throughput_score produces a positive score, attaches factory_metrics to the extra dict, and that accepted_states_per_shot equals the input acceptance rate.

Claim 7: Transfer evaluation runs the same spec across backends.

Test: test_transfer_evaluator_runs_across_backends

Runs a transfer evaluation on a single backend (for speed) and checks that a TransferReport is returned with a positive transfer score and the correct backend key in per_backend_scores.

Claim 8: Progress and feedback survive serialisation round-trips.

Tests: test_save_and_load_progress, test_save_and_load_lesson_feedback

Writes a RungProgress / LessonFeedback to disk via the store, reads it back, and verifies all fields match. If the JSON schema or the deserialisation logic drifts, this catches it.

Claim 9: A full rung saves progress and produces both lesson types.

Tests: test_run_rung_saves_progress, test_run_rung_returns_lesson_and_feedback

Runs a complete rung (bootstrap + steps + lesson extraction) and checks that progress.json exists and is marked completed, and that the return value includes both a human-readable RungLesson and a machine-readable LessonFeedback.

Claim 10: Multi-rung ratchet propagates winners and accumulates lessons.

Test: test_run_ratchet_propagates_winner

Runs a two-rung ratchet and checks that:

  • Both rungs produce (lesson, feedback) tuples.
  • harness._accumulated_lessons contains entries from both rungs, proving that rung 2 had access to rung 1's rules when generating challengers.

Claim 11: Different specs get different simulator seeds.

Test: test_different_specs_get_different_seeds

The old code used seed_simulator = 11_000 + repeat_index, meaning every spec got the same random stream. The new code hashes the spec's fingerprint into the seed. This test creates two specs that differ only in verification and checks that their computed seeds are different.


Part 4: The teaching layer

The system is not only a research engine. It is also a course. Twelve Jupyter notebooks, organised into four independent learning plans, teach the same material through different pedagogical lenses. The teaching layer sits on top of the research engine and uses its real components (circuits, simulators, scorers, ratchet) as the substrate for interactive learning.

4.1 Entry point: 00_START_HERE.ipynb

Every learner begins at notebooks/00_START_HERE.ipynb. This notebook contains no code --- it is a plan selector. It describes the four plans, their target audiences, and links directly to each plan's first notebook. All content notebooks link back to Start Here.

4.2 The four plans

Plan Style Notebooks Target learner
A Bottom-up, sequential 3 Methodical learners who want foundations first
B Spiral, three passes 1 (78 cells) Time-pressed learners who want a demo first, theory later
C Parallel tracks + dashboard 4 Learners who want to choose their own path
D Hypothesis-driven experiments 3 Research-oriented learners who want to test claims

All four plans cover the same core concepts: T-state preparation, 4,2,2 encoding, stabiliser verification, postselection, scoring, the ratchet optimiser, lesson extraction, and cross-rung transfer.

4.3 Interactive assessments (teaching/assess.py)

Every content notebook includes interactive assessments built with ipywidgets:

  • quiz() --- multiple-choice questions with immediate feedback
  • predict_choice() --- "What do you think will happen?" before running code
  • reflect() --- open-ended reflections graded by keyword matching
  • order() --- drag-and-drop ordering exercises (e.g., rank error types)

Each assessment is tagged with a Bloom's taxonomy level (remember, understand, apply, analyse, evaluate) and a topic. The full mapping of learning objectives to assessments is documented in notebooks/learning_objectives.md.

4.4 Progress tracking (teaching/tracker.py)

Each notebook creates a LearningTracker instance that records:

  • scores per assessment (correct/incorrect, attempt count)
  • Bloom's level distribution (how many of each level attempted/passed)
  • time spent per assessment
  • checkpoint summaries at natural breakpoints

At the end of each notebook, tracker.dashboard() displays a visual summary, and tracker.save() persists progress to a JSON file. Progress files can be reset with bash scripts/app.sh reset.

4.5 Navigation

Every content notebook has a navigation footer with:

  • Forward link to the next notebook in the plan
  • Back-link to 00_START_HERE.ipynb
  • Cross-plan suggestions at terminal notebooks (e.g., "Finished Plan A? Try Plan D for a different perspective.")

4.6 Pedagogical quality enforcement

The test suite includes tests/test_pedagogy.py, which enforces educational quality invariants across all content notebooks:

  • Minimum 200 words of prose per notebook
  • At least 25% of cells are markdown (not code-only)
  • Every notebook has a title header and multiple sections
  • At least 2 interactive assessments per notebook
  • At least 2 different assessment types per notebook (variety)
  • Bloom's taxonomy coverage: at least 2 levels per notebook
  • Checkpoint summaries present when a notebook has 4+ assessments
  • LearningTracker initialisation, dashboard(), and save() in every notebook
  • Key Insight callouts in longer notebooks (5+ sections)
  • All four plans collectively cover core concepts (stabiliser, magic, witness, ratchet)

These tests catch pedagogical regressions the same way unit tests catch code regressions. Adding a new notebook or modifying an existing one will fail CI if it violates these invariants.


Part 5: The consumer experience (app.sh)

The project includes a lifecycle manager (scripts/app.sh) that handles the entire consumer experience from first clone to running notebooks:

bash scripts/app.sh bootstrap     # venv, pip install, kernel registration, import check
bash scripts/app.sh start         # launch JupyterLab, open 00_START_HERE.ipynb
bash scripts/app.sh stop          # graceful shutdown
bash scripts/app.sh status        # venv, server, notebook, progress summary
bash scripts/app.sh validate      # ruff + mypy + full test suite
bash scripts/app.sh validate --quick  # lint + type check + unit tests only
bash scripts/app.sh logs          # tail JupyterLab output
bash scripts/app.sh reset         # delete learner progress files

Bootstrap checks Python >= 3.11, creates the venv, installs the package with dev and notebook dependencies, registers a Jupyter kernel, and verifies that core imports succeed. Start finds a free port (8888-8899), launches JupyterLab in the background with PID tracking, and opens the browser directly to 00_START_HERE.ipynb.

Validation runs the full quality pipeline: ruff linting, mypy strict type checking, and the pytest suite (335 tests, excluding browser UX by default). The --quick flag runs only lint, type check, and unit tests.


Part 6: The file map

autoresearch-quantum/
  configs/rungs/
    rung1.yaml             Baseline: what recipe works at all?
    rung2.yaml             Stability: does it hold under noise variation?
    rung3.yaml             Transfer: does it work on other devices?
    rung4.yaml             Factory: what maximises throughput per cost?
    rung5.yaml             Rosenfeld: which heuristics are load-bearing?

  src/autoresearch_quantum/
    models.py              Every data structure in one file
    config.py              YAML -> RungConfig parser
    cli.py                 Entry point: run-experiment, run-step, run-rung,
                           run-ratchet, run-transfer

    codes/
      four_two_two.py      The [[4,2,2]] code: stabilizers, logical ops,
                           encoder circuits, magic seed gates

    experiments/
      encoded_magic_state.py   Builds the four-circuit measurement bundle

    execution/
      local.py             LocalCheapExecutor: Aer noise simulation
      hardware.py          IBMHardwareExecutor: real-device SamplerV2
      transfer.py          TransferEvaluator: same spec across N backends
      analysis.py          Postselection, eigenvalues, witness formula
      backends.py          Backend resolution (fake_* or IBM runtime)
      transpile.py         Transpilation, gate counting, runtime estimates

    scoring/
      score.py             weighted_acceptance_cost + factory_throughput

    search/
      challengers.py       GeneratedChallenger, neighbor generation, dedup
      strategies.py        NeighborWalk, RandomCombo, LessonGuided,
                           CompositeGenerator

    lessons/
      extractor.py         Human-readable RungLesson + machine LessonFeedback
      feedback.py          SearchRule extraction, interaction detection,
                           search space narrowing

    ratchet/
      runner.py            AutoresearchHarness: the orchestrator

    persistence/
      store.py             JSON file store: experiments, steps, progress,
                           lessons, feedback, propagated specs

    teaching/
      assess.py            Widget-based quizzes, predictions, reflections
      tracker.py           LearningTracker: per-student progress tracking

  notebooks/
    00_START_HERE.ipynb    Central entry point: plan selector
    learning_objectives.md Per-notebook, per-section learning objectives
    plan_a/                Bottom-up: 3 sequential notebooks
    plan_b/                Spiral: 1 notebook, 3 passes
    plan_c/                Parallel tracks + dashboard: 4 notebooks
    plan_d/                Hypothesis-driven: 3 experiments

  paper/
    autoresearch_quantum.tex   Technical paper (LaTeX, 19 pages)
    compendium.tex             Companion textbook (LaTeX, 36 pages)

  scripts/
    app.sh                 Consumer lifecycle manager (bootstrap/start/stop/validate)

  tests/                   335 tests across 13 files
    test_analysis.py       Postselection & witness
    test_browser_ux.py     Playwright end-to-end UX
    test_cli.py            CLI subcommands
    test_codes.py          [[4,2,2]] code correctness
    test_config.py         YAML config loading
    test_experiments.py    Circuit bundle construction
    test_feedback.py       Lesson extraction & search rules
    test_harness.py        Full ratchet integration
    test_notebooks.py      Notebook execution & structure
    test_pedagogy.py       Pedagogical quality invariants (130 tests)
    test_persistence.py    JSON store round-trips
    test_scoring.py        Score functions
    test_teaching.py       Assessment widgets & tracker

  .github/workflows/ci.yml  CI: lint, type check, test matrix, notebook execution
  .pre-commit-config.yaml   Ruff, mypy, nbstripout, hygiene hooks

  data/                    Output directory (created at runtime)
    default/
      rung_1/
        experiments/       One JSON per evaluated spec
        ratchet_steps/     One JSON per step
        incumbent.json     Current best
        progress.json      Resumability checkpoint
        lesson.json        Machine-readable lesson
        lesson.md          Human-readable narrative
        lesson_feedback.json   SearchRules for the next rung
      rung_2/
        propagated_spec.json   Winner carried from rung 1
        ...

Part 7: How to use it without Claude

You do not need an AI to run this system or to make progress with its output. Everything below runs in your terminal.

7.1 Setup

cd autoresearch-quantum
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

7.2 Run a single experiment

python -m autoresearch_quantum run-experiment \
  --config configs/rungs/rung1.yaml \
  --set verification=z_only \
  --set seed_style=ry_rz

This prints a JSON result with the score, failure mode, and experiment ID. The full record is saved to data/default/rung_1/experiments/.

7.3 Run one ratchet step

python -m autoresearch_quantum run-step \
  --config configs/rungs/rung1.yaml

This bootstraps an incumbent (if none exists), generates challengers, evaluates them, promotes the best, and saves the step record. Run it again and it generates new challengers (never repeating), with a new incumbent if one was found.

7.4 Run a full rung

python -m autoresearch_quantum run-rung \
  --config configs/rungs/rung1.yaml

Runs up to step_budget steps (default 3), stopping early if patience runs out. Produces data/default/rung_1/lesson.md -- read this file. It tells you what helped, what hurt, what seems invariant, and what to test next.

7.5 Run the full five-rung ratchet

python -m autoresearch_quantum run-ratchet \
  --config configs/rungs/rung1.yaml \
  --config configs/rungs/rung2.yaml \
  --config configs/rungs/rung3.yaml \
  --config configs/rungs/rung4.yaml \
  --config configs/rungs/rung5.yaml

This is the full pipeline. Each rung's winner is automatically propagated to the next rung. Each rung's lessons narrow the search space for the next. When it finishes, you have five lesson files and a final optimised recipe.

7.6 Run a transfer evaluation

python -m autoresearch_quantum run-transfer \
  --config configs/rungs/rung3.yaml \
  --backends fake_brisbane fake_kyoto fake_sherbrooke

Tests a single spec across multiple backend noise models. The output tells you the per-backend scores and the pessimistic transfer score.

7.7 Reading the output

After a ratchet run, the most valuable artefacts are:

File What to do with it
rung_N/lesson.md Read it. It is a structured report. The "What Helped" section tells you which settings to keep. The "What Hurt" section tells you what to stop trying.
rung_N/lesson_feedback.json This is the machine-readable version. Open it and look at the rules array. Each rule has an action (prefer/avoid/fix), a dimension, a value, a confidence (0--1), and a reason.
rung_N/incumbent.json Contains the experiment_id of the current best spec. Load the corresponding file from experiments/ to see its full spec and scores.
rung_N/propagated_spec.json The spec that was carried forward from the previous rung. Compare it with the YAML bootstrap to see what the system changed.
rung_N/progress.json If the run was interrupted, this tells you where it left off. Just re-run the same command to resume.

7.8 Making manual progress with the artefacts

The system is designed so that you can interleave human intuition with automated search:

  1. Read the lesson. If rung 1 says verification=z_only consistently helps, you now know something about the physics: X-stabiliser checking adds gate cost without enough quality payoff at this noise level.

  2. Edit the YAML. Remove values that the lesson says to avoid. Add new values you want to explore. Change the weights if you care more about throughput than fidelity. Save the file and re-run.

  3. Run single experiments. If you have a specific hypothesis ("What if approximation_degree=0.95 helps?"), test it directly with run-experiment --set approximation_degree=0.95. The result is saved to the store and will be included in the next lesson extraction.

  4. Resume interrupted runs. If your laptop dies mid-rung, just re-run the same command. Progress is checkpointed after every step.

  5. Compare across rungs. Open rung_1/lesson_feedback.json and rung_3/lesson_feedback.json side by side. Rules that appear in both with high confidence are load-bearing. Rules that appear in rung 1 but vanish by rung 3 were artefacts of the initial noise model.

  6. Feed results to a new search. Copy the best_spec_fields from lesson_feedback.json into a new YAML config as the bootstrap incumbent. Define a tighter search space around the winning region. Run another rung. You are now doing what the system does in run_ratchet -- but with human judgement about what to explore next.

7.9 Running the tests

# Full validation (recommended)
bash scripts/app.sh validate

# Or directly with pytest
python -m pytest tests/ -v

All 335 tests should pass (browser UX tests excluded by default). If a test fails after you edit a YAML config, the most likely cause is that you introduced a dimension value that does not correspond to an implemented code path (e.g., encoder_style: "rzz_lattice" does not exist in four_two_two.py).


Part 8: What this system does NOT do (yet)

  • It does not run on real quantum hardware by default. The IBMHardwareExecutor exists and is wired up, but enable_hardware: false in every config. Set it to true and provide credentials via the QISKIT_IBM_TOKEN environment variable to use real devices.

  • It does not do distillation. Rung 5 (Rosenfeld Direction) identifies which heuristics matter for factory-style workflows, but it does not actually build a distillation circuit. That is the next project.

  • It does not use LLMs in the loop. The "auto" is algorithmic (statistical rule extraction + guided search), not generative. There is no GPT/Claude call inside the ratchet loop. The intelligence is in the SearchRule extraction, the CompositeGenerator budget allocation, and the cross-rung propagation logic.

  • CLI output is JSON and Markdown. The CLI ratchet produces JSON files and Markdown lessons. For interactive exploration, use the Plan C dashboard notebook (plan_c/00_dashboard.ipynb), which provides a widget-based interface for running experiments and viewing results.

  • It does not parallelise evaluations. Each experiment runs sequentially. On a machine with multiple cores, you could shard the challenger set across processes, but that is not implemented.


Part 9: Architecture diagram

                          configs/rungs/rung1-5.yaml
                                    |
                                    v
                          +---------+---------+
                          |   AutoresearchHarness   |
                          |   (ratchet/runner.py)    |
                          +---+-----+-----+---+
                              |     |     |
                 +------------+     |     +------------+
                 |                  |                   |
                 v                  v                   v
         CompositeGenerator    LocalCheapExecutor   ResearchStore
        (search/strategies.py) (execution/local.py) (persistence/store.py)
                 |                  |                   |
      +----------+------+          |          +--------+--------+
      |          |      |          |          |        |        |
      v          v      v          v          v        v        v
  Neighbor  Random  Lesson    build_circuit  save_   save_    save_
  Walk      Combo   Guided    _bundle()      exp     step     progress
                      |            |
                      v            v
              LessonFeedback   AerSimulator
             (lessons/          + noise model
              feedback.py)      + postselection
                                + witness
                                + scoring

The data flows in a circle:

  Evaluate --> Score --> Compare --> Learn --> Narrow --> Generate --> Evaluate

That circle is the ratchet step. Each rung runs it multiple times. Each ratchet runs multiple rungs. The lessons tighten the circle with every pass.


This document was last updated on 2026-04-15 to describe the system as built. The code is the ground truth. If this document contradicts the code, the code is correct.