The Story of autoresearch-quantum
What this system does, in one paragraph
This is a machine that discovers, by itself, the best way to prepare an encoded magic state on the 4,2,2 quantum error-detecting code. You give it a starting recipe and a search space of alternatives. It runs hundreds of simulated quantum experiments, scores them, learns which choices help and which choices hurt, narrows the search, and climbs to the best recipe it can find -- then hands you a written lesson explaining what it learned and why. The entire loop -- propose, evaluate, compare, learn, repeat -- runs without human intervention. That is the "auto" in autoresearch.
Part 1: The quantum computing problem
1.1 What is a magic state?
Fault-tolerant quantum computers need a special ingredient called a magic state to perform the T gate -- the non-Clifford gate that makes quantum computation universal. You cannot create this state using Clifford operations alone, so you prepare a noisy approximation and then distill it into a high-fidelity copy. The preparation step is the bottleneck: if your raw magic states are junk, distillation is expensive or impossible.
1.2 What is the 4,2,2 code?
The 4,2,2 code is the smallest quantum error-detecting code. It uses 4 physical qubits to encode 2 logical qubits. It cannot correct errors, but it can detect them: if an error flips one qubit, the code's stabilizers (XXXX and ZZZZ) flag it, and you can throw the shot away. This postselection raises quality at the cost of throughput.
The code has two logical qubits. We use one to carry the magic state and the other as a spectator -- an untouched qubit whose Z-measurement tells us whether the encoding process corrupted the logical subspace.
1.3 What knobs does this system turn?
An experiment recipe (called an ExperimentSpec) has ~15 tuneable dimensions:
| Dimension | What it controls | Example values |
|---|---|---|
| `seed_style` | How the raw T-state is prepared on qubit 0 | `h_p`, `ry_rz`, `u_magic` |
| `encoder_style` | How the 4-qubit encoding circuit is built | `cx_chain`, `cz_compiled` |
| `verification` | Which stabilizers are measured before readout | `both`, `z_only`, `x_only`, `none` |
| `postselection` | Which syndrome outcomes cause a shot to be discarded | `all_measured`, `z_only`, `none` |
| `ancilla_strategy` | Whether verification uses 1 reused or 2 dedicated ancillas | `dedicated_pair`, `reused_single` |
| `optimization_level` | Qiskit transpiler aggressiveness | 1, 2, 3 |
| `layout_method` | Physical qubit placement algorithm | `sabre`, `dense` |
| `routing_method` | SWAP insertion algorithm | `sabre`, `basic` |
| `target_backend` | Which IBM device topology to compile for | `fake_brisbane`, `fake_kyoto`, ... |
| `shots` | Samples per circuit | 256 -- 4096 |
The question the system answers: Which combination of these choices gives the highest-quality encoded magic states at the lowest cost?
1.4 How is each experiment evaluated?
For each ExperimentSpec, the executor:
- Builds four circuits (`encoded_magic_state.py`):
  - `acceptance` -- measures all data qubits in the Z basis after verification, to compute the postselection acceptance rate.
  - `logical_x` -- rotates into the X basis before measurement, to get `<X_L>` on the magic-carrying logical qubit.
  - `logical_y` -- rotates into the Y basis, to get `<Y_L>`.
  - `spectator_z` -- measures the spectator logical qubit in Z, to get `<Z_spectator>`.
- Transpiles them for the target backend's coupling map and basis gates.
- Simulates them on Qiskit Aer with the backend's calibrated noise model, repeating the configured number of times with independent random seeds.
- Postselects: for each shot, checks the syndrome register. Shots where the stabiliser flagged an error are discarded. What remains is the postselected ensemble.
- Computes metrics from the postselected data:

  | Metric | Formula | What it measures |
  |---|---|---|
  | `logical_magic_witness` | `((1 + (X_L + Y_L)/sqrt(2)) / 2) * ((1 + Z_spectator) / 2)` | Magic-state quality, penalised if spectator is disturbed |
  | `acceptance_rate` | `accepted_shots / total_shots` | Throughput (what fraction survives postselection) |
  | `stability_score` | `1 - pstdev(repeat_scores) / mean(repeat_scores)` | Consistency across independent repeat runs |
  | `noisy_encoded_fidelity` | `Tr(rho_noisy \| target><target \|)` via density matrix simulation | How close the noisy state is to the ideal encoded T-state |
  | `codespace_rate` | Mean acceptance across all four circuit types | Overall codespace survival |
  | `two_qubit_count`, `depth` | From the transpiled circuits | Cost proxies |

- Scores the experiment by combining these metrics into a single scalar:

  `score = (quality * acceptance_rate) / cost`

  where `quality` is a weighted sum of the metrics above (weights are per-rung, configured in YAML) and `cost` accounts for gate count, depth, shots, and estimated runtime. A small sketch of this arithmetic follows the list.
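As a rough illustration, the witness and score arithmetic reduces to a few lines of plain Python. This is a minimal sketch: the real weighting lives in `scoring/score.py` and the per-rung YAML, and the function names here are illustrative.

```python
import math

def logical_magic_witness(x_l: float, y_l: float, z_spectator: float) -> float:
    # ((1 + (X_L + Y_L)/sqrt(2)) / 2) * ((1 + Z_spectator) / 2) from the metrics table
    return ((1 + (x_l + y_l) / math.sqrt(2)) / 2) * ((1 + z_spectator) / 2)

def rung_score(quality: float, acceptance_rate: float, cost: float) -> float:
    # score = (quality * acceptance_rate) / cost
    return (quality * acceptance_rate) / cost

# An ideal T-state gives <X_L> = <Y_L> = 1/sqrt(2); an undisturbed spectator gives <Z> = 1.
assert abs(logical_magic_witness(1 / math.sqrt(2), 1 / math.sqrt(2), 1.0) - 1.0) < 1e-12
```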
Part 2: The autoresearch engine (the meta layer)
This is a direct implementation of the Karpathy autoresearch pattern: an automated loop that does what a diligent PhD student would do -- try things, keep what works, learn why, zoom in, try harder things.
2.1 The ratchet metaphor
A ratchet is a mechanism that only moves forward. In this system:
- The incumbent is the best experiment found so far.
- Each step, the system generates challengers -- modified versions of the incumbent -- evaluates them, and replaces the incumbent only if a challenger beats it by a configured margin.
- The incumbent can only improve. It never regresses.
A rung is a complete search campaign: multiple ratchet steps, with a patience counter that stops the rung early if the incumbent stops improving.
A full ratchet runs multiple rungs in sequence, each one asking a progressively harder question.
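In code, a single ratchet step is roughly the following. This is an illustrative sketch, not the actual API of `ratchet/runner.py`; `evaluate` and `margin` are stand-ins.

```python
def ratchet_step(incumbent, challengers, evaluate, margin=0.0):
    """Keep the incumbent unless a challenger beats it by at least `margin`."""
    best_spec, best_score = incumbent, evaluate(incumbent)
    for challenger in challengers:
        score = evaluate(challenger)
        if score > best_score + margin:   # forward-only: the incumbent never regresses
            best_spec, best_score = challenger, score
    return best_spec, best_score
```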
2.2 The five rungs
Rung 1: "What preparation recipe works at all?"
|
| winner propagates down
v
Rung 2: "Is it stable across noisy backends?"
|
| winner propagates down, search space narrows
v
Rung 3: "Does it transfer to other devices?"
|
| winner propagates down, search space narrows further
v
Rung 4: "What maximises throughput per cost?"
|
| winner propagates down, only proven dimensions survive
v
Rung 5: "Which heuristics are load-bearing for distillation?"
Each rung is a YAML file (configs/rungs/rung1.yaml through rung5.yaml)
that configures:
- What to search over (the dimension grid)
- How to score (which quality metrics matter most)
- How hard to search (step budget, patience, promotion rules)
- Where to start (bootstrap incumbent)
The key insight: the output of the system is not just the best circuit. It is the best circuit plus a machine-readable set of rules about what worked and why, formatted so the next rung (or the next human) can pick up where the machine left off.
2.3 The search strategies
The original Codex implementation had a single strategy: change one knob at a time and see if the score improves. This is local hill-climbing. It plateaus after one pass through the neighbours.
The new system uses a composite generator that allocates its budget across three strategies:
| Strategy | Weight | What it does |
|---|---|---|
| `NeighborWalk` | 40% | Classic single-axis perturbation. Reliable, no surprises. |
| `RandomCombo` | 30% | Picks 1--3 dimensions at random and mutates them simultaneously. Escapes local optima by making multi-axis jumps. |
| `LessonGuided` | 30% | Reads the `SearchRule` directives from previous rungs. Fixes dimensions that are proven. Avoids values that are proven bad. Samples preferred values with probability proportional to confidence. |
When no lessons exist yet (rung 1), RandomCombo gets 60% of the budget
to maximise early exploration.
Every generated candidate is checked against a history set of all previously evaluated fingerprints. The system never wastes a slot evaluating a spec it has already seen.
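A rough sketch of the budget split plus fingerprint deduplication follows. It is illustrative only: the real classes are in `search/strategies.py` and `search/challengers.py`, and `fingerprint()` here stands in for however the spec is actually hashed.

```python
def generate_candidates(incumbent, strategies, budget, history):
    """strategies: list of (weight, generate_fn); history: set of already-seen fingerprints."""
    candidates = []
    for weight, generate in strategies:          # e.g. [(0.4, neighbor), (0.3, combo), (0.3, guided)]
        quota = max(1, round(budget * weight))
        for spec in generate(incumbent, quota):
            fingerprint = spec.fingerprint()     # stand-in for the spec's stable hash
            if fingerprint in history or len(candidates) >= budget:
                continue                         # never re-evaluate a spec we have already seen
            history.add(fingerprint)
            candidates.append(spec)
    return candidates
```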
2.4 The lesson feedback loop
After each rung completes, two artefacts are produced:
- RungLesson (human-readable): a Markdown narrative that says things like "verification=z_only improved mean score by +0.0312 over 8 runs" and "Consider probing remaining ancilla_strategy values."
- LessonFeedback (machine-readable): a list of `SearchRule` objects, for example:
  - `SearchRule(dimension="verification", action="prefer", value="z_only", confidence=0.67, reason="mean score 0.1823 is +0.0312 above overall mean")`
  - `SearchRule(dimension="seed_style", action="fix", value="ry_rz", confidence=0.60, reason="all top-3 experiments use seed_style=ry_rz")`
  - `SearchRule(dimension="verification+postselection", action="prefer", value=("z_only", "z_only"), confidence=0.33, reason="interaction effect +0.0089 (joint=+0.0401, expected_additive=+0.0312)")`

The rules come from three analyses:
- Per-dimension mean effects: for each value of each dimension, compute the mean score minus the overall mean. Positive = prefer, negative = avoid.
- Fix detection: if the top-K experiments all share a value, and that value outperforms alternatives, emit a "fix" rule.
- Interaction detection: for each pair of dimensions, check whether the joint effect exceeds the sum of the two marginal effects. If so, there is a synergy (or conflict) between those two choices.
These rules feed directly into the LessonGuided strategy in the next rung.
They also feed into narrow_search_space(), which prunes "avoid" values and
constrains "fix" dimensions, physically shrinking the grid the next rung
searches over.
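The per-dimension mean-effect analysis can be sketched in a few lines. This is a simplified illustration; the real extractor in `lessons/feedback.py` also handles fix detection, interaction effects, and its own confidence weighting.

```python
from collections import defaultdict
from statistics import mean

def mean_effect_rules(records, dimension, threshold=0.01):
    """records: list of (spec_dict, score) pairs. Emit prefer/avoid rules for one dimension."""
    overall = mean(score for _, score in records)
    scores_by_value = defaultdict(list)
    for spec, score in records:
        scores_by_value[spec[dimension]].append(score)
    rules = []
    for value, scores in scores_by_value.items():
        effect = mean(scores) - overall          # positive => prefer, negative => avoid
        if abs(effect) >= threshold:
            rules.append({
                "dimension": dimension,
                "action": "prefer" if effect > 0 else "avoid",
                "value": value,
                "confidence": len(scores) / len(records),   # illustrative confidence measure
                "reason": f"mean score {mean(scores):.4f} is {effect:+.4f} vs overall mean",
            })
    return rules
```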
2.5 Cross-rung propagation
When run_ratchet() finishes rung N and begins rung N+1:
- The winner spec from rung N becomes the bootstrap incumbent for rung N+1. The human-written YAML bootstrap is overridden. (A `propagated_spec.json` is saved for traceability.)
- The accumulated SearchRules from all completed rungs are combined and used to narrow the search space of rung N+1.
- The `LessonGuided` strategy in rung N+1 has access to rules from all previous rungs, not just the most recent one.
This is the "ratchet" in action across rungs: the system starts broad, learns what matters, and zooms in.
2.6 The two scoring functions
| Score function | Used by | Formula | Optimises for |
|---|---|---|---|
| `weighted_acceptance_cost` | Rungs 1--3 | `(quality * acceptance) / cost` | Best magic-state quality at reasonable cost |
| `factory_throughput` | Rungs 4--5 | `(acceptance * witness) / cost` (heavier cost penalty) | Accepted states per unit cost, as a proxy for distillation factory yield |
The factory score also computes FactoryMetrics (accepted per shot, logical
error per accepted, cost per accepted, throughput proxy) and attaches them to
the experiment record for downstream analysis.
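A sketch of the factory score and its attached metrics, assuming plain float inputs: the real `factory_throughput` in `scoring/score.py` applies the heavier cost penalty and richer bookkeeping, and the error proxy below is purely illustrative.

```python
def factory_throughput_sketch(acceptance: float, witness: float, cost: float):
    metrics = {
        "accepted_states_per_shot": acceptance,
        "logical_error_per_accepted": 1.0 - witness,             # illustrative proxy, not the real formula
        "cost_per_accepted": cost / acceptance if acceptance > 0 else float("inf"),
        "throughput_proxy": (acceptance * witness) / cost,
    }
    return metrics["throughput_proxy"], metrics                  # score plus a FactoryMetrics-style dict
```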
2.7 Transfer evaluation
Rung 3 can optionally run in transfer mode: instead of searching over backends as a dimension (which just finds the easiest backend), it evaluates the same spec across multiple backends and scores it by the minimum (pessimistic) score. A spec that scores 0.18 on Brisbane and 0.02 on Kyoto gets a transfer score of 0.02, not 0.10. This prevents backend overfitting.
python -m autoresearch_quantum run-transfer \
--config configs/rungs/rung3.yaml \
--backends fake_brisbane fake_kyoto fake_sherbrooke
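The pessimistic aggregation itself is simple to sketch. This is illustrative; the real logic lives in `execution/transfer.py`, and `evaluate_on` is a stand-in for a per-backend evaluation call.

```python
def transfer_score(spec, backends, evaluate_on):
    per_backend = {name: evaluate_on(spec, name) for name in backends}
    return min(per_backend.values()), per_backend

# e.g. {"fake_brisbane": 0.18, "fake_kyoto": 0.02} -> transfer score 0.02, not the mean 0.10
```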
2.8 Resumability
Every ratchet step saves a progress.json checkpoint:
{
"rung": 2,
"steps_completed": 2,
"patience_remaining": 1,
"current_incumbent_id": "r2-incumbent-a1b2c3d4e5",
"completed": false
}
If the process crashes or you Ctrl-C, re-running the same rung picks up from the last completed step with the correct patience counter. No work is lost.
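A minimal sketch of how a resumable run could consult the checkpoint before doing any work; the actual handling is inside `ratchet/runner.py` and `persistence/store.py`.

```python
import json
from pathlib import Path

def remaining_steps(progress_path: Path, step_budget: int) -> int:
    if not progress_path.exists():
        return step_budget                              # fresh rung: run every step
    progress = json.loads(progress_path.read_text())
    if progress.get("completed"):
        return 0                                        # rung already finished
    return max(0, step_budget - progress["steps_completed"])
```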
Part 3: Claims and how the tests prove them
Claim 1: The encoded state is a valid magic state in the 4,2,2 code.
Test: test_encoded_target_state_satisfies_stabilizers
Constructs the ideal encoded magic statevector and checks that both stabilizers (XXXX and ZZZZ) have expectation value exactly 1.0. If the encoding circuit were wrong, at least one stabilizer would not be +1.
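A condensed sketch of that check using `qiskit.quantum_info`, assuming a helper that returns the ideal encoded statevector (the real test builds it from the encoder in `codes/four_two_two.py`):

```python
from qiskit.quantum_info import Pauli, Statevector

def assert_codespace(encoded_state: Statevector) -> None:
    for stabilizer in ("XXXX", "ZZZZ"):
        expectation = encoded_state.expectation_value(Pauli(stabilizer)).real
        assert abs(expectation - 1.0) < 1e-9, f"{stabilizer} expectation was {expectation}"
```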
Claim 2: The circuit bundle measures the right observables.
Test: test_circuit_bundle_contains_expected_contexts
Verifies that build_circuit_bundle() produces exactly the four expected
circuits (logical_x, logical_y, spectator_z, acceptance), each with correct
metadata. If a measurement basis rotation were missing or a circuit were
mislabelled, this catches it.
Claim 3: Noisy simulation produces meaningful scores.
Test: test_local_executor_produces_score
Runs a full evaluation (build circuits, transpile, simulate with noise, postselect, compute witness, score) and checks that the score is positive and the acceptance rate and witness are in [0, 1]. This is an integration test of the entire evaluation pipeline -- if any piece is broken, the score collapses.
Claim 4: The challenger generator explores the search space correctly.
Tests: test_neighbor_challengers_mutate_single_dimension,
test_neighbor_walk_respects_history,
test_random_combo_generates_multi_axis_mutations,
test_lesson_guided_uses_rules,
test_composite_generator_combines_strategies
These verify:
- NeighborWalk changes exactly one field per challenger.
- Passing a history set of already-seen fingerprints produces zero duplicates.
- RandomCombo produces at least one challenger with >1 changed field (the defining property of multi-axis mutation).
- LessonGuided respects "fix" rules: when told to fix `seed_style=ry_rz`, every generated challenger has that value.
- The composite generator stays within the budget cap.
Claim 5: The lesson system extracts correct prefer/avoid/fix rules.
Tests: test_extract_search_rules_prefer_and_avoid,
test_narrow_search_space_removes_avoided,
test_build_lesson_feedback_end_to_end
Given synthetic experiment records where z_only scores 0.80--0.85 and
both scores 0.50--0.55, the extractor must emit a "prefer z_only" and
"avoid both" rule. narrow_search_space must actually remove avoided values
and constrain fixed dimensions.
Claim 6: The factory score function computes throughput metrics.
Tests: test_factory_throughput_score_produces_metrics,
test_score_registry_has_factory
Given known input metrics (acceptance 0.70, witness 0.80), verifies that
factory_throughput_score produces a positive score, attaches
factory_metrics to the extra dict, and that accepted_states_per_shot
equals the input acceptance rate.
Claim 7: Transfer evaluation runs the same spec across backends.
Test: test_transfer_evaluator_runs_across_backends
Runs a transfer evaluation on a single backend (for speed) and checks that a
TransferReport is returned with a positive transfer score and the correct
backend key in per_backend_scores.
Claim 8: Progress and feedback survive serialisation round-trips.
Tests: test_save_and_load_progress,
test_save_and_load_lesson_feedback
Writes a RungProgress / LessonFeedback to disk via the store, reads it
back, and verifies all fields match. If the JSON schema or the
deserialisation logic drifts, this catches it.
Claim 9: A full rung saves progress and produces both lesson types.
Tests: test_run_rung_saves_progress,
test_run_rung_returns_lesson_and_feedback
Runs a complete rung (bootstrap + steps + lesson extraction) and checks that
progress.json exists and is marked completed, and that the return value
includes both a human-readable RungLesson and a machine-readable
LessonFeedback.
Claim 10: Multi-rung ratchet propagates winners and accumulates lessons.
Test: test_run_ratchet_propagates_winner
Runs a two-rung ratchet and checks that:
- Both rungs produce (lesson, feedback) tuples.
- `harness._accumulated_lessons` contains entries from both rungs, proving that rung 2 had access to rung 1's rules when generating challengers.
Claim 11: Different specs get different simulator seeds.
Test: test_different_specs_get_different_seeds
The old code used seed_simulator = 11_000 + repeat_index, meaning every
spec got the same random stream. The new code hashes the spec's fingerprint
into the seed. This test creates two specs that differ only in verification
and checks that their computed seeds are different.
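A sketch of fingerprint-derived seeding; the constants and hashing choice below are illustrative, not the exact code in `execution/local.py`.

```python
import hashlib

def simulator_seed(fingerprint: str, repeat_index: int) -> int:
    digest = hashlib.sha256(fingerprint.encode()).digest()
    base = int.from_bytes(digest[:4], "big")        # spec-dependent base, unlike the old fixed 11_000
    return (base + repeat_index) % (2**31 - 1)      # distinct stream per spec and per repeat
```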
Part 4: The teaching layer
The system is not only a research engine. It is also a course. Twelve Jupyter notebooks, organised into four independent learning plans, teach the same material through different pedagogical lenses. The teaching layer sits on top of the research engine and uses its real components (circuits, simulators, scorers, ratchet) as the substrate for interactive learning.
4.1 Entry point: 00_START_HERE.ipynb
Every learner begins at notebooks/00_START_HERE.ipynb. This notebook
contains no code --- it is a plan selector. It describes the four plans, their
target audiences, and links directly to each plan's first notebook. All
content notebooks link back to Start Here.
4.2 The four plans
| Plan | Style | Notebooks | Target learner |
|---|---|---|---|
| A | Bottom-up, sequential | 3 | Methodical learners who want foundations first |
| B | Spiral, three passes | 1 (78 cells) | Time-pressed learners who want a demo first, theory later |
| C | Parallel tracks + dashboard | 4 | Learners who want to choose their own path |
| D | Hypothesis-driven experiments | 3 | Research-oriented learners who want to test claims |
All four plans cover the same core concepts: T-state preparation, 4,2,2 encoding, stabiliser verification, postselection, scoring, the ratchet optimiser, lesson extraction, and cross-rung transfer.
4.3 Interactive assessments (teaching/assess.py)
Every content notebook includes interactive assessments built with ipywidgets:
- quiz() --- multiple-choice questions with immediate feedback
- predict_choice() --- "What do you think will happen?" before running code
- reflect() --- open-ended reflections graded by keyword matching
- order() --- drag-and-drop ordering exercises (e.g., rank error types)
Each assessment is tagged with a Bloom's taxonomy level (remember, understand,
apply, analyse, evaluate) and a topic. The full mapping of learning objectives
to assessments is documented in notebooks/learning_objectives.md.
4.4 Progress tracking (teaching/tracker.py)
Each notebook creates a LearningTracker instance that records:
- scores per assessment (correct/incorrect, attempt count)
- Bloom's level distribution (how many of each level attempted/passed)
- time spent per assessment
- checkpoint summaries at natural breakpoints
At the end of each notebook, tracker.dashboard() displays a visual summary,
and tracker.save() persists progress to a JSON file. Progress files can be
reset with bash scripts/app.sh reset.
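A hypothetical notebook cell using these helpers might look like the following. The import paths and argument names are assumptions made for illustration, not the exact signatures in `teaching/assess.py` and `teaching/tracker.py`.

```python
from autoresearch_quantum.teaching.assess import quiz          # assumed import path
from autoresearch_quantum.teaching.tracker import LearningTracker

tracker = LearningTracker("plan_a_01")                         # assumed constructor argument
quiz(                                                          # parameter names are illustrative
    question="Which stabilizers define the [[4,2,2]] code?",
    options=["XXXX and ZZZZ", "XXZZ and ZZXX", "XYXY and YXYX"],
    answer=0,
    bloom="remember",
    tracker=tracker,
)
tracker.dashboard()   # visual summary at the end of the notebook
tracker.save()        # persist progress to a JSON file
```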
4.5 Navigation
Every content notebook has a navigation footer with:
- Forward link to the next notebook in the plan
- Back-link to 00_START_HERE.ipynb
- Cross-plan suggestions at terminal notebooks (e.g., "Finished Plan A? Try Plan D for a different perspective.")
4.6 Pedagogical quality enforcement
The test suite includes tests/test_pedagogy.py, which enforces educational
quality invariants across all content notebooks:
- Minimum 200 words of prose per notebook
- At least 25% of cells are markdown (not code-only)
- Every notebook has a title header and multiple sections
- At least 2 interactive assessments per notebook
- At least 2 different assessment types per notebook (variety)
- Bloom's taxonomy coverage: at least 2 levels per notebook
- Checkpoint summaries present when a notebook has 4+ assessments
- LearningTracker initialisation, dashboard(), and save() in every notebook
- Key Insight callouts in longer notebooks (5+ sections)
- All four plans collectively cover core concepts (stabiliser, magic, witness, ratchet)
These tests catch pedagogical regressions the same way unit tests catch code regressions. Adding a new notebook or modifying an existing one will fail CI if it violates these invariants.
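One of those invariants, sketched with `nbformat` (a simplified stand-in for the real checks in `tests/test_pedagogy.py`):

```python
import nbformat

def markdown_fraction(notebook_path: str) -> float:
    nb = nbformat.read(notebook_path, as_version=4)
    if not nb.cells:
        return 0.0
    markdown = sum(1 for cell in nb.cells if cell.cell_type == "markdown")
    return markdown / len(nb.cells)

def test_markdown_share(notebook_path: str) -> None:
    assert markdown_fraction(notebook_path) >= 0.25, "notebook is too code-heavy"
```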
Part 5: The consumer experience (app.sh)
The project includes a lifecycle manager (scripts/app.sh) that handles the
entire consumer experience from first clone to running notebooks:
bash scripts/app.sh bootstrap # venv, pip install, kernel registration, import check
bash scripts/app.sh start # launch JupyterLab, open 00_START_HERE.ipynb
bash scripts/app.sh stop # graceful shutdown
bash scripts/app.sh status # venv, server, notebook, progress summary
bash scripts/app.sh validate # ruff + mypy + full test suite
bash scripts/app.sh validate --quick # lint + type check + unit tests only
bash scripts/app.sh logs # tail JupyterLab output
bash scripts/app.sh reset # delete learner progress files
Bootstrap checks Python >= 3.11, creates the venv, installs the package with
dev and notebook dependencies, registers a Jupyter kernel, and verifies that
core imports succeed. Start finds a free port (8888-8899), launches JupyterLab
in the background with PID tracking, and opens the browser directly to
00_START_HERE.ipynb.
Validation runs the full quality pipeline: ruff linting, mypy strict type
checking, and the pytest suite (335 tests, excluding browser UX by default).
The --quick flag runs only lint, type check, and unit tests.
Part 6: The file map
autoresearch-quantum/
configs/rungs/
rung1.yaml Baseline: what recipe works at all?
rung2.yaml Stability: does it hold under noise variation?
rung3.yaml Transfer: does it work on other devices?
rung4.yaml Factory: what maximises throughput per cost?
rung5.yaml Rosenfeld: which heuristics are load-bearing?
src/autoresearch_quantum/
models.py Every data structure in one file
config.py YAML -> RungConfig parser
cli.py Entry point: run-experiment, run-step, run-rung,
run-ratchet, run-transfer
codes/
four_two_two.py The [[4,2,2]] code: stabilizers, logical ops,
encoder circuits, magic seed gates
experiments/
encoded_magic_state.py Builds the four-circuit measurement bundle
execution/
local.py LocalCheapExecutor: Aer noise simulation
hardware.py IBMHardwareExecutor: real-device SamplerV2
transfer.py TransferEvaluator: same spec across N backends
analysis.py Postselection, eigenvalues, witness formula
backends.py Backend resolution (fake_* or IBM runtime)
transpile.py Transpilation, gate counting, runtime estimates
scoring/
score.py weighted_acceptance_cost + factory_throughput
search/
challengers.py GeneratedChallenger, neighbor generation, dedup
strategies.py NeighborWalk, RandomCombo, LessonGuided,
CompositeGenerator
lessons/
extractor.py Human-readable RungLesson + machine LessonFeedback
feedback.py SearchRule extraction, interaction detection,
search space narrowing
ratchet/
runner.py AutoresearchHarness: the orchestrator
persistence/
store.py JSON file store: experiments, steps, progress,
lessons, feedback, propagated specs
teaching/
assess.py Widget-based quizzes, predictions, reflections
tracker.py LearningTracker: per-student progress tracking
notebooks/
00_START_HERE.ipynb Central entry point: plan selector
learning_objectives.md Per-notebook, per-section learning objectives
plan_a/ Bottom-up: 3 sequential notebooks
plan_b/ Spiral: 1 notebook, 3 passes
plan_c/ Parallel tracks + dashboard: 4 notebooks
plan_d/ Hypothesis-driven: 3 experiments
paper/
autoresearch_quantum.tex Technical paper (LaTeX, 19 pages)
compendium.tex Companion textbook (LaTeX, 36 pages)
scripts/
app.sh Consumer lifecycle manager (bootstrap/start/stop/validate)
tests/ 335 tests across 13 files
test_analysis.py Postselection & witness
test_browser_ux.py Playwright end-to-end UX
test_cli.py CLI subcommands
test_codes.py [[4,2,2]] code correctness
test_config.py YAML config loading
test_experiments.py Circuit bundle construction
test_feedback.py Lesson extraction & search rules
test_harness.py Full ratchet integration
test_notebooks.py Notebook execution & structure
test_pedagogy.py Pedagogical quality invariants (130 tests)
test_persistence.py JSON store round-trips
test_scoring.py Score functions
test_teaching.py Assessment widgets & tracker
.github/workflows/ci.yml CI: lint, type check, test matrix, notebook execution
.pre-commit-config.yaml Ruff, mypy, nbstripout, hygiene hooks
data/ Output directory (created at runtime)
default/
rung_1/
experiments/ One JSON per evaluated spec
ratchet_steps/ One JSON per step
incumbent.json Current best
progress.json Resumability checkpoint
lesson.json Machine-readable lesson
lesson.md Human-readable narrative
lesson_feedback.json SearchRules for the next rung
rung_2/
propagated_spec.json Winner carried from rung 1
...
Part 7: How to use it without Claude
You do not need an AI to run this system or to make progress with its output. Everything below runs in your terminal.
7.1 Setup
cd autoresearch-quantum
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
7.2 Run a single experiment
python -m autoresearch_quantum run-experiment \
--config configs/rungs/rung1.yaml \
--set verification=z_only \
--set seed_style=ry_rz
This prints a JSON result with the score, failure mode, and experiment ID.
The full record is saved to data/default/rung_1/experiments/.
7.3 Run one ratchet step
python -m autoresearch_quantum run-step \
--config configs/rungs/rung1.yaml
This bootstraps an incumbent (if none exists), generates challengers, evaluates them, promotes the best, and saves the step record. Run it again and it generates new challengers (never repeating), with a new incumbent if one was found.
7.4 Run a full rung
python -m autoresearch_quantum run-rung \
--config configs/rungs/rung1.yaml
Runs up to step_budget steps (default 3), stopping early if patience runs
out. Produces data/default/rung_1/lesson.md -- read this file. It tells you
what helped, what hurt, what seems invariant, and what to test next.
7.5 Run the full five-rung ratchet
python -m autoresearch_quantum run-ratchet \
--config configs/rungs/rung1.yaml \
--config configs/rungs/rung2.yaml \
--config configs/rungs/rung3.yaml \
--config configs/rungs/rung4.yaml \
--config configs/rungs/rung5.yaml
This is the full pipeline. Each rung's winner is automatically propagated to the next rung. Each rung's lessons narrow the search space for the next. When it finishes, you have five lesson files and a final optimised recipe.
7.6 Run a transfer evaluation
python -m autoresearch_quantum run-transfer \
--config configs/rungs/rung3.yaml \
--backends fake_brisbane fake_kyoto fake_sherbrooke
Tests a single spec across multiple backend noise models. The output tells you the per-backend scores and the pessimistic transfer score.
7.7 Reading the output
After a ratchet run, the most valuable artefacts are:
| File | What to do with it |
|---|---|
| `rung_N/lesson.md` | Read it. It is a structured report. The "What Helped" section tells you which settings to keep. The "What Hurt" section tells you what to stop trying. |
| `rung_N/lesson_feedback.json` | This is the machine-readable version. Open it and look at the `rules` array. Each rule has an action (prefer/avoid/fix), a dimension, a value, a confidence (0--1), and a reason. |
| `rung_N/incumbent.json` | Contains the `experiment_id` of the current best spec. Load the corresponding file from `experiments/` to see its full spec and scores. |
| `rung_N/propagated_spec.json` | The spec that was carried forward from the previous rung. Compare it with the YAML bootstrap to see what the system changed. |
| `rung_N/progress.json` | If the run was interrupted, this tells you where it left off. Just re-run the same command to resume. |
7.8 Making manual progress with the artefacts
The system is designed so that you can interleave human intuition with automated search:
- Read the lesson. If rung 1 says `verification=z_only` consistently helps, you now know something about the physics: X-stabiliser checking adds gate cost without enough quality payoff at this noise level.
- Edit the YAML. Remove values that the lesson says to avoid. Add new values you want to explore. Change the weights if you care more about throughput than fidelity. Save the file and re-run.
- Run single experiments. If you have a specific hypothesis ("What if `approximation_degree=0.95` helps?"), test it directly with `run-experiment --set approximation_degree=0.95`. The result is saved to the store and will be included in the next lesson extraction.
- Resume interrupted runs. If your laptop dies mid-rung, just re-run the same command. Progress is checkpointed after every step.
- Compare across rungs. Open `rung_1/lesson_feedback.json` and `rung_3/lesson_feedback.json` side by side. Rules that appear in both with high confidence are load-bearing. Rules that appear in rung 1 but vanish by rung 3 were artefacts of the initial noise model. (A sketch of this comparison follows the list.)
- Feed results to a new search. Copy the `best_spec_fields` from `lesson_feedback.json` into a new YAML config as the bootstrap incumbent. Define a tighter search space around the winning region. Run another rung. You are now doing what the system does in `run_ratchet` -- but with human judgement about what to explore next.
7.9 Running the tests
# Full validation (recommended)
bash scripts/app.sh validate
# Or directly with pytest
python -m pytest tests/ -v
All 335 tests should pass (browser UX tests excluded by default). If a test
fails after you edit a YAML config, the most likely cause is that you
introduced a dimension value that does not correspond to an implemented code
path (e.g., encoder_style: "rzz_lattice" does not exist in
four_two_two.py).
Part 8: What this system does NOT do (yet)
- It does not run on real quantum hardware by default. The `IBMHardwareExecutor` exists and is wired up, but every config sets `enable_hardware: false`. Set it to `true` and provide credentials via the `QISKIT_IBM_TOKEN` environment variable to use real devices.
- It does not do distillation. Rung 5 (Rosenfeld Direction) identifies which heuristics matter for factory-style workflows, but it does not actually build a distillation circuit. That is the next project.
- It does not use LLMs in the loop. The "auto" is algorithmic (statistical rule extraction + guided search), not generative. There is no GPT/Claude call inside the ratchet loop. The intelligence is in the `SearchRule` extraction, the `CompositeGenerator` budget allocation, and the cross-rung propagation logic.
- CLI output is JSON and Markdown. The CLI ratchet produces JSON files and Markdown lessons. For interactive exploration, use the Plan C dashboard notebook (`plan_c/00_dashboard.ipynb`), which provides a widget-based interface for running experiments and viewing results.
- It does not parallelise evaluations. Each experiment runs sequentially. On a machine with multiple cores, you could shard the challenger set across processes, but that is not implemented.
Part 9: Architecture diagram
configs/rungs/rung1-5.yaml
|
v
+---------+---------+
| AutoresearchHarness |
| (ratchet/runner.py) |
+---+-----+-----+---+
| | |
+------------+ | +------------+
| | |
v v v
CompositeGenerator LocalCheapExecutor ResearchStore
(search/strategies.py) (execution/local.py) (persistence/store.py)
| | |
+----------+------+ | +--------+--------+
| | | | | | |
v v v v v v v
Neighbor Random Lesson build_circuit save_ save_ save_
Walk Combo Guided _bundle() exp step progress
| |
v v
LessonFeedback AerSimulator
(lessons/ + noise model
feedback.py) + postselection
+ witness
+ scoring
The data flows in a circle:
Evaluate --> Score --> Compare --> Learn --> Narrow --> Generate --> Evaluate
That circle is the ratchet step. Each rung runs it multiple times. Each ratchet runs multiple rungs. The lessons tighten the circle with every pass.
This document was last updated on 2026-04-15 to describe the system as built. The code is the ground truth. If this document contradicts the code, the code is correct.