% ============================================================================ % autoresearch-quantum: Automated Heuristic Optimisation for Encoded % Magic-State Preparation on the [[4,2,2]] Code % ============================================================================ \documentclass[11pt,a4paper]{article} % ── Typography & layout ───────────────────────────────────────────────────── \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage{lmodern} \usepackage{microtype} \usepackage[margin=2.4cm]{geometry} \usepackage{parskip} \usepackage{setspace} \onehalfspacing % ── Mathematics ───────────────────────────────────────────────────────────── \usepackage{amsmath,amssymb,amsthm} \usepackage{braket} \usepackage{mathtools} % ── Figures & tables ──────────────────────────────────────────────────────── \usepackage{graphicx} \usepackage{booktabs} \usepackage{array} \usepackage{multirow} \usepackage{float} \usepackage{caption} \captionsetup{font=small,labelfont=bf} % ── Code listings ─────────────────────────────────────────────────────────── \usepackage{listings} \usepackage{xcolor} \usepackage[most]{tcolorbox} \definecolor{codebg}{HTML}{F7F7F7} \definecolor{codeframe}{HTML}{CCCCCC} \definecolor{keywordblue}{HTML}{0550AE} \definecolor{stringgreen}{HTML}{0A6640} \definecolor{commentgray}{HTML}{6A737D} \definecolor{accentred}{HTML}{CF222E} \lstdefinestyle{python}{ language=Python, basicstyle=\small\ttfamily, keywordstyle=\color{keywordblue}\bfseries, stringstyle=\color{stringgreen}, commentstyle=\color{commentgray}\itshape, backgroundcolor=\color{codebg}, frame=single, rulecolor=\color{codeframe}, breaklines=true, breakatwhitespace=false, showstringspaces=false, tabsize=4, xleftmargin=1em, xrightmargin=1em, aboveskip=1em, belowskip=1em, } \lstdefinestyle{yaml}{ basicstyle=\small\ttfamily, backgroundcolor=\color{codebg}, frame=single, rulecolor=\color{codeframe}, breaklines=true, showstringspaces=false, xleftmargin=1em, xrightmargin=1em, aboveskip=1em, 
belowskip=1em, } \lstdefinestyle{shell}{ basicstyle=\small\ttfamily, backgroundcolor=\color{codebg}, frame=single, rulecolor=\color{codeframe}, breaklines=true, showstringspaces=false, xleftmargin=1em, xrightmargin=1em, aboveskip=1em, belowskip=1em, } % ── Explanation boxes ─────────────────────────────────────────────────────── \newtcolorbox{intuition}{ colback=blue!3, colframe=blue!40!black, fonttitle=\bfseries, title=Intuition, boxrule=0.5pt, arc=2pt, left=6pt,right=6pt,top=4pt,bottom=4pt, } \newtcolorbox{keyconcept}{ colback=orange!4, colframe=orange!60!black, fonttitle=\bfseries, title=Key Concept, boxrule=0.5pt, arc=2pt, left=6pt,right=6pt,top=4pt,bottom=4pt, } \newtcolorbox{warning}{ colback=red!3, colframe=red!50!black, fonttitle=\bfseries, title=Subtlety, boxrule=0.5pt, arc=2pt, left=6pt,right=6pt,top=4pt,bottom=4pt, } % ── Algorithms ────────────────────────────────────────────────────────────── \usepackage[ruled,vlined,linesnumbered]{algorithm2e} % ── Cross-references & links ─────────────────────────────────────────────── \usepackage[colorlinks=true,linkcolor=blue!60!black,citecolor=blue!60!black,urlcolor=blue!60!black]{hyperref} \usepackage{cleveref} % ── Theorem environments ─────────────────────────────────────────────────── \newtheorem{definition}{Definition}[section] \newtheorem{proposition}{Proposition}[section] \newtheorem{claim}{Claim}[section] % ── Shorthand ────────────────────────────────────────────────────────────── \newcommand{\code}[1]{\texttt{#1}} \newcommand{\file}[1]{\texttt{\small #1}} \newcommand{\CC}{\mathcal{C}} \newcommand{\HH}{\mathcal{H}} \newcommand{\R}{\mathbb{R}} \newcommand{\expect}[1]{\langle #1 \rangle} \hypersetup{pdftitle={Automated Heuristic Optimisation for Encoded Magic-State Preparation on the [[4,2,2]] Code}} \title{% \textbf{Automated Heuristic Optimisation for Encoded Magic-State\\ Preparation on the $\boldsymbol{[\![4,2,2]\!]}$ Code}\\[0.8em] \large A Karpathy-Style Autoresearch Ratchet\\ for Quantum Error 
Detection Experiments } \author{% System design and implementation by Claude (Anthropic)\\ Research direction and project ownership by the author\\[0.3em] {\small \url{https://github.com/saymrwulf/autoresearch-quantum}} } \date{April 2026} \begin{document} \maketitle \begin{abstract} We present \textsc{autoresearch-quantum}, an automated experimental optimisation system for encoded magic-state preparation on the $[\![4,2,2]\!]$ quantum error-detecting code. The system implements a five-rung progressive search ratchet---inspired by Karpathy's autoresearch pattern---that discovers high-quality preparation recipes through simulated noisy evaluation, statistical lesson extraction, and cross-rung knowledge propagation. Starting from a combinatorial space of ${\sim}10^4$ possible experiment configurations (seed gates, encoder circuits, verification protocols, transpilation strategies), the ratchet autonomously narrows to a near-optimal recipe without human intervention between rungs. We describe the quantum-mechanical observables being optimised, the search and scoring machinery, the machine-readable feedback loop that closes the ``auto'' in autoresearch, and a complete test suite that anchors each architectural claim to an executable proof. The system runs entirely on classical simulation via Qiskit Aer and is designed for eventual promotion to IBM Quantum hardware. \end{abstract} \tableofcontents \newpage % ============================================================================ \section{Introduction and Motivation} \label{sec:intro} % ============================================================================ Fault-tolerant quantum computation requires non-Clifford gates. The standard approach is to prepare \emph{magic states}---resource states that, when consumed by gate teleportation, implement a $T$ gate ($\pi/8$ rotation) on a logical qubit \cite{bravyi2005}. 
The quality of these raw magic states directly governs the cost of downstream distillation: higher input fidelity means fewer distillation rounds, which means fewer physical qubits and less wall-clock time. Preparing an encoded magic state is not a single circuit---it is a family of choices: how to seed the initial rotation, how to encode into the code's logical subspace, whether to verify stabilisers, how aggressively to postselect, how to transpile for a specific device topology. Each choice interacts with the others in ways that are difficult to predict analytically, especially under realistic device noise. \begin{keyconcept} The core question is combinatorial, not algebraic: \emph{which combination of preparation, verification, and transpilation choices yields the highest-quality encoded magic states at the lowest cost, under a given noise model?} This is an experimental optimisation problem, and it is the kind of problem that machines are better at than humans---provided you give the machine a well-defined objective and a mechanism to learn from its own experiments. \end{keyconcept} \textsc{autoresearch-quantum} is that mechanism. It is a direct implementation of the \emph{autoresearch pattern} described by Karpathy~\cite{karpathy2025}: an automated loop that proposes experiments, evaluates them, extracts lessons, and uses those lessons to propose better experiments. The system produces two outputs: an optimised experiment recipe, and a structured explanation of why that recipe works. % ============================================================================ \section{The Quantum Problem} \label{sec:quantum} % ============================================================================ % ── 2.1 The [[4,2,2]] code ───────────────────────────────────────────────── \subsection{\texorpdfstring{The $[\![4,2,2]\!]$ Error-Detecting Code}{The [[4,2,2]] Error-Detecting Code}} \label{sec:code} The $[\![4,2,2]\!]$ code is the smallest quantum error-detecting code. 
It encodes $k = 2$ logical qubits into $n = 4$ physical qubits and has distance $d = 2$: it can detect any single-qubit error, but cannot correct it. Its stabiliser group is generated by two operators:
\begin{equation}
S_X = X^{\otimes 4} = XXXX, \qquad S_Z = Z^{\otimes 4} = ZZZZ.
\label{eq:stabilizers}
\end{equation}
Any state $\ket{\psi}$ in the code space satisfies $S_X\ket{\psi} = S_Z\ket{\psi} = +\ket{\psi}$. A single-qubit Pauli error $E$ anticommutes with at least one stabiliser, so measuring the stabilisers reveals whether an error occurred without collapsing the logical information.
\begin{intuition}
Think of the stabilisers as parity checks. $S_Z = ZZZZ$ checks whether the total $Z$-parity of all four qubits is even. If a bit-flip error $X_i$ hits one qubit, the parity flips from even to odd, and you detect it. $S_X = XXXX$ does the same in the $X$ basis, catching phase-flip errors. Together, they detect any single error of any type---but they cannot tell you \emph{which} qubit was affected, so correction is impossible. The response is to \emph{discard} the shot (postselection).
\end{intuition}
The code has two logical qubits with the following representative logical operators (physical qubits numbered 1--4):
\begin{equation}
\bar{X}_1 = X_1 X_3, \quad \bar{Z}_1 = Z_1 Z_2, \quad \bar{X}_2 = X_1 X_2, \quad \bar{Z}_2 = Z_1 Z_3.
\label{eq:logicals}
\end{equation}
Each pair $(\bar{X}_i, \bar{Z}_i)$ anticommutes, operators belonging to different logical qubits commute, and every logical operator commutes with both stabilisers, as required. We use logical qubit~1 to carry the magic state and logical qubit~2 as a \emph{spectator}: an idle qubit whose $\bar{Z}_2$ measurement monitors whether the encoding process preserved the code space.

% ── 2.2 Magic states ───────────────────────────────────────────────────────
\subsection{\texorpdfstring{Magic States and the $T$ Gate}{Magic States and the T Gate}}
\label{sec:magic}

The $T$ gate applies the rotation $T = \mathrm{diag}(1, e^{i\pi/4})$ to a single qubit.
It cannot be implemented transversally on most stabiliser codes, so the standard workaround is \emph{gate teleportation}: consume a pre-prepared resource state to apply $T$ to a data qubit via Clifford operations and measurement. The canonical magic state for the $T$ gate is:
\begin{equation}
\ket{T} = \frac{1}{\sqrt{2}}\bigl(\ket{0} + e^{i\pi/4}\ket{1}\bigr) = T H \ket{0}.
\label{eq:magic_state}
\end{equation}
Our system prepares an \emph{encoded} version: the magic state is placed on logical qubit~1 of the $[\![4,2,2]\!]$ code, so that error detection protects it during storage and transport.
\begin{warning}
The $[\![4,2,2]\!]$ code is too small for real fault tolerance---its distance of~2 means a two-qubit error goes undetected. We use it because it is the \emph{simplest non-trivial testbed} for the engineering question of how to prepare, verify, and evaluate an encoded magic state. The heuristics discovered here---which seed gates help, whether $X$-stabiliser checking is worth the cost, how transpilation interacts with noise---are intended to transfer to larger codes.
\end{warning}

% ── 2.3 Preparation circuits ───────────────────────────────────────────────
\subsection{Preparation Circuit Architecture}
\label{sec:prep}

The preparation circuit has two stages:

\paragraph{Stage 1: Seed gate.} A single-qubit rotation on physical qubit~0 creates the raw magic state before encoding. The system explores three decompositions, all mathematically equivalent on an ideal device but differing in their interaction with noise:
\begin{center}
\begin{tabular}{lll}
\toprule
\textbf{Name} & \textbf{Gate sequence} & \textbf{Rationale} \\
\midrule
\code{h\_p} & $H$ then $P(\pi/4)$ & Native Qiskit decomposition \\
\code{ry\_rz} & $R_y(\pi/2)$ then $R_z(\pi/4)$ & Avoids the $H$ gate entirely \\
\code{u\_magic} & $U(\pi/2, \pi/4, 0)$ & Single physical gate; minimal depth \\
\bottomrule
\end{tabular}
\end{center}
All three produce the same ideal state $\ket{T}$.
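The claimed equivalence can be spot-checked with plain \code{numpy}, independently of the repository's code. This is a standalone sketch; note that in Qiskit's convention $U(\theta,\phi,\lambda)\ket{0} = \cos\tfrac{\theta}{2}\ket{0} + e^{i\phi}\sin\tfrac{\theta}{2}\ket{1}$, so the $\pi/4$ phase must occupy the $\phi$ argument of the single-gate seed:
\begin{lstlisting}[style=python]
import numpy as np

# Single-qubit gates in Qiskit's matrix conventions.
H  = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
P  = lambda lam: np.diag([1, np.exp(1j * lam)])
Ry = lambda th: np.array([[np.cos(th/2), -np.sin(th/2)],
                          [np.sin(th/2),  np.cos(th/2)]])
Rz = lambda lam: np.diag([np.exp(-1j*lam/2), np.exp(1j*lam/2)])
U  = lambda th, phi, lam: np.array(
    [[np.cos(th/2),               -np.exp(1j*lam)*np.sin(th/2)],
     [np.exp(1j*phi)*np.sin(th/2), np.exp(1j*(phi+lam))*np.cos(th/2)]])

ket0   = np.array([1, 0], dtype=complex)
target = np.array([1, np.exp(1j*np.pi/4)]) / np.sqrt(2)  # |T>

seeds = {
    "h_p":     P(np.pi/4) @ H @ ket0,
    "ry_rz":   Rz(np.pi/4) @ Ry(np.pi/2) @ ket0,
    "u_magic": U(np.pi/2, np.pi/4, 0) @ ket0,
}
for name, psi in seeds.items():
    # Equality up to global phase: |<T|psi>| = 1.
    assert np.isclose(abs(np.vdot(target, psi)), 1.0), name
\end{lstlisting}
The \code{ry\_rz} variant differs from $\ket{T}$ by a global phase $e^{-i\pi/8}$, which is why the check compares overlap magnitudes rather than amplitudes.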
Under noise, they differ because each gate decomposition couples differently to coherent and incoherent error channels of the physical device. \paragraph{Stage 2: Encoder.} An entangling circuit maps the single-qubit state on qubit~0 into the four-qubit code space. Two implementations exist: \begin{itemize} \item \code{cx\_chain}: A sequence of CNOT gates that propagates the logical information: \[ \text{CNOT}_{0 \to 2}, \quad \text{CNOT}_{1 \to 0}, \quad H_3, \quad \text{CNOT}_{3 \to 0}, \quad \text{CNOT}_{3 \to 1}, \quad \text{CNOT}_{3 \to 2}. \] \item \code{cz\_compiled}: The same logical operation decomposed into CZ and Hadamard gates, which may be more natural for devices with a native CZ interaction. \end{itemize} % ── 2.4 Verification and postselection ───────────────────────────────────── \subsection{Verification and Postselection} \label{sec:verification} After encoding, the system optionally measures one or both stabilisers using ancilla qubits. This is \emph{verification}: a check that the state is in the code space before proceeding to the logical measurement. \begin{definition}[Postselection] \label{def:postselection} Given $N$ total shots, let $A \subseteq \{1, \ldots, N\}$ be the set of shots where all measured syndromes are $+1$ (no error detected). The \emph{postselected ensemble} consists of only the shots in $A$. The \emph{acceptance rate} is $|A|/N$. \end{definition} The system supports four postselection modes: \begin{center} \begin{tabular}{lp{8cm}} \toprule \textbf{Mode} & \textbf{Behaviour} \\ \midrule \code{all\_measured} & Discard a shot if \emph{any} measured syndrome is $-1$ \\ \code{z\_only} & Discard only if the $S_Z$ syndrome is $-1$ \\ \code{x\_only} & Discard only if the $S_X$ syndrome is $-1$ \\ \code{none} & Keep all shots regardless of syndrome \\ \bottomrule \end{tabular} \end{center} \begin{intuition} Postselection is a quality--throughput trade-off. 
Aggressive postselection ($\code{all\_measured}$) gives higher-quality surviving shots but discards more data, lowering the acceptance rate. The optimiser's job is to find the mode that maximises the product of quality and throughput under the prevailing noise.
\end{intuition}

Two ancilla strategies exist:
\begin{itemize}
\item \code{dedicated\_pair}: Each stabiliser gets its own ancilla qubit (two ancillas total for \code{both} verification). More gates, but no risk of mid-circuit reset errors.
\item \code{reused\_single}: One ancilla, measured and reset between stabiliser checks. Fewer qubits, but the reset introduces a potential error source.
\end{itemize}

% ── 2.5 Measurement protocol ──────────────────────────────────────────────
\subsection{Measurement Protocol and Observables}
\label{sec:observables}

After preparation and optional verification, the system constructs four circuits that measure different observables on the same encoded state:
\begin{enumerate}
\item \textbf{Acceptance circuit.} Measures all data qubits in the $Z$ basis. No basis rotation before measurement. Used to compute the acceptance rate.
\item \textbf{Logical $X$ circuit.} Applies $H$ to qubits $\{0, 2\}$ (the support of $\bar{X}_1$) before measurement, effectively measuring those qubits in the $X$ basis. The expectation value $\expect{\bar{X}_1}$ is computed from the parity of the postselected outcomes on the relevant qubits.
\item \textbf{Logical $Y$ circuit.} Applies $S^\dagger$ then $H$ to qubit~0 ($Y$-basis), $Z$-basis on qubit~1, and $H$ on qubit~2 ($X$-basis). This measures the three-body operator $Y_0 Z_1 X_2$, which is the logical $Y$ on qubit~1 of the $[\![4,2,2]\!]$ code. The expectation $\expect{\bar{Y}_1}$ is extracted from the postselected parity.
\item \textbf{Spectator $Z$ circuit.} Applies no basis-change gates: qubits~1 and~2 are measured directly in the $Z$ basis. Measures the two-body operator $Z_1 Z_2 = \bar{Z}_2$, the logical $Z$ of the spectator qubit.
\end{enumerate} \begin{keyconcept} A perfect encoded magic state satisfies $\expect{\bar{X}_1} = \expect{\bar{Y}_1} = 1/\sqrt{2}$ and $\expect{\bar{Z}_2} = +1$. Any deviation indicates either a preparation error (the seed or encoder is wrong), a noise-induced error (the device corrupted the state), or a code-space leakage (the spectator was disturbed). \end{keyconcept} % ── 2.6 The witness function ────────────────────────────────────────────── \subsection{The Magic-State Witness} \label{sec:witness} We combine the three observables into a single quality metric called the \emph{logical magic witness}: \begin{equation} W = \underbrace{% \frac{1 + \frac{\expect{\bar{X}_1} + \expect{\bar{Y}_1}}{\sqrt{2}}}{2} }_{\text{magic quality}} \;\cdot\; \underbrace{% \frac{1 + \expect{\bar{Z}_2}}{2} }_{\text{spectator alignment}}. \label{eq:witness} \end{equation} The first factor measures how close the logical qubit is to the ideal magic state. For $\ket{T}$, the Bloch vector has equal $X$ and $Y$ components of magnitude $1/\sqrt{2}$, so the numerator reaches its maximum of $1 + 1 = 2$ and the factor equals~1. The second factor penalises any disturbance of the spectator qubit. An ideal encoding leaves the spectator in $\ket{+}_L$, giving $\expect{\bar{Z}_2} = +1$ and a factor of~1. If the encoding process or noise leaks information into the spectator, $\expect{\bar{Z}_2}$ drops, and so does $W$. The witness is bounded: $W \in [0, 1]$, with $W = 1$ for the ideal encoded magic state. \begin{warning} $W$ is not a fidelity. It is a \emph{witness} in the sense that $W < 1$ proves the state is not ideal, but $W = 1$ does not uniquely identify the state (other states could also achieve $W = 1$ if the measured observables happen to align). For the purpose of heuristic optimisation, this is sufficient: we are maximising $W$ as a proxy for magic-state quality, not proving fault tolerance. 
\end{warning} % ============================================================================ \section{The Scoring Functions} \label{sec:scoring} % ============================================================================ Raw metrics (witness, acceptance rate, gate count, depth) must be collapsed into a single scalar for the ratchet to compare experiments. The system provides two scoring functions, used at different stages of the search. \subsection{Weighted Acceptance-Cost Score (Rungs 1--3)} \label{sec:wac} \begin{equation} \text{score}_{\text{WAC}} = \frac{Q \cdot r_{\text{accept}}}{\;C\;} \label{eq:wac} \end{equation} where: \begin{align} Q &= \frac{\sum_{i} w_i \cdot q_i}{\sum_{i} w_i}, \quad q_i \in \{F_{\text{ideal}},\; F_{\text{noisy}},\; W,\; r_{\text{code}},\; S,\; A_{\text{spec}}\}, \label{eq:quality} \\[4pt] C &= C_0 + \alpha_{\text{2q}} \cdot n_{\text{2q}} + \alpha_d \cdot d + \alpha_s \cdot n_s + \alpha_r \cdot t_r + \alpha_q \cdot c_q. \label{eq:cost} \end{align} \begin{center} \begin{tabular}{clp{7.5cm}} \toprule \textbf{Symbol} & \textbf{Name} & \textbf{Definition} \\ \midrule $F_{\text{ideal}}$ & Ideal fidelity & $\braket{T_{\text{enc}} | \rho_{\text{ideal}} | T_{\text{enc}}}$, from statevector sim \\ $F_{\text{noisy}}$ & Noisy fidelity & $\braket{T_{\text{enc}} | \rho_{\text{noisy}} | T_{\text{enc}}}$, from density-matrix sim with noise model \\ $W$ & Witness & Eq.~\eqref{eq:witness} \\ $r_{\text{code}}$ & Codespace rate & Mean acceptance across all four circuits \\ $S$ & Stability score & $1 - \sigma(\{s_j\}) / \bar{s}$, where $s_j$ is the score of repeat $j$ \\ $A_{\text{spec}}$ & Spectator alignment & $(1 + \expect{\bar{Z}_2})/2$ \\ $r_{\text{accept}}$ & Acceptance rate & Fraction of shots surviving postselection \\ $n_{\text{2q}}$ & Two-qubit gates & Sum across all transpiled circuits \\ $d$ & Circuit depth & Maximum across all transpiled circuits \\ $n_s$ & Shot count & $\text{shots} \times \text{repeats} \times |\text{circuits}|$ \\ $t_r$ 
& Runtime estimate & Estimated wall-clock time \\ $c_q$ & Queue cost proxy & 0 for simulation; 1 for hardware \\ \bottomrule \end{tabular} \end{center} The weights $w_i$ and cost coefficients $\alpha_*$ are per-rung hyperparameters defined in the YAML configuration. For example, Rung~1 puts 40\% weight on noisy fidelity and 25\% on the witness, while Rung~2 shifts 20\% weight to stability. \begin{intuition} The score has a clear physical interpretation: it is \emph{useful quality per unit cost}. The numerator $Q \cdot r_{\text{accept}}$ measures how much high-quality output you get. The denominator $C$ measures what you pay in gates, depth, and shots. A circuit that achieves $W = 0.9$ but requires 80 two-qubit gates is penalised relative to one that achieves $W = 0.85$ with only 30 gates. \end{intuition} \subsection{Factory Throughput Score (Rungs 4--5)} \label{sec:factory} For the later rungs, the question shifts from ``what is the best state?'' to ``what gives the most accepted states per unit cost?''---a factory-style metric: \begin{equation} \text{score}_{\text{factory}} = \frac{r_{\text{accept}} \cdot W}{\;C'\;} \label{eq:factory} \end{equation} where $C'$ uses a 1.5$\times$ heavier penalty on two-qubit gate count and circuit depth than $C$. This is not a theoretical derivation; it is an engineering choice that biases the optimiser toward cheaper circuits when the quality is already good enough. In addition to the scalar score, the factory scorer computes and stores auxiliary metrics: \begin{equation} \text{throughput} = \frac{r_{\text{accept}}}{C'}, \qquad \text{logical error} = 1 - W, \qquad \text{cost per accepted} = \frac{C'}{r_{\text{accept}}}. \label{eq:factory_metrics} \end{equation} These are attached to the experiment record for downstream analysis. 
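Read operationally, Eqs.~\eqref{eq:witness}, \eqref{eq:wac} and \eqref{eq:factory_metrics} amount to a few lines of arithmetic. The sketch below uses hypothetical function and argument names; the repository's actual scorer classes may be organised differently:
\begin{lstlisting}[style=python]
import math

def magic_witness(exp_x, exp_y, exp_z_spec):
    """Logical magic witness W: magic quality times spectator alignment."""
    magic_quality = (1 + (exp_x + exp_y) / math.sqrt(2)) / 2
    spectator = (1 + exp_z_spec) / 2
    return magic_quality * spectator

def wac_score(qualities, weights, r_accept, cost):
    """Weighted acceptance-cost score: useful quality per unit cost."""
    Q = sum(w * q for w, q in zip(weights, qualities)) / sum(weights)
    return Q * r_accept / cost

def factory_score(r_accept, witness, cost):
    """Factory throughput score plus its auxiliary metrics."""
    return {
        "score": r_accept * witness / cost,
        "throughput": r_accept / cost,
        "logical_error": 1.0 - witness,
        "cost_per_accepted": cost / r_accept,
    }

# The ideal encoded magic state saturates the witness:
assert math.isclose(magic_witness(1/math.sqrt(2), 1/math.sqrt(2), 1.0), 1.0)
\end{lstlisting}
The final assertion checks the boundary case discussed in \S\ref{sec:witness}: $\expect{\bar{X}_1} = \expect{\bar{Y}_1} = 1/\sqrt{2}$ and $\expect{\bar{Z}_2} = +1$ give $W = 1$.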
% ============================================================================ \section{The Autoresearch Engine} \label{sec:engine} % ============================================================================ \subsection{The Ratchet Mechanism} \label{sec:ratchet} The ratchet is a monotone optimisation procedure: the incumbent (best-known experiment) can only improve or stay the same, never regress. \begin{algorithm}[H] \SetAlgoLined \SetKwInOut{Input}{Input} \SetKwInOut{Output}{Output} \Input{Rung config $\mathcal{R}$, step budget $B$, patience $P$} \Output{Steps $\mathcal{S}$, lesson $\ell$, feedback $\phi$} $x^* \leftarrow \textsc{EnsureIncumbent}(\mathcal{R})$\; $p \leftarrow P$\; \For{$t \leftarrow 1$ \KwTo $B$}{ $\mathcal{G} \leftarrow \textsc{CompositeGenerator}(x^*, \text{dims}(\mathcal{R}), H, \Phi)$\; \ForEach{challenger $c \in \mathcal{G}$}{ Evaluate $c$ on cheap tier\; $H \leftarrow H \cup \{\text{fingerprint}(c)\}$\; } $\hat{c} \leftarrow \arg\max_{c \in \mathcal{G}} \text{score}(c)$\; \uIf{$\text{score}(\hat{c}) > \text{score}(x^*) + \epsilon$}{ $x^* \leftarrow \hat{c}$\; $p \leftarrow P$ \tcp*{Reset patience} }\Else{ $p \leftarrow p - 1$\; } Save progress checkpoint\; \lIf{$p \leq 0$}{\textbf{break} \tcp*{Early stopping}} } $(\ell, \phi) \leftarrow \textsc{ExtractLesson}(\text{all experiments in } \mathcal{R})$\; $\Phi \leftarrow \Phi \cup \{\phi\}$\; \Return{steps, $\ell$, $\phi$}\; \caption{Run-Rung: a complete search campaign} \label{alg:rung} \end{algorithm} Key design choices: \begin{itemize} \item \textbf{History deduplication} ($H$): the set of all fingerprints ever evaluated, across all steps of the current rung. No experiment is run twice. \item \textbf{Patience} ($p$): if the incumbent survives $P$ consecutive steps without being dethroned, the rung terminates early. This prevents wasting budget when the search has converged. 
\item \textbf{Promotion threshold} ($\epsilon$): a challenger must beat the incumbent by at least $\epsilon$ (the \code{cheap\_margin}) to be considered for promotion. This prevents churn from noise-band fluctuations.
\end{itemize}

% ── 4.2 Multi-rung propagation ─────────────────────────────────────────────
\subsection{Multi-Rung Propagation}
\label{sec:propagation}

When the ratchet runs multiple rungs in sequence, it propagates knowledge forward:

\begin{algorithm}[H]
\SetAlgoLined
\SetKwInOut{Input}{Input}
\Input{Rung configs $[\mathcal{R}_1, \ldots, \mathcal{R}_K]$}
$\Phi \leftarrow \emptyset$ \tcp*{Accumulated lesson feedback}
\For{$i \leftarrow 1$ \KwTo $K$}{
  \If{$i > 1$}{
    Override $\mathcal{R}_i.\text{bootstrap}$ with winner spec from $\phi_{i-1}$\;
    $\mathcal{R}_i.\text{search\_space} \leftarrow \textsc{Narrow}(\mathcal{R}_i.\text{search\_space}, \Phi)$\;
  }
  $(\_,\; \ell_i,\; \phi_i) \leftarrow \textsc{RunRung}(\mathcal{R}_i, \Phi)$\;
  $\Phi \leftarrow \Phi \cup \{\phi_i\}$\;
}
\caption{Run-Ratchet: progressive multi-rung search}
\label{alg:ratchet}
\end{algorithm}

Two things happen between rungs:
\begin{enumerate}
\item \textbf{Winner propagation.} The best experiment spec from rung $i$ becomes the starting point (bootstrap incumbent) for rung $i{+}1$. The human-written YAML bootstrap is overridden. A \file{propagated\_spec.json} is saved for traceability.
\item \textbf{Search space narrowing.} The accumulated \code{SearchRule} objects from all completed rungs are combined and used to prune the dimension grid:
\begin{itemize}
\item Dimensions with a \code{fix} rule are constrained to a single value.
\item Values with an \code{avoid} rule are removed, unless doing so would reduce a dimension below 2 values.
\end{itemize}
\end{enumerate}

\begin{intuition}
Rung~1 searches broadly: 3 seed styles $\times$ 2 encoder styles $\times$ 3 verification modes $\times$ 4 postselection modes $\times$ \ldots{} $\approx 10^3$ combinations.
By Rung~5, lessons have fixed the seed style, fixed the encoder, and eliminated half the verification modes. The search space might be down to ${\sim}20$ combinations. The later rungs are not wasting budget re-exploring dead ends.
\end{intuition}

% ── 4.3 Search strategies ──────────────────────────────────────────────────
\subsection{Challenger Generation Strategies}
\label{sec:strategies}

The system uses a composite generator that allocates its per-step budget across three strategies:

\subsubsection{NeighborWalk (40\% budget)}

The simplest strategy: for each dimension in the search space, try every alternative value while holding all other dimensions fixed. This is a single-axis perturbation, equivalent to one step of coordinate descent. It is thorough within a Hamming distance of~1 from the incumbent but cannot escape local optima.

\subsubsection{RandomCombo (30\% budget, or 60\% when no lessons exist)}

Picks 1--3 dimensions uniformly at random and samples a new value for each. This produces multi-axis mutations that can jump across the landscape:
\begin{lstlisting}[style=python]
n_dims = random.randint(1, min(3, len(dim_names)))
for dim in random.sample(dim_names, n_dims):
    # Exclude the incumbent's current value so every mutation is a real change.
    alternatives = [v for v in search_space[dim] if v != incumbent[dim]]
    updates[dim] = random.choice(alternatives)
\end{lstlisting}
The key property: unlike NeighborWalk, this can change \code{verification} and \code{postselection} simultaneously, which matters because these two dimensions interact (see \S\ref{sec:interactions}).

\subsubsection{LessonGuided (30\% budget, only when lessons exist)}

Reads the \code{SearchRule} list from accumulated lesson feedback and constructs candidates that respect the rules:
\begin{enumerate}
\item Apply all \code{fix} rules unconditionally (e.g., always set \code{seed\_style=ry\_rz}).
\item For dimensions with \code{prefer} rules, sample from preferred values with probability proportional to the rule's confidence score.
\item For dimensions with \code{avoid} rules, exclude those values from sampling. \item With 50\% probability, also explore non-preferred, non-avoided values to prevent premature convergence. \end{enumerate} All three strategies check every generated candidate against the global history set and discard duplicates before returning. % ============================================================================ \section{The Lesson Feedback System} \label{sec:lessons} % ============================================================================ The feedback system is the component that makes the ratchet \emph{intelligent} rather than merely \emph{persistent}. It transforms raw experiment data into actionable directives. \subsection{Rule Extraction} \label{sec:rules} Given $N$ experiment records from a completed rung, the system extracts three types of rules: \paragraph{Per-dimension mean effects.} For each dimension $d$ and each value $v$ in the search space, compute: \begin{equation} \delta_{d,v} = \bar{s}_{d=v} - \bar{s}_{\text{overall}} \label{eq:delta} \end{equation} where $\bar{s}_{d=v}$ is the mean score of all experiments that used value $v$ for dimension $d$. If $\delta > \tau$ (default threshold $\tau = 0.005$), emit a \code{prefer} rule. If $\delta < -\tau$, emit an \code{avoid} rule. The confidence is $\min(1, n_{d=v} / N)$, where $n_{d=v}$ is the number of experiments with that value. \paragraph{Fix detection.} Let $\mathcal{T}_K$ be the top-$K$ experiments by score, where $K = \min(5, \max(3, \lfloor N/3 \rfloor))$. For each dimension, if all experiments in $\mathcal{T}_K$ share the same value $v^*$, and $\bar{s}_{d=v^*} > \bar{s}_{d \neq v^*}$, emit a \code{fix} rule with confidence $K/N$. 
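The per-dimension mean-effect extraction of Eq.~\eqref{eq:delta} is small enough to sketch directly. The record shapes and function name here are illustrative, not the repository's actual API:
\begin{lstlisting}[style=python]
TAU = 0.005  # effect-size threshold from the text

def extract_mean_effect_rules(records, dimensions):
    """records: list of (spec_dict, score); dimensions: name -> values."""
    overall = sum(s for _, s in records) / len(records)
    rules = []
    for dim, values in dimensions.items():
        for v in values:
            scores = [s for spec, s in records if spec.get(dim) == v]
            if not scores:
                continue
            delta = sum(scores) / len(scores) - overall   # Eq. (delta)
            confidence = min(1.0, len(scores) / len(records))
            if delta > TAU:
                rules.append(("prefer", dim, v, confidence))
            elif delta < -TAU:
                rules.append(("avoid", dim, v, confidence))
    return rules
\end{lstlisting}
On a toy dataset where \code{ry\_rz} runs score 1.0 and \code{h\_p} runs score 0.0, this emits a \code{prefer} rule for \code{ry\_rz} and an \code{avoid} rule for \code{h\_p}, each with confidence 0.5.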
\paragraph{Interaction detection.} \label{sec:interactions} For each pair of dimensions $(d_a, d_b)$, compute the joint effect and compare it to the sum of marginals: \begin{equation} I_{v_a, v_b} = \underbrace{(\bar{s}_{d_a=v_a, d_b=v_b} - \bar{s}_{\text{overall}})}_{\text{joint effect}} - \underbrace{(\delta_{d_a, v_a} + \delta_{d_b, v_b})}_{\text{sum of marginals}}. \label{eq:interaction} \end{equation} If $|I| > 2\tau$, the pair exhibits a synergy ($I > 0$) or conflict ($I < 0$) that the marginal analysis would miss. A rule is emitted for the composite dimension $d_a{+}d_b$. \begin{keyconcept} Interaction detection matters because the $[\![4,2,2]\!]$ experiment has genuine dimension couplings. For example, \code{verification=z\_only} combined with \code{postselection=z\_only} may outperform the additive prediction because $Z$-only postselection is exactly matched to $Z$-only verification: no shots are wasted on $X$-syndrome failures that cannot be postselected against. \end{keyconcept} \subsection{Search Space Narrowing} \label{sec:narrowing} The \code{narrow\_search\_space()} function applies the extracted rules to physically shrink the dimension grid: \begin{lstlisting}[style=python] for dim, values in search_space.dimensions.items(): if dim in fix_map and fix_map[dim] in values: new_dims[dim] = [fix_map[dim]] # Lock to one value elif dim in avoid_map: filtered = [v for v in values if v not in avoid_map[dim]] if len(filtered) >= min_values_per_dim: new_dims[dim] = filtered # Remove bad values else: new_dims[dim] = list(values) # Safety: keep all \end{lstlisting} The safety clause prevents over-pruning: if avoiding too many values would leave a dimension with fewer than 2 options, the original set is kept. Confidence thresholds ($\geq 0.3$ for avoid, $\geq 0.4$ for fix) provide additional conservatism. 
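The interaction statistic of Eq.~\eqref{eq:interaction}, which feeds composite-dimension rules into this pruning step, can be sketched the same way (illustrative record shapes only):
\begin{lstlisting}[style=python]
def interaction(records, dim_a, va, dim_b, vb):
    """I = joint effect minus sum of marginal effects, Eq. (interaction)."""
    overall = sum(s for _, s in records) / len(records)

    def effect(pred):
        xs = [s for spec, s in records if pred(spec)]
        return sum(xs) / len(xs) - overall

    joint = effect(lambda r: r[dim_a] == va and r[dim_b] == vb)
    marginals = (effect(lambda r: r[dim_a] == va)
                 + effect(lambda r: r[dim_b] == vb))
    return joint - marginals

# Matched z_only verification + z_only postselection beating the additive
# prediction shows up as I > 2*tau (synergy):
records = [({"ver": "z_only", "post": "z_only"}, 1.0),
           ({"ver": "z_only", "post": "all_measured"}, 0.2),
           ({"ver": "both",   "post": "z_only"}, 0.2),
           ({"ver": "both",   "post": "all_measured"}, 0.2)]
assert interaction(records, "ver", "z_only", "post", "z_only") > 2 * 0.005
\end{lstlisting}
In this toy example the joint effect is $+0.6$ while the marginals sum to $+0.4$, giving $I = +0.2$: a synergy that per-dimension analysis alone would miss.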
\subsection{Dual Output Format} \label{sec:dual} Every rung produces two artefacts: \begin{enumerate} \item \textbf{RungLesson} (human-readable Markdown): sections for ``What Helped'', ``What Hurt'', ``Invariants'', ``Hardware-Specific Effects'', ``Next Tests'', ``Promote Forward'', and ``Discard''. Saved as \file{lesson.md}. \item \textbf{LessonFeedback} (machine-readable JSON): the list of \code{SearchRule} objects, the narrowed dimension grid, the best spec's field values, and optional transfer scores. Saved as \file{lesson\_feedback.json} and consumed by the \code{LessonGuided} strategy and the propagation logic. \end{enumerate} % ============================================================================ \section{Transfer Evaluation} \label{sec:transfer} % ============================================================================ Rung~3 asks: does this recipe generalise, or is it overfitting to one backend's noise profile? The \code{TransferEvaluator} answers this by running the \emph{same} spec on multiple backend noise models and scoring pessimistically: \begin{equation} s_{\text{transfer}} = \min_{b \in \mathcal{B}} \; s_b(\text{spec}) \label{eq:transfer} \end{equation} where $\mathcal{B}$ is the set of backends (e.g., \code{fake\_brisbane}, \code{fake\_kyoto}, \code{fake\_sherbrooke}) and $s_b$ is the WAC score under backend $b$'s noise model. \begin{lstlisting}[style=shell] python -m autoresearch_quantum run-transfer \ --config configs/rungs/rung3.yaml \ --backends fake_brisbane fake_kyoto fake_sherbrooke \end{lstlisting} \begin{warning} This is different from searching over \code{target\_backend} as a dimension (which the original Codex implementation did). Searching over backends finds the \emph{easiest} backend. Transfer evaluation finds specs that work \emph{everywhere}---a much harder and more useful criterion. 
\end{warning}

% ============================================================================
\section{The Five Rungs in Detail} \label{sec:rungs}
% ============================================================================

\begin{table}[H]
\centering \small
\begin{tabular}{@{}clp{4.8cm}lll@{}}
\toprule
\textbf{Rung} & \textbf{Name} & \textbf{Question} & \textbf{Score} & \textbf{Steps} & \textbf{Key shift} \\
\midrule
1 & Baseline & What recipe works at all? & WAC & 3 & Broad search \\
2 & Stability & Is it stable under noise variation? & WAC & 3 & $+20\%$ stability weight \\
3 & Transfer & Does it work on other devices? & WAC & 3 & Pessimistic cross-backend \\
4 & Factory & What maximises throughput/cost? & Factory & 3 & $1.5\times$ cost penalty \\
5 & Rosenfeld & Which heuristics are load-bearing? & Factory & 4 & Narrowest search space \\
\bottomrule
\end{tabular}
\caption{The five rungs and their progressive focus.}
\label{tab:rungs}
\end{table}

Each rung YAML configures different quality weights. Rung~1 puts 40\% weight on noisy fidelity (because early on, we want to find specs that even \emph{approach} the ideal under noise). By Rung~2, 20\% shifts to stability. By Rung~5, the factory throughput score replaces WAC entirely, and stability carries 25\% weight.

The bootstrap incumbent for Rung~1 is human-chosen. For Rungs 2--5, it is automatically overridden by the winner from the previous rung via propagation (\S\ref{sec:propagation}).

% ============================================================================
\section{Resumability and Persistence} \label{sec:persistence}
% ============================================================================

All state is persisted as JSON files under a store directory:

\begin{lstlisting}[style=shell]
data/default/
  rung_1/
    experiments/               # One JSON per evaluated spec
      r1-incumbent-a3f2b8c901.json
      r1-challenger-7e9d1f0ab3.json
      ...
    ratchet_steps/             # One JSON per step
      step_0001.json
    incumbent.json             # {"experiment_id": "r1-incumbent-..."}
    progress.json              # Resumability checkpoint
    lesson.json                # Machine-readable lesson
    lesson.md                  # Human-readable narrative
    lesson_feedback.json       # SearchRules for next rung
  rung_2/
    propagated_spec.json       # Winner carried from rung 1
  ...
\end{lstlisting}

The \code{progress.json} file records the exact state needed to resume:

\begin{lstlisting}[style=python]
from dataclasses import dataclass

@dataclass
class RungProgress:
    rung: int
    steps_completed: int
    patience_remaining: int
    current_incumbent_id: str
    completed: bool = False
\end{lstlisting}

If the process is interrupted, re-running the same command detects the incomplete progress and resumes from the last checkpoint. No experiment is re-evaluated, and the patience counter is preserved.

% ============================================================================
\section{Verification: Claims and Tests} \label{sec:verification_claims}
% ============================================================================

The full test suite contains 335 tests across 13 files, covering the research engine, teaching layer, notebook structure, and pedagogical quality. Below we present the 20 core research-engine tests, grouped by subsystem, with the falsification condition for each.

\subsection{Quantum Correctness (3 tests)}

\begin{claim} \label{claim:stabilizers}
The ideal encoded magic state is a $+1$ eigenstate of both stabilisers.
\end{claim}
\textbf{Test:} \code{test\_encoded\_target\_state\_satisfies\_stabilizers}. Computes $\expect{S_X}$ and $\expect{S_Z}$ on the statevector produced by the default preparation circuit. Asserts $|\expect{S} - 1| < 10^{-8}$. \emph{Would fail if:} the encoder circuit is wrong, or the seed gate prepares the wrong single-qubit state.

\begin{claim} \label{claim:bundle}
The measurement bundle contains exactly the four expected circuits with correct metadata.
\end{claim} \textbf{Test:} \code{test\_circuit\_bundle\_contains\_expected\_contexts}. \emph{Would fail if:} a circuit is missing, mislabelled, or lacks the \code{logical\_operator} metadata needed by the analysis code. \begin{claim} \label{claim:executor} The local executor produces positive scores with physically meaningful metrics under noisy simulation. \end{claim} \textbf{Test:} \code{test\_local\_executor\_produces\_score}. Runs the full evaluation pipeline (build, transpile, simulate, postselect, score) and checks $s > 0$, $0 \leq r_{\text{accept}} \leq 1$, $0 \leq W \leq 1$. \emph{Would fail if:} any stage of the pipeline is broken, e.g., the transpiler produces an invalid circuit, or the noise model is incompatible. \subsection{Search Correctness (5 tests)} \begin{claim} \label{claim:neighbor} NeighborWalk mutates exactly one dimension per challenger. \end{claim} \textbf{Test:} \code{test\_neighbor\_challengers\_mutate\_single\_dimension}. For each generated challenger, counts the number of fields that differ from the incumbent. Asserts the count is exactly~1. \begin{claim} \label{claim:history} History deduplication prevents re-evaluation. \end{claim} \textbf{Test:} \code{test\_neighbor\_walk\_respects\_history}. Generates challengers, passes their fingerprints as history, regenerates. Second call must return an empty list. \begin{claim} \label{claim:multiaxis} RandomCombo produces multi-axis mutations. \end{claim} \textbf{Test:} \code{test\_random\_combo\_generates\_multi\_axis\_mutations}. With 3 dimensions and 10 candidates, at least one must differ from the incumbent in $>1$ field. (Probabilistic, but failure probability is $< 10^{-6}$.) \begin{claim} \label{claim:guided} LessonGuided respects \code{fix} rules. \end{claim} \textbf{Test:} \code{test\_lesson\_guided\_uses\_rules}. Given a \code{fix} rule for \code{seed\_style=ry\_rz}, every generated challenger must have \code{spec.seed\_style == "ry\_rz"}. 
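Claims~\ref{claim:neighbor} and~\ref{claim:history} are easy to visualise with a minimal NeighborWalk-style generator. The sketch below is written under stated assumptions (specs as plain dicts, fingerprints as sorted item tuples, exhaustive rather than sampled neighbours) and is not the package API:

\begin{lstlisting}[style=python]
def neighbor_challengers(incumbent, dimensions, history):
    # Each challenger differs from the incumbent in exactly one dimension;
    # fingerprints already present in `history` are skipped (deduplication).
    challengers = []
    for dim, values in dimensions.items():
        for v in values:
            if v == incumbent[dim]:
                continue
            cand = dict(incumbent, **{dim: v})  # mutate a single dimension
            if tuple(sorted(cand.items())) not in history:
                challengers.append(cand)
    return challengers
\end{lstlisting}

Counting the fields on which each candidate differs from the incumbent gives exactly one, which is the assertion behind Claim~\ref{claim:neighbor}; feeding all returned fingerprints back in as \code{history} makes a second call return the empty list, which is the behaviour tested by Claim~\ref{claim:history}.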
\begin{claim} \label{claim:composite} The composite generator stays within the budget cap. \end{claim} \textbf{Test:} \code{test\_composite\_generator\_combines\_strategies}. Asserts $0 < |\mathcal{G}| \leq \text{max\_challengers}$. \subsection{Lesson Correctness (3 tests)} \begin{claim} \label{claim:rules} Per-dimension analysis emits correct prefer/avoid directives. \end{claim} \textbf{Test:} \code{test\_extract\_search\_rules\_prefer\_and\_avoid}. With synthetic data where \code{z\_only} scores $\sim\!0.82$ and \code{both} scores $\sim\!0.52$, the rules must contain \code{(verification, prefer, z\_only)} and \code{(verification, avoid, both)}. \begin{claim} \label{claim:narrow} Search space narrowing removes avoided values and constrains fixed dimensions. \end{claim} \textbf{Test:} \code{test\_narrow\_search\_space\_removes\_avoided}. After applying an \code{avoid} rule for \code{x\_only} and a \code{fix} rule for \code{ry\_rz}, asserts \code{x\_only} is absent and the fixed dimension has exactly one value. \begin{claim} \label{claim:feedback} The end-to-end feedback pipeline produces correct best-spec fields. \end{claim} \textbf{Test:} \code{test\_build\_lesson\_feedback\_end\_to\_end}. With synthetic data, the \code{best\_spec\_fields} must report the top-scoring experiment's spec. \subsection{Scoring Correctness (2 tests)} \begin{claim} \label{claim:factory} The factory score computes throughput metrics and attaches them to the record. \end{claim} \textbf{Test:} \code{test\_factory\_throughput\_score\_produces\_metrics}. With known inputs ($r_{\text{accept}} = 0.7$, $W = 0.8$), verifies $s > 0$ and \code{metrics.extra["factory\_metrics"]["accepted\_states\_per\_shot"] == 0.70}. \begin{claim} \label{claim:registry} Both scoring functions are registered and dispatchable by name. \end{claim} \textbf{Test:} \code{test\_score\_registry\_has\_factory}. Asserts \code{"factory\_throughput"} is a key in \code{SCORE\_REGISTRY}. 
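The metric asserted in Claim~\ref{claim:factory} follows directly from the definition of the acceptance rate: with $r_{\text{accept}} = 0.7$, each shot yields on average $0.7$ accepted states. The helper below sketches only that bookkeeping; the real factory score is richer (it includes the $1.5\times$ cost penalty of Table~\ref{tab:rungs}), and the \code{quality\_adjusted\_throughput} field is a hypothetical illustration, not a documented metric.

\begin{lstlisting}[style=python]
def factory_metrics(accept_rate, witness):
    # Sketch only: accepted states per shot equals the acceptance rate;
    # the witness-discounted throughput is an illustrative extra field.
    return {
        "accepted_states_per_shot": round(accept_rate, 2),
        "quality_adjusted_throughput": accept_rate * witness,
    }
\end{lstlisting}

With the test's inputs ($r_{\text{accept}} = 0.7$, $W = 0.8$), this reproduces the asserted \code{accepted\_states\_per\_shot == 0.70}.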
\subsection{Transfer Correctness (1 test)} \begin{claim} \label{claim:transfer} The transfer evaluator produces a valid report with per-backend scores. \end{claim} \textbf{Test:} \code{test\_transfer\_evaluator\_runs\_across\_backends}. Runs a single-backend transfer evaluation and checks that the report has the correct type and a positive transfer score. \subsection{Persistence Correctness (2 tests)} \begin{claim} \label{claim:progress} Progress checkpoints survive serialisation round-trips. \end{claim} \textbf{Test:} \code{test\_save\_and\_load\_progress}. Writes a \code{RungProgress} to disk, reads it back, verifies all fields match. \begin{claim} \label{claim:feedback_persist} Lesson feedback (including SearchRule objects) survives serialisation. \end{claim} \textbf{Test:} \code{test\_save\_and\_load\_lesson\_feedback}. Round-trips a \code{LessonFeedback} with one rule, verifies the rule's dimension and action are preserved. \subsection{Integration (4 tests)} \begin{claim} \label{claim:rung_progress} A full rung run saves progress and marks completion. \end{claim} \textbf{Test:} \code{test\_run\_rung\_saves\_progress}. After \code{run\_rung()}, \code{progress.json} must exist with \code{completed=true}. \begin{claim} \label{claim:rung_lesson} A full rung returns both a human lesson and machine feedback. \end{claim} \textbf{Test:} \code{test\_run\_rung\_returns\_lesson\_and\_feedback}. Checks the return type is a 3-tuple of (steps, RungLesson, LessonFeedback). \begin{claim} \label{claim:propagation} Multi-rung ratchet propagates winners and accumulates lessons. \end{claim} \textbf{Test:} \code{test\_run\_ratchet\_propagates\_winner}. Runs a two-rung ratchet and asserts \code{len(harness.\_accumulated\_lessons) == 2}. \begin{claim} \label{claim:seed} Different specs receive different simulator seeds. \end{claim} \textbf{Test:} \code{test\_different\_specs\_get\_different\_seeds}. 
Computes $\text{SHA-256}(\text{fingerprint} \| \text{repeat\_index})$ for two specs with different \code{verification} values. The seeds must differ. \emph{Regression test:} the original code used $11000 + i$ for all specs.

% ============================================================================
\section{Usage Without AI Assistance} \label{sec:usage}
% ============================================================================

\subsection{Installation}

\begin{lstlisting}[style=shell]
cd autoresearch-quantum
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python -m pytest tests/ -v      # 335 tests
bash scripts/app.sh validate    # full validation (lint + types + tests)
\end{lstlisting}

Requires Python $\geq$ 3.11 and Qiskit $\geq$ 2.3. No GPU needed.

\subsection{CLI Reference}

\begin{table}[H]
\centering \small
\begin{tabular}{@{}lp{9cm}@{}}
\toprule
\textbf{Command} & \textbf{What it does} \\
\midrule
\code{run-experiment} & Evaluate a single spec. Use \code{--set key=val} to override fields. \\
\code{run-step} & One ratchet step: generate challengers, evaluate, maybe update incumbent. \\
\code{run-rung} & Full rung: repeated steps until budget or patience exhausted. Produces lesson. \\
\code{run-ratchet} & Multiple rungs in sequence with automatic propagation. \\
\code{run-transfer} & Evaluate one spec across multiple backends. \\
\bottomrule
\end{tabular}
\end{table}

\subsection{The Feedback Artefacts and How to Read Them} \label{sec:artefacts}

After running \code{run-rung --config configs/rungs/rung1.yaml}:

\paragraph{lesson.md} is the human report. Skim the ``What Helped'' and ``What Hurt'' sections first---they tell you which knobs to turn. The ``Invariants'' section tells you which knobs don't seem to matter (yet).

\paragraph{lesson\_feedback.json} is the machine report.
The \code{rules} array is the most important part:

\begin{lstlisting}[style=python]
{
  "dimension": "verification",
  "action": "prefer",
  "value": "z_only",
  "confidence": 0.67,
  "reason": "mean score 0.1823 is +0.0312 above overall mean (8 samples)"
}
\end{lstlisting}

Read it as: ``With 67\% confidence (based on 8 out of 12 experiments), \code{verification=z\_only} helps.'' A confidence below 0.3 means the system saw too few samples to be sure.

\paragraph{incumbent.json} points to the current best. Load the corresponding experiment file to see its full spec and all metrics.

\paragraph{propagated\_spec.json} (Rungs 2--5 only) is the spec that was carried forward. Compare it with the YAML bootstrap to see what changed automatically.

\subsection{Manual Iteration Workflow}

\begin{enumerate}
\item Run Rung~1. Read \file{lesson.md}.
\item Edit \file{rung2.yaml}: remove values that the lesson says to avoid. Add new values for dimensions you want to explore.
\item Run Rung~2. Or run the full ratchet and let propagation handle it.
\item Open \file{lesson\_feedback.json} from Rung~1 and Rung~3 side by side. Rules that appear in both with high confidence are load-bearing. Rules that appear only early on were artefacts of a specific noise model.
\item Test specific hypotheses: \code{run-experiment --set approximation\_degree=0.95}. The result joins the store and influences the next lesson extraction.
\item If interrupted, just re-run the same command. Progress is checkpointed.
\end{enumerate}

% ============================================================================
\section{Limitations and Future Work} \label{sec:limitations}
% ============================================================================

\begin{enumerate}
\item \textbf{Simulation only.} The hardware executor exists but is disabled by default. Real-device runs require IBM Quantum credentials and the \code{qiskit-ibm-runtime} package.
\item \textbf{No distillation.} Rung~5 identifies factory-relevant heuristics, but the system does not build or evaluate distillation circuits. Extending to a 15-to-1 protocol is future work. \item \textbf{No parallelism.} Experiments run sequentially. A process-level sharding of the challenger set is straightforward but not implemented. \item \textbf{Small code.} The $[\![4,2,2]\!]$ code has distance~2 and cannot correct errors. Scaling to the Steane code ($[\![7,1,3]\!]$) or surface codes requires extending the circuit builder, but the search engine is code-agnostic. \item \textbf{Witness is not fidelity.} The witness $W$ is a necessary condition for magic-state quality, not sufficient. A full tomographic characterisation would be more rigorous but is orders of magnitude more expensive. \item \textbf{No LLM in the loop.} The ``auto'' is algorithmic (statistics + guided search), not generative. Integrating an LLM to propose novel circuit architectures---rather than just varying parameters of a fixed family---is a natural extension. \end{enumerate} % ============================================================================ \section{Conclusion} \label{sec:conclusion} % ============================================================================ \textsc{autoresearch-quantum} demonstrates that the Karpathy autoresearch pattern transfers naturally to quantum computing experiments. The key ingredients are: \begin{itemize} \item A well-defined scalar objective that balances quality and cost. \item A combinatorial search space of physically motivated choices. \item A multi-strategy challenger generator that balances exploitation (NeighborWalk), exploration (RandomCombo), and learning (LessonGuided). \item A feedback loop that converts experiment results into machine-readable rules, narrows the search space, and propagates knowledge across rungs. \item A monotone ratchet that guarantees the incumbent never regresses. 
\end{itemize} The system's output is not just a circuit---it is a structured body of experimental evidence, human-readable lessons, and machine-readable directives that a researcher can build on without re-running the search. \vspace{1em} \hrule \vspace{0.5em} {\small All code, configurations, and tests are available at \url{https://github.com/saymrwulf/autoresearch-quantum}. The system requires Python $\geq$ 3.11, Qiskit $\geq$ 2.3, and Qiskit Aer $\geq$ 0.17.} % ============================================================================ % References % ============================================================================ \begin{thebibliography}{9} \bibitem{bravyi2005} S.~Bravyi and A.~Kitaev, ``Universal quantum computation with ideal Clifford gates and noisy ancillas,'' \emph{Phys.\ Rev.\ A}, vol.~71, no.~2, p.~022316, 2005. \bibitem{karpathy2025} A.~Karpathy, ``Autoresearch,'' \url{https://github.com/karpathy/autoresearch}, 2025. \bibitem{qiskit2024} Qiskit contributors, ``Qiskit: An open-source framework for quantum computing,'' \url{https://github.com/Qiskit/qiskit}, 2024. \bibitem{gottesman1997} D.~Gottesman, ``Stabilizer codes and quantum error correction,'' PhD thesis, California Institute of Technology, 1997. \bibitem{knill2005} E.~Knill, ``Quantum computing with realistically noisy devices,'' \emph{Nature}, vol.~434, pp.~39--44, 2005. \bibitem{campbell2017} E.~T.~Campbell, B.~M.~Terhal, and C.~Vuillot, ``Roads towards fault-tolerant universal quantum computation,'' \emph{Nature}, vol.~549, pp.~172--179, 2017. \bibitem{litinski2019} D.~Litinski, ``Magic state distillation: Not as costly as you think,'' \emph{Quantum}, vol.~3, p.~205, 2019. \end{thebibliography} \end{document}