A Classical Architecture For Digital Quantum Computers

Scaling bottlenecks the making of digital quantum computers, posing challenges from both the quantum and the classical components. We present a classical architecture to cope with a comprehensive list of the latter challenges {\em all at once}, and implement it fully in an end-to-end system by integrating a multi-core RISC-V CPU with our in-house control electronics. Our architecture enables scalable, high-precision control of large quantum processors and accommodates evolving requirements of quantum hardware. A central feature is a microarchitecture executing quantum operations in parallel on arbitrary predefined qubit groups. Another key feature is a reconfigurable quantum instruction set that supports easy qubit re-grouping and instruction extensions. As a demonstration, we implement the widely-studied surface code quantum computing workflow, which is instructive for being demanding on both the controllers and the integrated classical computation. Our design, for the first time, reduces instruction issuing and transmission costs to constants, which do not scale with the number of qubits, without adding any overheads in decoding or dispatching. Rather than relying on specialized hardware for syndrome decoding, our system uses a dedicated multi-core CPU for both qubit control and classical computation, including syndrome decoding. This simplifies the system design and facilitates load-balancing between the quantum and classical components. We implement recent proposals as decoding firmware on a RISC-V system-on-chip (SoC) that parallelizes general inner decoders. By using our in-house Union-Find and PyMatching 2 implementations, we can achieve unprecedented decoding capabilities of up to distances 47 and 67 with the currently available SoCs, under realistic and optimistic assumptions of physical error rate $p=0.001$ and $p=0.0001$, respectively, all in just 1 \textmu s.


I. MOTIVATIONS AND SUMMARY OF RESULTS
As quantum computers become more sophisticated [1], [2], [4], [17], their demands on the classical control multiply accordingly. In this section, we analyze those challenges, then summarize our solutions. We confine this work to the superconducting-circuit platform, the focus of our team. We first review the setup as the starting point for our discussion.
Superconducting system setup. Figure 1 illustrates a standard setup for a superconducting quantum computing system.
Quantum computing workflows. Applications are the end goals of quantum computers, and thus the origins of their design requirements. Most applications belong to one of two main paradigms: noisy intermediate-scale quantum (NISQ) applications and fault-tolerant quantum computations (FTQC). NISQ applications operate on noisy, unprotected physical qubits, and are limited in scale and in precision. FTQCs operate on encoded logical qubits, each consisting of (likely) thousands of physical qubits. The logical qubits have drastically reduced sensitivity to physical-level noise, allowing computations of arbitrary length and scale and, consequently, the ultimate quantum advantages.
In NISQ, the PC sends the quantum circuit to the control electronics. The latter parse the circuit into microwave waveforms, play them synchronously on the drive lines to the qubits, process the measurement responses from the quantum chip, and finally return the measurement results to the PC. The PC can then perform classical post-processing, before possibly starting the next round of quantum circuit execution.
FTQC differs from NISQ in several key aspects. First, it requires constant extraction and decoding of the classical error syndromes, which are constantly churned out by the faulty quantum circuits. The decoding in turn requires real-time and intense classical computation. Second, while NISQ executes a static circuit, FTQC requires dynamic quantum circuit generation according to the decoding results.
Both NISQ and FTQC demand seamless coordination and collaboration between classical and quantum computational resources, which in turn require a co-design of classical and quantum architecture. We focus on the design and implementation of classical architectures. We analyze the challenges from two perspectives: scaling-up and actual implementation of a complete system.
Challenges in scaling up the classical architecture. Maintaining high precision in the control of quantum hardware is the primary requirement here, as it directly affects the fidelities of the quantum operations involved. Failing to do so would result in a performance loss that eventually needs to be compensated by the quantum hardware, compounding the difficulty for the latter. Specifically, for superconducting qubits, microwave pulses played on different AWG channels and the sampling windows of digitizer channels need to be synchronized at the picosecond level to ensure high-fidelity physical operations [41].
A second set of challenges is caused by the large number of instructions: the efficiency of their issuance, transmission, and execution as the number of qubits grows. These problems have been recognized by several authors [7], [11], [19], and we refer to them together as "instruction stresses". In FTQC, dynamic quantum instructions need to be issued and transmitted fast enough to keep pace with the rapid quantum execution, posing a hard constraint on the classical architecture. This may not be required for NISQ, but is still desirable as it would decrease the total running time.
Syndrome decoding is yet another major bottleneck to FTQC classical architecture [36]. For surface code schemes on present-day superconducting qubits, one round of syndrome extraction takes roughly 1 µs [3], and generates $O(d^2)$ bits of syndrome information in parallel, for d being the code distance. Against this increasing syndrome size, the decoding algorithm needs to keep up with the constant syndrome extraction time in order to avoid an exponential syndrome backlog.
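To make the backlog concern concrete, the following is a minimal toy model, not taken from our implementation. It assumes a decoder that needs f round-times to decode one round of syndromes, and a circuit that must stall at every decoding-dependent gate until all syndromes produced so far are decoded. The function name, the parameter `rounds_between_gates`, and the stalling rule are illustrative assumptions.

```python
def backlog_after_gates(f, rounds_between_gates, num_gates):
    """Toy model of syndrome backlog (all modelling choices are assumptions).

    f: decoding time per round divided by syndrome generation time per round.
    Returns the number of undecoded rounds at each decoding-dependent gate.
    """
    backlog = 0.0
    history = []
    for _ in range(num_gates):
        # While the circuit runs, rounds arrive at rate 1 and are decoded at rate 1/f.
        backlog = max(0.0, backlog + rounds_between_gates * (1.0 - 1.0 / f))
        history.append(backlog)
        # The circuit stalls for f * backlog round-times while the decoder clears
        # the backlog; syndrome extraction continues, producing f * backlog new rounds.
        backlog = f * backlog
    return history


print(backlog_after_gates(2.0, 10, 4))   # decoder twice too slow: backlog keeps growing
print(backlog_after_gates(0.5, 10, 4))   # decoder fast enough: backlog stays at zero
```

In this sketch the backlog is multiplied by f at every stall, so any f > 1 gives growth that is exponential in the number of decoding-dependent gates, while f ≤ 1 keeps it bounded.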
Multiple decoding schemes were proposed to tackle this problem [12], [13], [18], [22], [37], but they could handle code distances of no more than 11, even with specialized hardware. Recently, a new parallel decoding scheme was proposed independently in [33], [34]. An implementation of the scheme achieved a code distance of 11 for physical error rate p = 0.4% [31].
A fourth set of challenges originates from a desirable feature that we call "permissiveness", which means the ability to accommodate evolving requirements from other components of a quantum computer. Our field experience indicates that implementing a complete classical architecture is time-consuming and labor-intensive. On the other hand, in this early stage of quantum computing, changes are rapid in applications, hardware characteristics, and error-correction schemes. Thus a stable yet permissive classical architecture would be cost-effective in the classical-quantum co-design process.
Challenges for implementing a complete system. Many researchers have proposed innovative solutions addressing one or a few of the above problems. Ultimately, a single system needs to be built for a real quantum computer. Building such a complete system has the additional challenge of balancing competing objectives with currently available and compatible technologies. To our knowledge, there has not been a system implementation addressing all the aforementioned challenges in scalability.
Our contributions. We present and implement a classical architecture to address all the scalability challenges mentioned above in one single system.
1) Our system provides high-fidelity qubit control by interconnecting one-chassis PXIe systems through a star-like hierarchy with high-density connectors. This design synchronizes, with high accuracy, pulses from different control electronics, enabling precise qubit control even as the system size increases, thereby maintaining high fidelity. This conclusion is supported by extensive testing of the channel-to-channel and phase jitter of the AWG outputs.
2) To address instruction stresses, we develop an efficient "quantum instruction pipeline" that combines Single-Instruction-Multiple-Data (SIMD) with a broadcasting mechanism. This pipeline enables parallel application of the same type of gate on arbitrarily-sized qubit groups. The operation types and the qubit groups of application-specific instructions can be easily configured either prior to the execution of the quantum program or during runtime. Additionally, the costs across instruction issuing, transmission, dispatching and execution do not scale with the size of each qubit group.
3) Our system's permissiveness is achieved through a combination of features, including a reconfigurable quantum instruction set, Memory-Mapped IO (MMIO) in the microarchitecture, and a portable general-purpose CPU. The instruction set and the underlying MMIO-based microarchitecture facilitate the incorporation of new quantum instructions.
4) We achieve unprecedented performance in decoding throughput for surface codes, a mainstream approach to which our architecture and the implemented system are nevertheless not restricted. More specifically, we implement the surface code and a parallel decoding firmware based on the recent theoretical proposals [31], [34] in a dedicated CPU and benchmark its performance on a development board. By leveraging our in-house Union-Find and PyMatching 2 [21] as inner decoders, we can decode up to distances 13 and 31 on SiFive P650 [32] or T-head C910 [10], or 41 and 67 with ET-SoC-1 [16], all in just 1 microsecond for physical error rate p = 0.0001.
The proposed classical architecture is implemented fully in an end-to-end quantum computer system by integrating a multi-core, vectorized RISC-V CPU with our in-house control electronics. Our system also features an MLIR-based compiler that supports the proposed reconfigurable quantum instruction set and enables optimization possibilities on various abstraction layers. Among the many features of our implemented system, we highlight the following.

5) Low communication latency. A key metric for FTQC is the "decoding latency", i.e. the time between the completions of syndrome generation and decoding. Such latency consists of the decoding algorithm latency and the communication latency between the control system and the quantum device. In our design, we use on-board communication to reduce the latter. This design also enables other capabilities where latency plays a critical role, such as on-the-fly calibration [20], [27], [29], and just-in-time compilation [39], [40].
6) Load balancing through multi-core CPU. The bottleneck in classical computation is not always syndrome decoding, and can vary during the computational process. To accommodate different scenarios, we use a dedicated multi-core CPU in our system, allowing dynamic allocation of cores to syndrome decoding, qubit control or other computation-heavy tasks. This design allows us to achieve optimal performance while avoiding the unnecessary complexity and cost of using specialized hardware for syndrome decoding.

Comparison with previous work.
To the best of our knowledge, our architecture proposal and the resulting actual implementation represent the first attempt to address, in a single system, the above comprehensive list of scaling challenges for the classical architecture.
Instruction stresses have long been known, with various mitigating approaches proposed [7], [11], [19]. Those include using Single-Instruction-Multiple-Data (SIMD) and Very-Long-Instruction-Word (VLIW) to reduce the instruction issuance rate [19], and multiprocessors to increase quantum operation and circuit-level parallelism [42]. However, those methods provide only constant-factor improvements and are insufficient to cope with the increasing overhead that scales with the code distance in surface code quantum computing.
The QuEST proposal [35] addresses the instruction bandwidth problem by employing dedicated programmable microcode engines. While it shows promise for enabling real-time instruction issuing, it crucially relies on an assumption from the underlying primeline microarchitecture [23]: that all qubits driven at a given time must share the same frequency. This may hold for some quantum computing platforms, such as cold atoms or trapped ions, but not for superconducting qubits, where frequency differences are likely inevitable and sometimes a design preference. Furthermore, the absence of scaling analysis makes it unclear how the frequency requirement would affect the performance in an actual implementation.
Syndrome decoding has been another well-known concern in the quantum computing community for over a decade [36], with proposals ranging from efficient algorithms to specific microarchitectures [12], [14], [15], [21]. Before our work, it remained an open problem whether a general-purpose CPU with on-board communication to the control electronics would be sufficient to provide the required decoding throughput. We answer this question affirmatively for the first time by combining the recent parallel decoding schemes [31], [34] with an efficient in-house implementation of the Union-Find decoder and a recent implementation of the Minimum Weight Perfect Matching (MWPM) algorithm [21].

II. ARCHITECTURE DESIGN AND SYSTEM IMPLEMENTATION
A. Architecture Design
See Figure 2 for a block diagram of our system design. The MCU contains a dedicated CPU. Besides controlling the qubits via the electronics driver, the CPU can also execute classical tasks offloaded from the host PC, using dedicated cores labeled the classical computing unit (CCU). Such tasks naturally arise from logical quantum program execution, dynamic calibration, just-in-time compilation, etc. The offloading greatly shortens the communication latency with the QPU.
A quantum program generally comprises both quantum and classical components that collaborate to solve a problem. In Section IV, we will introduce our front-end language and the corresponding compilation support. However, the design and workflow of our system are not restricted to specific quantum programming languages. When a quantum program is executed on a host PC, the quantum subroutines and, depending on the implementation, potentially some classical subroutines will be sent to the MCU. The MCU then processes these quantum or quantum-classical hybrid tasks by issuing both classical and quantum instructions. Classical instructions are carried out on the dedicated CPU for classical control and computations, while quantum instructions are executed through requests to our in-house quantum electronics (IQE) driver. The IQE driver dispatches corresponding "IQE instructions", or "commands", to IQE, which in turn drives the quantum processor. At the CPU level, the "quantum instructions" are implemented as pseudo-instructions that expand to MMIO load/store instructions. These MMIO instructions interact with a special memory region, and the electronics driver decodes them and dispatches "electronics-level instructions", which will be explained shortly, through broadcasting for communication with the control electronics.
The electronics-level instructions specify the pulse sequences and their corresponding timing information to the control electronics. The latter parse the instructions and feed the pulse sequence information into a local queue. The pulse sequence is not played until a special "trigger" signal arrives at the control electronics, which then play the pulse sequence through their ports and empty the queue, waiting for the next round of pulse instructions. We use various "instruction" terminologies. For clarity, Figure 4 exhibits a taxonomy, with more details in the main text.

Fig. 2: Block diagram of the proposed classical architecture. The architecture consists of a host PC, a main control unit (MCU), and control electronics (in-house quantum electronics). The quantum chip is connected with the control electronics via drive lines, while the host PC, the MCU and the electronics are jointly connected via PXIe. Additionally, the MCU connects with all electronics via a star-like connection. A dedicated unit in the MCU is responsible for driving the control electronics. The MCU is equipped with a portable CPU, and a portion of it, called the classical computing unit, is allocated for run-time computation-heavy tasks such as syndrome decoding. In our implementation, the digital-analog and analog-digital units are made in-house and are called in-house quantum electronics (IQE), and their corresponding driver unit in the MCU is called the IQE driver. Please note that all other modules can be configured via the command parser; however, we have omitted the corresponding arrows in the diagram for the sake of simplicity.
In addition to the aforementioned general setup, a key feature of our architecture is a quantum instruction pipeline that naturally supports a large number of parallel repetitions of the same gate, and allows for easy reconfiguration. This is enabled jointly by several components, which we elaborate on below.
Reconfigurable quantum instruction set. Exploiting MMIO's flexibility, our modular quantum instruction set comprises "pulse-level instructions" for qubit control and calibration, and "gate-level instructions" for quantum circuit execution. Having both levels available allows for flexibility in implementing quantum algorithms and calibrating quantum devices, similar to other systems [7], [19], [42]. We distinctly exploit what we call the brickwork structure found in typical quantum circuits: there is a small number of single-layer sub-circuits of the form $\bigotimes_i G_{S_i}$, for some partition $\{S_i\}_i$ of either the whole set or a large subset of the qubits into equal-sized subsets, and an identical gate $G$ acting on each subset. We allocate different MMIO addresses for the partition identifiers, and specify the gate type via the message written to the address. Decoding and dispatching of the instruction are left to the underlying microarchitecture. This allows a lightweight specification of application-specific instructions on user-defined qubit partitions, which in turn significantly alleviates the cost of instruction issuing and data transmission.
Instruction dispatching via broadcasting. Some designs may prioritize certain aspects of instruction processing at the expense of others. For instance, adding complex instructions to reduce the instruction issuing rate can lead to more complex decoding and dispatching. Our microarchitecture support, by contrast, does not come with such hidden costs: it reduces costs at every stage of the instruction processing pipeline. When dispatching a single gate instruction to multiple control electronics, one-to-one communication would scale the cost linearly with the number of control electronics, impeding scalability. We avoid this problem by exploiting the few-distinct-partition property of the brickwork structure through the built-in broadcasting mechanism of the star-like connection.
Each signal transmitted from the electronics driver broadcasts automatically through the star-like connection, thus each IQE instruction is sent to a collection of control electronics simultaneously, regardless of whether a given electronic device is meant to be involved in the instruction. To utilize this, each electronic device holds a "partition mask" specifying which partitions it is in. Each IQE instruction broadcast from the electronics driver comes with a partition identifier. Upon receiving an instruction, each electronic device checks whether the partition identifier of the broadcast instruction matches one of the partition identifiers in its partition mask. If so, it proceeds to process the instruction; otherwise, it ignores it.
The MCU can thus issue at once the same instruction to all devices with a common partition identifier, realizing microarchitecture-level single-instruction-multiple-destination (SIMD). The partition masks are stored in local registration entry (REG) files on each electronic device. They are easily reconfigurable, either using PXIe between runs or in real-time via the same star-like connection.
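The partition-mask filtering described above can be sketched as follows. This is an illustrative model only: the bitmask encoding of the partition mask, the class and function names, and the command strings are assumptions, not the actual IQE register layout.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    partition_mask: int                       # assumption: bit k set => member of partition k
    queue: list = field(default_factory=list)

    def receive(self, partition_id: int, command: str) -> bool:
        """Keep the broadcast command only if this device is in the partition."""
        if self.partition_mask & (1 << partition_id):
            self.queue.append(command)
            return True
        return False


def broadcast(devices, partition_id, command):
    """One issue by the driver; every device sees it and filters locally."""
    accepted = []
    for device in devices:
        if device.receive(partition_id, command):
            accepted.append(device.name)
    return accepted


# Hypothetical devices: awg0 is in partitions 0 and 1, awg1 in 1, dig0 in 2.
devices = [Device("awg0", 0b011), Device("awg1", 0b010), Device("dig0", 0b100)]
print(broadcast(devices, 1, "play_cz"))   # awg0 and awg1 accept, dig0 ignores it
```

The loop here only simulates the physical broadcast; on the driver side a single instruction is issued regardless of how many devices match, and reconfiguring a partition amounts to rewriting `partition_mask`.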
Instruction decoding. To decode a quantum instruction received from the CPU, the electronics driver extracts the electronics-level instruction type determined by the value written through MMIO, and appends to it the partition identifier determined by the MMIO address. Both mappings are stored in a local REG file that can be reconfigured if necessary. The assembled electronics-level instruction is then dispatched through the broadcasting system mentioned before.
This microarchitecture does not incur extra overhead on either decoding or dispatching when the partition size increases (as in the case of more qubits).
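As a rough illustration of this decoding step, the sketch below models the two REG-file mappings as lookup tables. The base address, the address and value encodings, and the command names are hypothetical placeholders rather than the actual MMIO layout.

```python
MMIO_BASE = 0x4000_0000  # hypothetical base of the quantum MMIO region

# Hypothetical REG-file contents; both tables are reconfigurable.
ADDRESS_TO_PARTITION = {MMIO_BASE + 0x0: 0, MMIO_BASE + 0x4: 1}
VALUE_TO_COMMAND = {0x1: "PLAY_H", 0x2: "PLAY_CZ", 0x3: "MEASURE"}


def decode_store(address: int, value: int):
    """Turn one MMIO store into an (instruction type, partition identifier) pair."""
    partition_id = ADDRESS_TO_PARTITION[address]   # from the MMIO address
    command = VALUE_TO_COMMAND[value]              # from the value written
    return command, partition_id


print(decode_store(MMIO_BASE + 0x4, 0x2))
```

Decoding in this model is two table lookups per store, so its cost depends neither on how many devices belong to the partition nor on how many qubits they drive.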
Apart from the above quantum instruction pipeline problems, we highlight some design choices that address scaling.
Pulse synchronization via triggers. All of the ADCs and the DACs are driven in the same clock domain through a phase-locked loop and a star-like connection, with one rubidium oscillator used as the system root clock. Our design further synchronizes pulses on different control electronics via a dedicated trigger mechanism. The pulses are not played through DACs immediately upon processing of the electronics-level instructions, but rather are stored in a local queue. When a trigger instruction is issued from the MCU, the trigger signal arrives at each control electronic device at the same time, guaranteeing pulse-level synchronization. For further information on IQE, please refer to [41].
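The queue-then-trigger behaviour can be sketched with a small simulation. The class name, method names, and the abstract time stamp are assumptions made for illustration; real devices play analog waveforms rather than recording tuples.

```python
class ControlElectronics:
    """Toy model of one device: pulses are queued and only played on a trigger."""

    def __init__(self, name):
        self.name = name
        self.queue = []    # pulses received but not yet played
        self.played = []   # (trigger_time, pulse) records

    def enqueue(self, pulse):
        self.queue.append(pulse)

    def on_trigger(self, trigger_time):
        # Play everything queued so far against the shared trigger time, then empty the queue.
        self.played.extend((trigger_time, pulse) for pulse in self.queue)
        self.queue.clear()


def trigger_all(devices, trigger_time):
    """Model the trigger signal reaching every device at the same time."""
    for device in devices:
        device.on_trigger(trigger_time)


awg0, awg1 = ControlElectronics("awg0"), ControlElectronics("awg1")
awg0.enqueue("x90")          # instructions may arrive at different moments...
awg1.enqueue("cz")
trigger_all([awg0, awg1], trigger_time=100)
print(awg0.played, awg1.played)   # ...but both pulses share the same trigger time
```

Because playback is tied to the trigger rather than to instruction arrival, variations in when each device finishes parsing its instructions do not affect pulse alignment in this model.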
Portable, tightly integrated but loosely coupled dedicated CPU. Unlike previous schemes [7], [11], [19], [42] that handle the communication between the control electronics and the MCU through new and dedicated CPU instructions, ours avoids substantial CPU modifications and works solely with the unmodified classical instruction set. The MCU and the electronics driver are coupled only through MMIO instructions. This loose coupling provides portability and extensibility, as the same communication scheme can in principle be used with all CPUs supporting the same underlying ISA, or even other classical ISAs, with little to no modification. On the other hand, the dedicated CPU is tightly integrated with the control electronics through on-board communication, which significantly reduces the communication cost.

B. System implementation
Our design can in principle be implemented over various classical and quantum hardware. For our particular implementation, we assume room-temperature, as opposed to cryogenic, electronics as they are more widely deployed today. While there is no fundamental reason to prefer RISC-V, ARM, or other instruction set architectures, we choose RISC-V for its potential in future system evolution. For instance, we anticipate that integrating the required quantum instruction pipelines into the RISC-V IP core would face fewer restrictions thanks to its open licensing model.
Hardware setup. We implement a prototype by integrating a RISC-V IP core with our room-temperature electronics, which include a timing control module (TCM), four-channel AWGs, four-channel data acquisition modules, a local oscillator, amplifiers, mixers, and a high-precision voltage source. As mentioned above, the in-house AWGs and the digitizers are collectively referred to as IQE.
We implement a real-time digital signal processing system on the built-in FPGAs of the IQE, featuring precise timing control, arbitrary waveform generation, and parallel IQ demodulation for qubit state discrimination. The FPGA in the TCM serves as the master FPGA running the MCU and the IQE driver. The master FPGA communicates with the AWGs and the digitizers through high-speed digital backplane transmissions.
In the aforementioned configuration depicted in Figure 2, we have assumed an unrestricted number of connections in the star-like topology. We now explain how to scale up from chassis-based systems that have a limited number of connections. A standard chassis with 18 slots meets the requirements of control and readout for 10 qubits. In such a one-chassis PXIe system, the master FPGA with a soft RISC-V IP core is used to provide triggers and instructions to the other AWGs and digitizers. In order to control more qubits, the master FPGAs of multiple one-chassis PXIe systems can be interconnected through high-density connectors via a star-like expansion, as illustrated in Figure 3. Only one master FPGA needs to implement the soft RISC-V IP core as the MCU of the whole system. The MCU broadcasts the instructions to the master FPGAs of all those one-chassis PXIe subsystems through a daisy chain interface or star-like interface, and then each master FPGA broadcasts to the AWGs and digitizers in the same chassis.
We will now provide a comprehensive overview of the various types of instructions present in our architecture, including the IQE instructions as well as the RISC-V quantum instructions. Together they facilitate seamless control over the quantum processor.
IQE instructions. We specify the electronics-level instructions, or commands, broadcast by the IQE driver via the star-like connection, hereafter referred to as the "IQE instructions" for convenience (note that those "instructions" are not directly related to any CPU-level instructions). Currently, there are three types of IQE instructions, as summarized in Table I:
• "Trigger": As mentioned above, the "Trigger" instruction tells the IQE driver to actually start executing all
Quantum instruction set. We here present an instantiation of a modular set of pseudo-instructions at the CPU level, consisting of pulse-level instructions for device calibration and gate-level instructions for algorithm implementations. These pseudo-instructions are not implemented directly, but are subsequently expanded to RISC-V MMIO operations via built-in load/store instructions. We also provide an example MMIO layout compatible with existing RISC-V architectures.
Table II illustrates the current design of the quantum instruction set and its corresponding expansion into MMIO load/store instructions. The instruction set features a hierarchical design, consisting of pulse-level, gate-level, and application-specific instructions. Each higher-level instruction can be decomposed into lower-level instructions with the same functionality, but using higher-level instructions reduces the decoding and dispatching overhead.
• "Pulse-level Instructions" play, qwait and trig: specify pulses, their relative timing, and the issuance of the trigger signal, respectively. More precisely, play specifies the actual control pulse sequence, qwait specifies the scheduling of the corresponding pulses, and trig triggers the actual execution of previously issued instructions. Additionally, fmr loads the qubit measurement results from previous runs from the predetermined addresses.
menting essential features over surface code quantum computing using our architecture, to argue that known scalability challenges can be resolved through our design.
More specifically, we reach our conclusion by focusing on components involved in large-scale computation or communication, and analyzing how the incurred costs scale with the code distance and the quality of the quantum device. Our architecture design is not specific to surface-code-based quantum computing, and thus can be readily generalized to other quantum error correcting codes or fault-tolerant schemes.

A. Surface code quantum computation
The surface code encodes the logical information of one qubit into a patch of d × d physical qubits, such that any error happening on at most ⌊(d − 1)/2⌋ physical qubits can be detected through intermediate measurements and be corrected accordingly. A popular approach to realize logical Clifford operations for the surface code is "lattice surgery" [24]. Specifically, patches of logical qubits are arranged on a large grid, with additional physical qubits positioned in the "routing space" [8] between them. Lattice surgery then allows joint logical Pauli measurements over multiple patches, using interactions only between pairs of nearest-neighbor physical qubits.
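As a quick worked example of the two quantities just stated, the helper below evaluates the d × d patch size and the ⌊(d − 1)/2⌋ correctable-error bound for a few distances. The function name is illustrative, and the count covers only the d × d qubits mentioned in the text, not any routing-space qubits.

```python
def surface_code_patch(d: int) -> dict:
    """Patch size and correctable-error bound for code distance d."""
    return {
        "physical_qubits": d * d,            # d x d patch, as stated in the text
        "correctable_errors": (d - 1) // 2,  # floor((d - 1) / 2)
    }


for distance in (11, 47, 67):
    print(distance, surface_code_patch(distance))
```

For instance, d = 11 gives a patch of 121 physical qubits tolerating up to 5 errors, while d = 47 gives 2209 qubits tolerating up to 23.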
There are alternatives to lattice surgery for realizing logical operations on surface codes (see [6] and references therein). In this work, by "surface code quantum computation" (SCQC), we refer to the approach through lattice surgery.

Fig. 5: Schematic workflow supporting surface code quantum computation (SCQC). The shaded area illustrates the blurred boundary of "classical" and "quantum" architecture, the former being our main focus.
Figure 5 shows a schematic workflow of SCQC from the perspective of classical control. Upon receiving a pre-compiled quantum program, the MCU issues quantum instructions to an "instruction decoding and dispatching unit" (IDDU) through MMIO when needed. The IDDU then decodes the quantum instructions into pulse-level instructions readily executable on each of the electronics and dispatches them accordingly. The control electronics interact with the quantum hardware and return the raw measurement results to a dedicated memory region.
The incoming syndrome information is fed to and decoded by a classical firmware, called the "syndrome decoding unit" (SDU), that runs on the dedicated CPU. Once decoded, the logical measurement results are fed to the MCU for adaptive real-time generation of the future instructions required by fault-tolerant quantum computing. In our implementation, the IDDU, the electronics and the SDU correspond respectively to the IQE driver, the IQE and part of the CCU.
Two essential subroutines of the SCQC are the "quantum memory experiment" and the "Bell-state experiment". Their quantum circuits are illustrated in Figure 6. The quantum memory experiment benchmarks the capability of the classical architecture for preserving quantum information, and the Bell-state experiment benchmarks that for essential steps in lattice surgery. As SCQC comprises mostly these two components (in addition to the preparation of a physical magic state and a single-patch logical measurement), we use them to validate our architecture.
Besides real-time execution of quantum circuits with large-scale parallel gates, these SCQC subroutines also require fast processing of classical information in "syndrome decoding"."Syndromes" are mid-circuit measurement results indicating errors occurring during the FTQC process.To identify the actual errors and correct them, a dedicated syndrome decoder is needed to deduce the most likely error given the syndrome information.Ideally, the syndrome decoder needs to have a low error rate of inference, and be fast enough in order not to cause exponential syndrome backlog [36].Developing and implementing such a low-error, low-latency and high-throughput syndrome decoder is vital to experimental realization of faulttolerant quantum computation.

B. Validation of scalability
We first establish the feasibility of our design by implementing an end-to-end prototype quantum computing system, and validating that it functions properly with test qubit calibration programs. In addition, we examine the time variation ("jitter") of pulse control with increasing size of the star-like connection, confirming that our design admits scalable pulse synchronization. A low jitter ensures high-precision synchronization of pulses played on different AWG ports, and hence high-fidelity control.
To verify that our design resolves the instruction stress, we execute the aforementioned SCQC subroutines. We profile the running time of the classical controller against the running time of the quantum processor. The classical running time is estimated based on the instruction counts of an in-house CPU profiling tool over the QEMU RISC-V simulator. The running time of the quantum processor is estimated based on the previously reported running time of each operation on a comparable superconducting platform [30]. We also quantitatively analyze the cost of instruction decoding and dispatching. Although neither scales with system size under our architecture design, we quantitatively show that the bandwidth of the differential pairs [25] can easily afford parallel gate instruction dispatching even under our proof-of-concept ISA implementation.
Real-time classical decoding was previously a hard problem, and even dedicated hardware struggled to achieve real-time decoding for code distance d larger than 11 [5], [14], [38]. However, recent advances [21], [31], [34] have made real-time decoding much more realistic even on general-purpose CPUs. In particular, the sliding-window decoding schemes, introduced independently in [34] and [31], parallelize at scale: they split the decoding task evenly into an arbitrary number of parallel threads, with only a small constant overhead factor independent of the number of threads. We implement such a parallelized SDU on a RISC-V development board, and benchmark its throughput on increasing code distances. We also give a rough estimate of the bandwidth required for syndrome transmission from the IQE digitizers to the SDU, finding it unlikely to become a bottleneck for our architecture.

A. Real system demonstration
We implement a prototype system by integrating a RISC-V IP core with our room-temperature electronics, and a demo program to validate the end-to-end quantum computing system consisting of the prototype system, a quantum chip and a compilation toolchain. The demo program characterizes a qubit's relaxation time, i.e., the so-called T1 experiment. We compile an OpenQASM 3.0 front-end code to a RISC-V executable using our in-house compilation toolchain, and test its correctness both on a pulse-level quantum simulator and on our in-house superconducting quantum processor. The result of the physical experiment is shown in Figure 7, demonstrating a successful run of the calibration routine.

B. Scalability of maintaining high-fidelity quantum operation
We now evaluate the feasibility of high-fidelity quantum operations when the chassis-based system is scaled up using a star-like connection.
Skew and jitter, which are crucial for system synchronization, can directly affect the accuracy of quantum operations. Skew, caused by variations in electrical connection length, can usually be compensated for as it remains constant. Jitter, on the other hand, is a greater concern as its effects cannot be calibrated.
To verify the fidelity of quantum operations in a larger system, we set up a 5-layer IQE platform with one MCU, one AWG, and one digitizer in each layer. The main trigger and the root system clock are generated by the MCU in the first layer and transmitted to the MCU in the second layer, and so on for the subsequent layers. In each experiment, we pick the first layer and one other layer to test the jitter. The two AWG output channels from the chosen layers are then connected to a digitizer. We then use fixed-point phase analysis to calculate the jitter between these two signals, as a proxy to evaluate pulse synchronization in larger systems. The critical aspect to consider is whether the jitter varies with the layer distance.
The histograms in Figure 8 display the jitter performance at different layer distances. Our measurements of layer-to-layer jitter show that the standard deviation is approximately 6 ps, and that jitter does not increase with layer distance, indicating effective pulse synchronization within the system.
Based on the 5-layer results, we conclude that the synchronization imprecision of microwave pulses from control electronics across different layers due to phase jitter is minuscule and will not become a major bottleneck for quantum computation. With a standard chassis that has 18 slots, the star-like expansion scheme is capable of supporting up to $10^4 + 10^3 + 10^2 + 10 + 1 = 11111$ chassis and 111110 qubits, based on the reasonable assumption that a single chassis can drive 10 additional chassis.
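The chassis count above is a geometric sum over the layers of the star-like tree. The following minimal sketch reproduces the arithmetic; the function name is hypothetical, and the figure of 10 qubits per chassis is an assumption inferred from the quoted ratio 111110 / 11111 rather than a number stated in the text.

```python
def star_capacity(fanout, levels_below_root, qubits_per_chassis):
    """Count chassis and qubits in a star-like tree.

    fanout: number of additional chassis each chassis can drive.
    levels_below_root: number of layers below the root chassis.
    qubits_per_chassis: ASSUMPTION, inferred from the quoted totals.
    """
    # One root chassis, then fanout**k chassis in layer k.
    chassis = sum(fanout ** k for k in range(levels_below_root + 1))
    return chassis, chassis * qubits_per_chassis


# Five layers in total: the root plus four layers below it.
chassis, qubits = star_capacity(fanout=10, levels_below_root=4,
                                qubits_per_chassis=10)
print(chassis, qubits)
```

With a fan-out of 10 and five layers, this gives the 11111 chassis and 111110 qubits quoted above.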

C. Scalability of the instruction pipeline
With the MMIO-based custom instruction design, we can test custom CPU-level instructions with different levels of abstraction against the quantum hardware execution time.In particular, we consider the following hierarchy of custom instruction abstraction, illustrated in Figure 9.
In addition to pulse- and gate-level instructions, the abstraction includes parallel-gate-level and logical-level instructions. A logical-level instruction combines the operations within a "logical cycle" [26] into a single instruction. A "logical cycle" refers to a repeated structure with d copies of identical syndrome extraction sub-routines, each consisting of a constant number of layers of parallel operations with fixed patterns, with optional single-layer parallel operations before or after the repeats. Such a repeated structure is necessary for fault tolerance against measurement errors, and serves as a building block for the SCQC. In this case, the total number of instructions throughout a quantum application stays constant, regardless of the code distance, leaving more room for improvement when dealing with other scaling factors, such as the number of logical qubits.
We estimate the execution time of custom instructions at each abstraction level, scaling in code distance, in Figure 10. It can be seen that the logical-level instructions stay constant with respect to code distance, and the parallel-gate-level instructions scale linearly, albeit with a smaller coefficient compared to the quantum running time. The pulse-level instructions scale as $\Theta(d^3)$ and quickly grow into the millisecond region, making them infeasible for surface codes of reasonable size beyond a proof-of-principle demonstration.
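The scaling behaviour of the abstraction levels can be illustrated with a rough instruction count per logical cycle. This is only a sketch under stated assumptions: the function name, the per-qubit pulse constant, and the qubit count $2d^2 - 1$ (one rotated surface-code patch) are hypothetical choices for illustration, while the 15 parallel-gate-level instructions per syndrome extraction round is the figure used in this section.

```python
PARALLEL_INSTR_PER_ROUND = 15  # per syndrome extraction round, as used in this section


def instructions_per_logical_cycle(d, level, pulses_per_qubit_per_round=8):
    """Rough instruction count for one logical cycle (d syndrome rounds).

    pulses_per_qubit_per_round is a HYPOTHETICAL constant; only the
    asymptotic scaling in d is meant to be informative.
    """
    qubits = 2 * d * d - 1  # ASSUMPTION: data + ancilla qubits of one patch
    if level == "pulse":
        # One instruction per pulse: Theta(d^2) qubits times d rounds.
        return pulses_per_qubit_per_round * qubits * d
    if level == "parallel-gate":
        # One instruction per parallel layer: constant per round, d rounds.
        return PARALLEL_INSTR_PER_ROUND * d
    if level == "logical":
        # The whole logical cycle is a single instruction.
        return 1
    raise ValueError(f"unknown abstraction level: {level}")


for d in (3, 11, 33):
    print(d, [instructions_per_logical_cycle(d, lvl)
              for lvl in ("pulse", "parallel-gate", "logical")])
```

The cubic, linear, and constant growth in d mirrors the three regimes described above; the absolute numbers depend entirely on the assumed constants.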
For both the memory experiment and the Bell-state experiment, the majority of the quantum execution time is spent on syndrome extraction. Each syndrome cycle takes about 1 µs and requires 15 parallel-gate-level instructions. This requires a throughput of 0.96 Gbps on the differential pair, which is well below the theoretical limit of the bandwidth of differential pairs [25]. As this estimate does not scale with the code distance, the transmission of IQE instructions to the control electronics would not become a bottleneck for SCQC.
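The throughput figure can be checked with a one-line calculation. In the sketch below, the 64-bit instruction width is an assumption inferred from the quoted numbers (0.96 Gbps for 15 instructions per microsecond); it is not stated explicitly in this section, and the function name is hypothetical.

```python
def required_throughput_gbps(instr_per_cycle, bits_per_instr, cycle_time_us):
    """Throughput needed to ship all instructions of one syndrome cycle in time."""
    # Bits per microsecond is numerically equal to Mbps.
    mbps = instr_per_cycle * bits_per_instr / cycle_time_us
    return mbps / 1000.0


# 15 parallel-gate-level instructions per ~1 us syndrome cycle;
# bits_per_instr=64 is an ASSUMPTION inferred from the quoted 0.96 Gbps.
print(required_throughput_gbps(instr_per_cycle=15, bits_per_instr=64,
                               cycle_time_us=1.0))
```

Neither input depends on the code distance d, which is why the required throughput stays constant as the code grows.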

D. Scalability of the syndrome decoder
We implement an SDU with a parallel decoding firmware based on the "Sandwich Decoder" algorithm [31], [34]. This firmware splits the decoding task evenly into an arbitrary number of parallel threads, with a small constant overhead factor independent of the thread number, as long as the number of surface code rounds is sufficiently large.
We benchmark its throughput on the aforementioned SCQC subroutines. As a sanity check, we test the SDU implementation on a RISC-V development board, and observe an agreement in results with our QEMU simulator. Benchmarking results are also used to extrapolate the expected throughput if implementing the SDU on other RISC-V IPs.
In deploying the Sandwich Decoder over the integrated multi-core CPU, the underlying inner decoders are an in-house Union-Find implementation and a recent PyMatching v2 [21]. We make several implementation-level improvements to the efficiency of our Union-Find decoder.
To benchmark the performance of our SDU, we apply the Sandwich Decoder to the Bell-state experiment; this goes a step further than the memory experiment as in [34]. For simplicity, we assume that the routing space between the two logical qubits is small compared to the code distance d and use a single large window to cover the two-qubit measurement part in the overall decoder graph (see Figure 11). All other windows are the same windows used in memory experiments.
In our simulation experiments, we input the description of the large window as well as the weight of each edge in that window to the CCU, which randomly generates error syndromes before invoking the decoding module.
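To make the idea of splitting the decoding task into independent windows concrete, the following sketch partitions the syndrome rounds of a memory experiment along the time axis. It is a simplified reading of a sliding-window scheme, not the firmware described here: the function name and the core/buffer layout are assumptions, chosen so that the step size is (d + 1)/2 and each interior window spans three steps, matching the benchmark settings stated in the caption of Fig. 12.

```python
def sliding_windows(n_rounds, d):
    """Split n_rounds syndrome rounds into overlapping windows.

    ASSUMPTION: each window has a 'core' of one step, padded by a buffer
    of one step on each side and clipped at the time boundaries.
    Assumes odd code distance d.
    """
    step = (d + 1) // 2  # step size t + 1 = (d + 1) / 2
    windows = []
    for core_start in range(0, n_rounds, step):
        core_end = min(core_start + step, n_rounds)
        start = max(core_start - step, 0)      # buffer before the core
        end = min(core_end + step, n_rounds)   # buffer after the core
        windows.append({"window": (start, end), "core": (core_start, core_end)})
    return windows


for w in sliding_windows(n_rounds=12, d=5):
    print(w)
```

Because each window depends only on its own range of rounds, the windows can in principle be handed to separate threads, which is what allows the work to be spread over many cores.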
We use the benchmarking results on the development board, shown in Figure 12, to estimate the decoding time when implementing our SDU on various RISC-V SoCs. With Union-Find and PyMatching 2 as inner decoders, the implemented SDU can decode up to distances 13 and 31 on SiFive P650 [32], T-head C910, or comparable alternatives [10] (16 cores at 2.5 GHz), or 67 and 57 with ET-SoC-1 [16] (1088 cores at 1 GHz), all within just 1 microsecond for physical error rate p = 0.0001. Our evaluation of PyMatching 2 shows that its performance is constrained by the limited 1 GB memory available on the tested development board. We expect that PyMatching 2 could achieve even better results on a higher-end development board with a larger memory.
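The extrapolation from the development board to a target SoC can be sketched as a simple budget comparison. Decoding throughput in Fig. 12 is expressed as per-layer processing time on a single 1 GHz core; the sketch below assumes ideal linear scaling in core count and clock frequency, which is a simplification, and the function names and the example per-layer time are hypothetical.

```python
# (cores, clock in GHz) as quoted in the text.
SOCS = {
    "SiFive P650 / T-head C910 class": (16, 2.5),
    "ET-SoC-1": (1088, 1.0),
}


def per_layer_budget_us(cores, clock_ghz, cycle_us=1.0):
    """Single-core, 1 GHz equivalent time available per syndrome layer.

    ASSUMPTION: perfect linear scaling with cores and clock frequency.
    """
    return cores * clock_ghz * cycle_us


def is_feasible(per_layer_time_us, cores, clock_ghz, cycle_us=1.0):
    """True if the measured per-layer time fits into the SoC's budget."""
    return per_layer_time_us <= per_layer_budget_us(cores, clock_ghz, cycle_us)


measured_us = 35.0  # HYPOTHETICAL per-layer time, not a measured value
for name, (cores, ghz) in SOCS.items():
    print(name, per_layer_budget_us(cores, ghz), is_feasible(measured_us, cores, ghz))
```

Under this assumption, a code distance is within reach of a given SoC whenever its measured per-layer time falls below that SoC's budget.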
Besides the decoding throughput, another constraint for the decoder architecture is that the large amount of syndrome information generated throughout the SCQC process must not saturate the communication bandwidth. It is known that raw syndrome measurement results can quickly become a bandwidth bottleneck [14], and thus must be compressed. One approach is to record the "detection events", i.e., changes in a sequence of syndrome bits, rather than all the syndrome bits. For a quantum memory experiment with code distance d and n syndrome extraction cycles, each ancilla qubit needs to generate $p_{\text{detect}} \cdot n \log_2 n$ bits on average, assuming that each detection event happens with probability $p_{\text{detect}}$. Roughly, the bandwidth requirement would become 100 Mbps for $p_{\text{detect}} = 0.02$ and $n = d = 33$. This compression can be done on each digitizer separately before transmission to the IQE driver. More advanced compression algorithms may achieve a better compression rate, but may require joint processing of data from different digitizers. Such an algorithm can be placed in the IQE driver should there be a bottleneck in the MMIO bandwidth.
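The compression scheme and the resulting bandwidth estimate can be sketched as follows. The first function extracts detection events from one ancilla's syndrome record; the second evaluates the per-ancilla bit count given above. The function names are hypothetical, and two inputs are assumptions not stated in this paragraph: a patch has $d^2 - 1$ ancilla qubits (as in a rotated surface code), and each syndrome extraction cycle lasts about 1 µs (the figure used earlier for the instruction pipeline).

```python
import math


def detection_events(syndrome_bits, initial=0):
    """Rounds at which an ancilla's syndrome bit differs from the previous round.

    ASSUMPTION: the reference value before the first round is `initial`.
    """
    events = []
    prev = initial
    for round_index, bit in enumerate(syndrome_bits):
        if bit != prev:
            events.append(round_index)
        prev = bit
    return events


def compressed_bandwidth_mbps(d, n, p_detect, cycle_us=1.0):
    """Average bandwidth when each detection event is sent as a log2(n)-bit index."""
    ancillas = d * d - 1                      # ASSUMPTION: rotated surface-code patch
    bits_per_ancilla = p_detect * n * math.log2(n)
    total_bits = ancillas * bits_per_ancilla
    return total_bits / (n * cycle_us)        # bits per microsecond equals Mbps


print(detection_events([0, 0, 1, 1, 0, 1]))
print(round(compressed_bandwidth_mbps(d=33, n=33, p_detect=0.02), 1))
```

Under these assumptions the estimate comes out slightly above 100 Mbps, consistent with the rough figure quoted above.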

V. DISCUSSION AND OUTLOOK
We present a scalable design for the classical architecture of quantum computing. Our design aims towards easy scaling with no significant overhead. We evaluate its scalability on two basic subroutines over a prominent fault-tolerant scheme, and validate its practical feasibility with a prototype implementation.

Fig. 12: The average syndrome decoding throughput for the quantum memory experiment and the two-qubit joint measurement in the Bell-state experiment. Decoding throughput is defined as the average processing time per single layer of syndrome on a single core with 1 GHz master frequency. We experimentally benchmark the total running time of both experiments under different code distances and multiple runs, and deduce the per-layer running time. For ease of comparison, we set the step size as t + 1 = (d + 1)/2 and the window size as 3(t + 1) for both experiments. For the quantum memory experiments, the red dashed lines indicate the capability of the specific RISC-V SoCs, converted to the same scale as the experimental data. The data points below a certain dashed line indicate the feasibility of running the corresponding task on the corresponding SoC within 1 µs.
(a) Decoding throughput for the quantum memory experiment (dashed lines: SiFive P650 [32]/T-head C910 [10]; ET-SoC-1 [16]).
A natural next step is to implement the real system with quantum processors of a much larger scale than that in our study. The current design is estimated to scale up easily to thousands of qubits. This estimate is based on the size of the allocated MMIO addresses, the picosecond accuracy in the synchronization of the trigger signal across different electronics, and the physical size of the electronics stacks. Although most of the limiting factors can be lifted through a more careful design, it remains uncertain whether unforeseen problems may arise with a larger-scale quantum processor. A possible further scaling-up through modularization is to let each MCU control one or a few logical components, such as a single logical qubit or a patch of the routing space, and let an upper-level control unit issue logical instructions to these logical components while maintaining synchronization.
In this study, we assume room-temperature devices for their wide adoption at the time of writing. However, our design is in principle not limited to such devices, and may in particular work well for cryogenic electronics, such as cryo-CMOS [9] or single-flux-quantum [28], as long as the component functionalities can be implemented. Such demonstrations would be an interesting future direction.
Another important direction is to demonstrate the architecture on more sophisticated tasks than our two "toy-model" subroutines. Such experiments may lead to the discovery of currently unknown limiting factors for classical architecture in SCQC.
Classical architecture is just half of the story, as numerous challenges remain to be addressed in quantum architecture. Beyond the quantum processor's scale, hurdles such as input/output (I/O) management, interconnection, packaging, and heat and power dissipation must be overcome. Previous research in quantum architecture has often focused more on the feasibility of qubit control than on the potential demands of intensive classical computation. Conversely, studies on classical architecture have primarily examined the viability of specific classical computation tasks such as syndrome decoding, either in-fridge or out-of-fridge, under the bold assumption that high-fidelity qubit control can be realistically achieved. A holistic evaluation of the FTQC workflow, encompassing both classical and quantum architectures, will aid in identifying potential bottlenecks and determining the most effective steps to move forward.

Fig. 1: An experimental setup for qubit driving and measurement. The dilution refrigerator is depicted as the cyan box, with different temperature zones separated by dashed lines. The PC driving the control electronics is omitted.
Fig. 6: Illustration of the quantum memory and the Bell-state experiments. A memory experiment on a $d \times d$ lattice initiates $n$ rounds of syndrome extractions. A Bell measurement experiment on two patches of $d \times d$ lattices with routing space length $m$ initiates $n_1$ rounds of syndrome extractions on each patch, then initiates $n_2$ rounds of syndrome extractions on the joint patch by merging the two patches with the routing space, and finally initiates $n_3$ rounds of syndrome extractions on the two split patches. All data qubits are measured under the Z-basis before and after their corresponding syndrome extraction cycles.

Fig. 8: Histogram of the channel-to-channel jitter of two AWGs across chassis in different layers.
(a) Quantum memory experiment. The number of syndrome extraction rounds is set to $n = \frac{7}{2}(d + 1)$. (b) Bell-state experiment. The routing space length is set to $m = 3d$, and syndrome extraction rounds are set to $n_1 = n_2 = n_3 = d$.

Fig. 10: Comparison of the estimated classical execution time on the MCU against the quantum hardware execution time. The latter is estimated based on 20 ns for each single-qubit gate, 40 ns for each two-qubit gate, and 600 ns for each measurement and reset. The classical run time consists of two parts: 1) for classical instructions, the run time is estimated assuming a master frequency of 1 GHz for the MCU CPU, with cycle counts from our in-house CPU profiling tools; 2) quantum instructions are executed via expansion into RISC-V base instructions for MMIO, and it takes up to 17 cycles for each MMIO communication between the MCU and the IQE driver through the system bus. The run time is scaled piecewise to reflect the different running time scalings. The quantum execution time is marked separately with white hatches; note that the proportion of the quantum execution time versus the total classical execution time is distorted owing to the piecewise scaling.

Fig. 11: Illustration of a division of the overall three-dimensional decoder graph of the Bell-state experiment (Figure 6) into windows. Under the assumption of a small routing space, even though the center window is large, its size is still O(d) × O(d) × O(d). Note that the center window covers more layers than other windows; an alternative scheme (not depicted) is to divide the center window further so that every window covers the same number of layers, which gives rise to more complicated windows.
Fig. 12(b): Decoding throughput for the Bell-state experiment.

TABLE I: Summary of instructions to the IQE driver.

TABLE II: Summary of RISC-V pseudo-instructions designed for communicating with the AQE driver. ADDR_* are memory addresses that are determined at design time and thus are constant in the assembler.

TABLE III: An example of MMIO address layout.