Systems Architecture for Quantum Random Access Memory

Operating on the principles of quantum mechanics, quantum algorithms hold the promise of solving problems that are beyond the reach of the best-available classical algorithms. An integral part of realizing such speedup is the implementation of quantum queries, which read data into forms that quantum computers can process. Quantum random access memory (QRAM) is a promising architecture for realizing quantum queries. However, implementing QRAM in practice poses significant challenges, including query latency, memory capacity, and fault tolerance. In this paper, we propose the first end-to-end system architecture for QRAM. First, we introduce a novel QRAM that hybridizes two existing implementations and achieves asymptotically superior scaling in space (qubit number) and time (circuit depth). Like classical virtual memory, our construction enables queries to a virtual address space larger than what is actually available in hardware. Second, we present a compilation framework to synthesize, map, and schedule QRAM circuits on realistic hardware. For the first time, we demonstrate how to embed large-scale QRAM in 2D Euclidean space, such as a grid layout, with minimal routing overhead. Third, we show how to leverage the intrinsic biased-noise resilience of the proposed QRAM for implementation on either Noisy Intermediate-Scale Quantum (NISQ) or Fault-Tolerant Quantum Computing (FTQC) hardware. Finally, we validate these results numerically via both classical simulation and quantum hardware experimentation. Our novel Feynman-path-based simulator allows for efficient simulation of noisy QRAM circuits at a larger scale than previously possible. Collectively, our results outline the set of software and hardware controls needed to implement practical QRAM.


INTRODUCTION
Quantum computers hold the potential to solve problems that are beyond the reach of conventional digital computers. Such quantum speedup, as understood theoretically, arises from the utilization of quantum mechanical properties such as superposition and entanglement to process information more efficiently and rapidly [45]. Some of the most promising quantum computing applications include quantum searching [26], optimization problems [55], molecular simulation [21,39], data processing for machine learning [4,30], and cryptography [52]. For example, Grover's quantum algorithm [26] for searching an unordered database of size $N$ makes only $O(\sqrt{N})$ queries to the database. This is a quadratic (factor of $\sqrt{N}$) speedup over the best classical algorithms, which require $O(N)$ queries when given access to the same database.
Over the past three decades, technology for building quantum computing hardware has advanced steadily: prototypes of universal quantum processing units (QPUs) housing 100+ individually programmable qubits are becoming available for the first time, and there is great interest in practically realizing these quantum applications. The development of scalable quantum computers is still in its early stages. Current Noisy Intermediate-Scale Quantum (NISQ) hardware [50] is limited by its system size (number of qubits) and fidelity (coherent lifetime of qubits and error rates of quantum gates). Remarkable progress [8] has been made in improving the performance of QPUs through better quantum control, error correction architectures, and compiling and noise mitigation software.
One critical, yet largely missing, ingredient for realizing quantum speedup in practice is the implementation of quantum queries [4,45], which allow data to be loaded into quantum states that the QPU can process. While QPU architectures are designed to process data rapidly, they often cannot encode classical data efficiently or robustly. This inhibits the practical deployment of many quantum algorithms. This issue is known as the data input and output (I/O) bottleneck [4,27]. In terms of gate and qubit overhead, the cost of encoding a large set of data into quantum states can be prohibitively high and can dominate the cost of the quantum algorithm itself.
Figure 1 (caption fragment): … quantum random access memory (QRAM). There is a need for rethinking the systems architecture for QRAM because QRAM presents a distinct set of architectural constraints. A larger radial coordinate indicates a relatively more stringent requirement; our QRAM design alleviates requirements in multiple dimensions.
Traditional classical random access memory (RAM) [34] allows data stored in memory cells to be loaded rapidly into a central processing unit (CPU). Similarly, QRAM [23] has been proposed to enable quantum-mechanical loading of data from memory cells. This involves querying the QRAM simultaneously in a large superposition of different addresses. More precisely, as in classical RAM, a memory cell is accessed by specifying its address. That is, the data value $x_i$ is stored at address $i \in \{0, 1, \ldots, N-1\}$, where $N$ is the memory size, and a query performs
$$\sum_{i=0}^{N-1} \alpha_i \, |i\rangle_A |0\rangle_B \;\xrightarrow{\text{QRAM}}\; \sum_{i=0}^{N-1} \alpha_i \, |i\rangle_A |x_i\rangle_B, \qquad (2)$$
where $\alpha_i$ is the amplitude of each address in the superposition, and $|\cdot\rangle_A$ ($|\cdot\rangle_B$) is the address (bus) qubit register storing the input (output).
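To make the query map of Equation (2) concrete, the following is a minimal Python sketch that applies it to a sparse state vector. It assumes single-bit data $x_i \in \{0, 1\}$ and writes $x_i$ into the bus by XOR, a common convention that coincides with Equation (2) when the bus starts in $|0\rangle$; the function name `qram_query` is hypothetical.

```python
import math


def qram_query(state, data):
    """Apply the ideal query map of Eq. (2) to a sparse state vector.

    `state` maps (address, bus) basis labels to amplitudes and `data[i]`
    is the classical bit x_i. The bus is XOR-ed with x_i, which sends
    |i>|0> to |i>|x_i> as in Eq. (2) and keeps the map reversible.
    """
    out = {}
    for (addr, bus), amp in state.items():
        key = (addr, bus ^ data[addr])
        out[key] = out.get(key, 0) + amp
    return out


# Uniform superposition over N = 4 addresses with the bus in |0>.
data = [1, 0, 0, 1]
psi = {(i, 0): 1 / math.sqrt(len(data)) for i in range(len(data))}
print(qram_query(psi, data))
# {(0, 1): 0.5, (1, 0): 0.5, (2, 0): 0.5, (3, 1): 0.5}
```

Applying the map twice returns the original state, reflecting that the XOR convention makes the query reversible (unitary).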
While the design principles of QRAM are similar to those of classical RAM, there are unique challenges in scaling up QRAM in practice [22,27]: (i) Query Latency. Like a RAM device, a QRAM allows any data to be accessed in almost the same amount of time regardless of the specified address, including a superposition of all addresses at once. Naively entangling data from one memory cell at a time incurs latency that scales with $N$ in the worst case. This latency can translate to a slowdown in the application, impeding its practical deployment. (ii) Memory Capacity. Quantum algorithms, in principle, offer greater quantum speedup when given access to large memory. However, existing QRAM architectures require a rapidly growing number of gates or ancillary qubits as the memory capacity scales up. (iii) Fault Tolerance. Errors in the QRAM (e.g., due to various types of noise in the circuit) could seriously impact its utility and, in many cases, eliminate the quantum advantage of an algorithm altogether [51]. As such, it is critical to guarantee the error robustness of QRAM, through either intrinsic noise resilience [28] or error correction.
We propose a general-purpose QRAM architecture to address these challenges, drawing insights from classical RAM, quantum compiling, and quantum error correction. Specifically, we make the following novel contributions:
(1) Goal: Small QRAM; large virtual memory.
Solution: We propose a new practical architecture that provides a virtual address space that can exceed the capacity of the physical QRAM. By hybridizing two previous query architectures [2,4], we achieve asymptotic savings in space and time. As a result, our architecture enables queries to large memories that cannot be accomplished by either architecture alone.
(2) Goal: Embedding QRAM on 2D hardware.
Solution: QRAM requires strongly entangling $m$ address qubits with data from $O(2^m)$ memory cells. Whether QRAM can be efficiently embedded on practical (sparsely connected) two-dimensional hardware is highly non-trivial. For the first time, we provide a positive answer to this question: we present a constructive mapping of QRAM on 2D lattice architectures with minimal routing/communication overhead.
(3) Goal: Noise-robustness of QRAM.
Solution: We show that our small-scale QRAM can be implemented on current hardware or on near-term hardware with moderately improved error rates. We also demonstrate that small error correction codes allow us to substantially scale up QRAM with low overhead.
The rest of the article is organized as follows. In Sec. 2, we review the background on quantum compiling and architecture designs for QRAM. Sec. 3 introduces our new hybrid (virtual) QRAM architecture. In Sec. 4, we present the algorithm for mapping QRAM on realistic hardware. In Sec. 5, we analyze the biased-noise resilience of our circuit and leverage it to reduce error correction overhead. Finally, we validate the results via classical simulation and quantum hardware experiments in Sec. 6 and Sec. 7. In Sec. 7, we also compare the resource usage of different QRAMs and show an asymptotic scaling advantage.

BACKGROUND
2.1 Principles of Quantum Computing
In quantum computing, a quantum bit (qubit for short) is the fundamental computing unit. Unlike its classical counterpart, a qubit can be in a superposition state, that is, a linear combination of 0 and 1. In Dirac notation, $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$, where $\alpha, \beta$ are complex coefficients satisfying $|\alpha|^2 + |\beta|^2 = 1$, and $|0\rangle = [1\ 0]^T$ and $|1\rangle = [0\ 1]^T$ are the computational basis vectors. In quantum algorithms, quantum logic gates are used to manipulate the state of the qubits. Common quantum logic gates include the single-qubit gates X, Z, H, S, and T, and the multi-qubit gates CX, Toffoli, and CSWAP used throughout this paper.
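As a small illustration of the notation above, the sketch below represents a single-qubit state as the pair $[\alpha, \beta]$ and gates as $2 \times 2$ matrices in plain Python, without any quantum library; the helper names are illustrative only.

```python
import math

# Gates as 2x2 matrices (lists of rows) acting on [alpha, beta].
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]


def apply(gate, state):
    """Return gate @ state for a single-qubit state [alpha, beta]."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]


def is_normalized(state, tol=1e-12):
    """Check |alpha|^2 + |beta|^2 = 1 up to floating-point tolerance."""
    return abs(sum(abs(a) ** 2 for a in state) - 1) < tol


ket0 = [1, 0]
print(apply(X, ket0))          # X|0> = |1>, i.e. [0, 1]
plus = apply(H, ket0)          # (|0> + |1>)/sqrt(2)
print(is_normalized(plus))     # True
print(apply(Z, plus))          # (|0> - |1>)/sqrt(2)
```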

Quantum Compiling
A quantum compiler transforms a high-level quantum program or mathematical algorithm into a sequence of native instructions that the hardware backend recognizes [11,13]. The transformed quantum circuit must be logically equivalent, resource-efficient, and robust to hardware noise. Due to strict architectural constraints, compiler optimization will have to break traditional abstractions across the software stack and be adapted to algorithmic and device characteristics. We now highlight several important transformation passes in a quantum compiler.

Gate Synthesis.
One of the first steps in quantum compiling is to decompose high-level unitaries into a sequence of native gates from an instruction set [29]. A common instruction set is the Clifford+T gate set, such as {H, S, CX, T}. When quantum queries are implemented with these gates in practice, the resulting circuits can introduce latency into quantum algorithms if their depth is too high.
As such, it is preferable to implement QRAM natively with a tailored gate set, such as the classical reversible gates, including X, CX (controlled-X), Toffoli (double-controlled-X), MCX (multi-controlled-X), and CSWAP (controlled-SWAP) gates. Otherwise, the multi-qubit gates can be decomposed into Clifford+T gates. For example, a CSWAP gate can be decomposed into a circuit of depth 12 and T-depth 3, with no ancillae required [1,18].
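The sketch below checks, on classical basis states, one standard way to build CSWAP from the reversible gates listed above: a Toffoli sandwiched between two CX gates. It is offered as background under our own naming (`cswap_via_toffoli` is hypothetical); combined with an ancilla-free Toffoli decomposition of T-depth 3, it is consistent with the T-depth 3 quoted for CSWAP.

```python
from itertools import product


def cx(bits, control, target):
    """Controlled-X on a classical basis state (tuple of bits)."""
    out = list(bits)
    out[target] ^= out[control]
    return tuple(out)


def toffoli(bits, c1, c2, target):
    """Double-controlled-X (Toffoli) on a classical basis state."""
    out = list(bits)
    out[target] ^= out[c1] & out[c2]
    return tuple(out)


def cswap(bits, control, a, b):
    """Reference CSWAP: exchange bits a and b when the control is 1."""
    out = list(bits)
    if out[control]:
        out[a], out[b] = out[b], out[a]
    return tuple(out)


def cswap_via_toffoli(bits, control, a, b):
    """CSWAP as CX(b -> a), then Toffoli(control, a -> b), then CX(b -> a)."""
    bits = cx(bits, b, a)
    bits = toffoli(bits, control, a, b)
    return cx(bits, b, a)


# All three gates permute basis states, so agreeing on the 8 basis states
# (control, a, b) means the two circuits implement the same unitary.
assert all(cswap(s, 0, 1, 2) == cswap_via_toffoli(s, 0, 1, 2)
           for s in product((0, 1), repeat=3))
```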

Qubit Mapping and Routing.
To execute a quantum circuit, another critical step is to map all logical qubits onto the physical hardware.Current NISQ hardware has a limited number of qubits and restrictive qubit connectivity.Only adjacent qubits can interact with each other, while interactions between distant qubits are resolved via routing qubits closer to each other.Common routing strategies include physically moving qubits (e.g., for trapped ions) [57] or logically swapping qubits (e.g., for superconducting circuits) [44].Different routing strategies would incur different routing overhead, in terms of the number of additional operations.Future Fault-Tolerant Quantum Computing (FTQC) hardware will have similar constraints.For example, in surface codes, logical qubits can be laid out in a 2D grid topology.Logical gates and qubit routing can be accomplished via lattice surgery [33].
Although the general qubit mapping problem has been proven to be NP-hard [41], it is often useful to leverage information about the circuit and the hardware to improve the quality of a mapping strategy: a well-structured circuit can be easier to map and route [32,43,59]. Moreover, hardware-noise-aware mapping strategies can enhance circuit performance significantly [3,14,37,44,54].
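To illustrate routing overhead in the SWAP-based setting, here is a minimal sketch under a simplified model in which only one of the two qubits moves on a square nearest-neighbor grid. Under this assumption, bringing two qubits at Manhattan distance $d$ together costs $d - 1$ SWAP gates; the function name is hypothetical.

```python
def swap_route(p, q):
    """SWAP path that brings the qubit at grid site p next to the qubit at q.

    Simplified model: only the qubit at p moves, first along rows and then
    along columns, one nearest-neighbor SWAP per step. Returns the list of
    SWAPs as pairs of grid sites.
    """
    (r, c), (r2, c2) = p, q
    swaps = []
    while abs(r - r2) + abs(c - c2) > 1:
        if r != r2:
            nxt = (r + (1 if r2 > r else -1), c)
        else:
            nxt = (r, c + (1 if c2 > c else -1))
        swaps.append(((r, c), nxt))
        r, c = nxt
    return swaps


route = swap_route((0, 0), (3, 4))
print(len(route))   # Manhattan distance 7, so 6 SWAPs
```

Each SWAP is itself typically compiled into three CX gates, so in this model the routing overhead grows linearly with distance.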

Quantum Query Architectures
QRAM is an integral part of quantum computer architecture, as it enables quantum computers to efficiently encode classical data into a quantum state for the QPU to process. As shown in Equation 2, a QRAM read operation therefore involves properly entangling the data $x_i$ with the corresponding address $i$ in the input superposition.
General-Purpose versus Domain-Specific. Query architectures can be categorized into two classes: (1) General-purpose (GP) architectures that load arbitrary data values $x_i$ for an arbitrary address superposition $\sum_i \alpha_i |i\rangle$. (2) Domain-specific (DS) architectures that implement only particular data function(s) $f(i) = x_i$. DS architectures are useful when running applications in a target domain, as they are highly tailored to maximize efficiency or fault tolerance.
We now describe the leading designs and implementations of QRAM, namely gate-based and router-based architectures.

Gate-Based Architectures.
Conventional wisdom is to use a partition of a universal QPU to implement the functionality of QRAM. As such, many proposals involve synthesizing the QRAM operation using a sequence of quantum logic gates and optionally using ancillary qubits. We provide two examples of these gate-based architectures below: Sequential Query Circuits and Reversible Logic Circuits.
Sequential Query Circuit (SQC): A quantum query can be implemented by a quantum circuit consisting of sequential MCX (multi-controlled-X) gates with no ancillary qubits. In the literature, the SQC is also known as a basic query circuit (BQC) or quantum read-only memory (QROM) [2]. Figure 2c provides a simple example of such a circuit. In this circuit, a sequence of $N$ MCX gates is applied to query a memory of size $N$, where each gate has $\log N$ controls on all the address qubits and one target on the so-called bus qubit that holds the queried data. Each MCX gate is responsible for loading the data stored at one corresponding address, and the full query is realized by iterating sequentially over all possible addresses. The SQC is a general-purpose architecture, capable of querying any function $f(i) = x_i$ of the memory cell data. It uses $O(\log N)$ qubits and has $O(N)$ query latency.
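The following sketch classically simulates the SQC on computational-basis addresses to show both its correctness and its $O(N)$ sequential gate count. It assumes single-bit data and skips the (identity) gate for cells with $x_j = 0$; the function `sqc_query` is a hypothetical helper, not code from this work.

```python
def sqc_query(address, data, m):
    """Classically simulate the SQC on a computational-basis address.

    There is one MCX per memory cell j. Its m controls fire only when the
    address register equals j (X gates on the zero bits of j are folded
    in), and it flips the bus only if x_j = 1, so cells with x_j = 0 need
    no gate. Returns (bus value, number of MCX gates applied); the gate
    count reaches N = 2^m in the worst case (all x_j = 1).
    """
    bus, gates = 0, 0
    for j in range(2 ** m):            # sequential iteration over addresses
        if data[j] == 0:
            continue
        gates += 1
        if all(((address >> b) & 1) == ((j >> b) & 1) for b in range(m)):
            bus ^= 1                   # MCX target: the bus qubit
    return bus, gates


data = [0, 1, 1, 0, 1, 0, 0, 1]        # N = 8 cells, m = 3 address bits
print([sqc_query(a, data, 3)[0] for a in range(8)])   # [0, 1, 1, 0, 1, 0, 0, 1]
```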
Reversible Logic Circuit (RLC): An alternative implementation of quantum queries directly on a QPU using quantum gates is through an RLC. When the function that computes the data value is known, that is, $x_i = f(i)$, one can synthesize this function directly with classical reversible gates (such as X, CX, and Toffoli) and ancillary qubits. Because different circuits are required to implement different functions, an RLC implements a domain-specific query. Classical reversible circuits have been shown to be easier to optimize and verify than generic quantum circuits [15,49]. However, as with any DS architecture, the circuits must be synthesized and optimized for each domain application [19]. This is useful if we want to build a quantum computer to support a particular application. For example, the modular exponentiation step in Shor's algorithm [52] can be implemented by either an RLC or a hand-optimized quantum circuit [20].
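For contrast with the SQC, the sketch below shows a hypothetical domain-specific query: if the data happens to be the parity function $f(i) = i_0 \oplus \cdots \oplus i_{m-1}$, an RLC needs only $m$ CX gates rather than one MCX per memory cell. The choice of parity is our own illustrative assumption.

```python
def rlc_parity_query(address, m):
    """Hypothetical domain-specific query for f(i) = parity of the bits of i.

    The reversible circuit is just m CX gates, one from each address qubit
    onto the bus, so its size is O(m) instead of O(2^m).
    """
    bus = 0
    for b in range(m):
        bus ^= (address >> b) & 1      # CX(address qubit b -> bus)
    return bus


print([rlc_parity_query(i, 3) for i in range(8)])   # [0, 1, 1, 0, 1, 0, 0, 1]
```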

Router-Based Architectures.
To minimize query latency, several router-based architectures have been proposed. The defining feature of these general-purpose architectures is some form of quantum routing, wherein quantum data is routed to multiple different locations.
Fanout QRAM [45]: Fanout QRAM is the first architecture to achieve an $O(\log N)$-latency query. It arranges quantum routers in a binary tree, recursively using the outputs of a parent router as the inputs of its children routers. A query is implemented in two stages: address loading and data retrieval. During address loading, all routers are first initialized in $|0\rangle$; then a series of CX gates entangle the address and the routers. In particular, all $2^\ell$ routers at level $\ell$ of the tree are flipped from $|0\rangle$ to $|1\rangle$ conditioned on the $\ell$th address qubit, resulting in the preparation of Greenberger-Horne-Zeilinger-like (GHZ) states [25] across each level of the tree. During data retrieval, a bus qubit is routed down from the root of the tree, following the path indicated by the routers' states. Once it reaches the bottom of the tree, classically controlled gates copy the data $x_i = f(i)$ into the state of the bus. Subsequently, the bus qubit is routed back out of the tree, and the routers are returned to the all-$|0\rangle$ state by uncomputation. Fanout QRAM's $O(\log N)$ latency results from the fact that both address loading and data retrieval can be implemented with only $O(\log N)$-depth circuits. However, Fanout QRAM has been shown to suffer from decoherence problems due to the high entanglement of GHZ states [28].
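The sketch below models a Fanout QRAM query on a computational-basis address, tracking only classical router values rather than the full quantum state. Assumptions: single-bit data, address bits consumed most-significant-bit first from the root, and a hypothetical function name.

```python
def fanout_query(address, data, m):
    """Classical basis-state sketch of a Fanout QRAM query.

    Address loading: every router at level lvl (2^lvl routers) is set to
    address bit lvl, taken most-significant-bit first, as if by CX gates.
    Data retrieval: the bus walks down from the root, turning left on
    router state 0 and right on 1, and reads the cell at the leaf.
    Returns (bus value, number of routers left in state 1).
    """
    bits = [(address >> (m - 1 - lvl)) & 1 for lvl in range(m)]
    routers = [[bits[lvl]] * (2 ** lvl) for lvl in range(m)]
    node = 0
    for lvl in range(m):
        node = 2 * node + routers[lvl][node]   # follow the router state
    bus = data[node]                           # classically controlled copy
    flipped = sum(sum(level) for level in routers)
    return bus, flipped


data = [0, 1, 1, 0, 1, 0, 0, 1]
print(fanout_query(6, data, 3))   # (0, 3): every router at levels 0 and 1 flipped
```

The second return value shows that a single address bit flips an entire level of routers, which is the origin of the GHZ-like entanglement noted above.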
Bucket-Brigade QRAM [22,23]: Bucket-Brigade QRAM improves on Fanout QRAM by modifying the address loading stage to reduce the entanglement among the routers, as shown in Figure 2e. Instead of using CX gates to entangle the address and routers, in Bucket-Brigade QRAM the address qubits are themselves routed into the tree, with the states of earlier address qubits controlling the routing of later ones. The resultant state of the routers is more akin to a W state [9] than a GHZ state. The former has less entanglement entropy, which has been shown to greatly reduce the sensitivity of Bucket-Brigade QRAM to noise and errors [28]. Importantly, $O(\log N)$ query latency is still achievable with this improvement, making it a competitive query architecture for the NISQ era.
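To contrast with Fanout QRAM, in which every router at level $\ell$ is entangled with the $\ell$th address qubit (all $2^m - 1$ routers in total), the sketch below lists which routers receive an address qubit in a simplified bucket-brigade model: only the $m$ routers on the queried path. It tracks a single basis address and ignores the quantum details of the idle (wait) state; the function name is hypothetical.

```python
def bucket_brigade_path(address, m):
    """Routers that receive an address qubit in a simplified bucket-brigade
    model: address bit lvl is routed to the single active router at level
    lvl on the queried path; every other router stays in its idle state.
    Returns (list of (level, router index, stored bit), leaf index).
    """
    path, node = [], 0
    for lvl in range(m):
        bit = (address >> (m - 1 - lvl)) & 1   # MSB first, as in the tree
        path.append((lvl, node, bit))
        node = 2 * node + bit
    return path, node


path, leaf = bucket_brigade_path(6, 3)
print(path, leaf)   # [(0, 0, 1), (1, 1, 1), (2, 3, 0)] 6
```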

Other Architectures.
Other constructions of quantum queries have been proposed. For example, Select-Swap QRAM [40] can be viewed as a combination of the gate-based and router-based architectures. Select-Swap QRAM employs a two-stage approach. During the first stage, one sequentially iterates over all possible states of a subset of the address qubits, loading the corresponding block of data for each. During the second stage, the remaining address qubits are used to route this data through a network of CSWAP gates, moving the queried data to a definite location. The sequential iteration in the first stage is analogous to the gate-based SQC, while the coherent routing in the second stage is reminiscent of the router-based architectures.
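A classical basis-state sketch of the two Select-Swap stages described above follows. Assumptions: the low $q$ address bits drive the CSWAP network over $K = 2^q$ registers, the high bits select the block, and the uncomputation steps are omitted; `select_swap_query` and its parameters are hypothetical.

```python
def select_swap_query(address, data, q):
    """Classical basis-state sketch of a Select-Swap query.

    Select stage: iterate over the values h of the high address bits and
    load block h (K = 2^q words) into the registers when h matches.
    Swap stage: one CSWAP layer per low address bit moves register `low`
    to position 0, where it can be copied out.
    """
    K = 2 ** q
    high, low = address >> q, address & (K - 1)
    regs = [0] * K
    for h in range(len(data) // K):        # select: sequential over blocks
        if h == high:                      # controlled on the high bits
            regs = list(data[h * K:(h + 1) * K])
    for k in reversed(range(q)):           # swap: highest low bit first
        if (low >> k) & 1:
            for j in range(2 ** k):
                regs[j], regs[j + 2 ** k] = regs[j + 2 ** k], regs[j]
    return regs[0]


data = list(range(100, 116))               # N = 16 words
print(select_swap_query(7, data, 2))       # 107
```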
As another example, arbitrary state preparation algorithms can sometimes be used as a subroutine in QRAM. Existing work includes the general unitary synthesis method [53,60] and the parameterized circuit method [24,48]. General unitary synthesis is more complex than router-based QRAM, requiring specific gates for each classical data set, which makes it difficult to change the classical data set. The parameterized circuit is a popular NISQ-era approach that has been proposed to realize approximate quantum queries with $O(1)$ depth and $O(\log N)$ qubits, at the price of long training times and approximate queries.
Figure 3: Overview of the proposed virtual QRAM architecture ($k = 1$, $m = 2$) and its interaction with the QPU. The QPU qubits are swapped into the buffer for a quantum query and returned to the QPU once the query is done by the QRAM.

PROPOSED QRAM ARCHITECTURE
This work demonstrates an end-to-end architecture design for QRAM. Prior works, including Bucket-Brigade QRAM and Fanout QRAM, incur unaffordable resource overheads when querying a large database. Our proposed system architecture is built upon a novel router-based construct that allows us to query a virtual address space larger than what is physically available, as shown in Section 3.1. We also provide a series of optimizations in Section 3.2 that significantly cut down both space and time costs. Our new design decouples the address-loading stage from the data-retrieval stage of a quantum query and optimizes each stage in depth. As a result, we achieve an asymptotic advantage over prior work (shown in Section 7).

Two-Stage Query Overview
We introduce a novel router-based QRAM using $O(M)$ qubits, where $M = 2^m$. We define the address width $m$ as the number of bits needed to specify an address and the capacity $M$ as the size of the address space. For illustration purposes, we include an example architecture in Figure 3, a step-by-step query procedure in Figure 4, and the detailed circuits in Figure 5. As in other router-based QRAM architectures, our construction consists of two stages: address loading and data retrieval. We explain these two stages in turn below.
3.1.1 Stage 1: Address Loading. The task in the address loading stage is to route the $m$ address qubits into the small QRAM. For this purpose, we follow the approach of other router-based QRAM architectures and arrange a collection of quantum routers in a binary tree structure. We affix a layer of $M$ data qubits to the outputs of the routers at the lowest level of the tree, with each data qubit corresponding to one of the $M$ classical memory cells. These data qubits will be used to facilitate operations during the data retrieval stage.
The address-loading stage follows the conventional Bucket-Brigade procedure: the address qubits are sequentially routed into the tree, with the states of earlier address qubits controlling the routing of later ones. The only difference is that, in our scheme, the data qubits are subsequently prepared in a special state conditioned on the states of the routers (Figure 4a). Specifically, when address $i$ is queried, the $i$th data qubit is flipped from $|0\rangle$ to $|1\rangle$. This flip is implemented via a collection of CX gates, with the bottom layer of routers as controls and the data qubits as targets (see the query state preparation in Fig. 5a).

Stage 2: Data Retrieval.
We propose a novel data-retrieval stage, which, at a high level, performs data compression from the bottom data qubits to the root data qubit in the QRAM. First, $M$ classically controlled gates act on the bottom of the tree to write $|x_i\rangle$ onto the data qubits when address $|i\rangle$ is queried. Specifically, each data qubit is paired with an ancillary qubit initialized in $|0\rangle$. We refer to the two-qubit system of a data qubit and its ancilla as a data node. Then, conditioned on $x_i$, a SWAP gate is applied between the two physical qubits in a data node. This has the effect of encoding the classical data in a dual-rail encoding, i.e., $x_i = 0$ (resp. $x_i = 1$) is encoded as $|10\rangle$ ($|01\rangle$). Fig. 4(b) shows the resultant state.
Then, an array of CX gates (Figure 5b) is used to propagate this data up to the root node of the tree, as shown schematically in Fig. 4c. Next, the data at the root node is copied to the bus qubit using an MCX gate conditioned on the remaining (high-order) address qubits. After that, the CX array is applied again to uncompute and disentangle the QRAM qubits, returning the QRAM to its pre-retrieval state for the next data retrieval stage (Fig. 4d). We can then repeat the data retrieval stage for the next segment (page) of the memory, after swapping the new segment into the bottom of the tree. Crucially, in this new data-retrieval step, the only non-Clifford gate involved is the MCX gate.
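The two stages described above can be sketched on computational-basis inputs as follows. This is a classical model for one page, not the quantum circuit itself: we assume single-bit data and model the CX array of Figure 5b as a pairwise XOR tree over the ancillas, which yields the queried bit at the root because only one leaf is flagged. The function name `retrieve_page` is hypothetical.

```python
def retrieve_page(address, page_data, m):
    """Classical basis-state sketch of one page of virtual-QRAM retrieval.

    Stage 1: address loading leaves a one-hot flag on the queried leaf.
    Stage 2: each data node (data qubit, ancilla) starts as (flag, 0); a
    SWAP controlled by the classical bit x_j gives the dual-rail code
    |10> (x = 0) or |01> (x = 1) at the queried leaf. The CX array is
    modeled (an assumption about Fig. 5b) as a pairwise XOR tree over the
    ancillas, so the root holds their parity, i.e. x at the queried leaf.
    """
    flags = [1 if j == address else 0 for j in range(2 ** m)]
    nodes = []
    for f, x in zip(flags, page_data):
        d, a = f, 0
        if x:
            d, a = a, d                    # classically controlled SWAP
        nodes.append((d, a))
    level = [a for _, a in nodes]
    while len(level) > 1:                  # CX array, one tree level per step
        level = [level[2 * i] ^ level[2 * i + 1] for i in range(len(level) // 2)]
    return nodes, level[0]


nodes, root = retrieve_page(2, [1, 0, 1, 1], 2)
print(nodes, root)   # [(0, 0), (0, 0), (0, 1), (0, 0)] 1
```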

Putting It Together: Virtual QRAM. The goal is to implement quantum query access to a memory of capacity $N = 2^n$, where $n > m$. That is, we consider a practical scenario where $O(2^n)$ qubits are not physically available. Can we still implement the query, and if so, at what cost? This scenario resembles classical memory architecture design, where "virtual memory" allows a small physical RAM to access a large address space by swapping segments (also known as pages) of memory from disk storage [36]. Our "virtual QRAM" design follows precisely this intuition: we implement a virtual QRAM that allows a small QRAM with $O(M)$ qubits to access a large address space of size $O(N)$ by swapping segments of classical memory, as shown in the right panel of Fig. 3. To accomplish this, we need to design a QRAM architecture that allows us to query segments of memory coherently.
We first partition the full size-$N$ classical memory into $K = 2^k$ contiguous segments (pages), each of which has $M = 2^m$ memory cells. This is equivalent to viewing the original $n$-bit address as two parts: the most significant $k$ bits (which we call the SQC width) and the least significant $m$ bits (which we call the QRAM width), where $k + m = n$. Our design hybridizes the gate-based SQC and router-based QRAM, where a quantum query consists of six basic steps: (a) loading the $m$ address bits into the QRAM, (b) preparing the leaf data qubits for data retrieval, (c) retrieving data to the root qubit, (d) uncomputing the data retrieval, (e) repeating (b)-(d) for each segment of classical memory, and (f) uncomputing the address loading. More details are illustrated in Figure 4. A distinctive feature of our design is the "load-once" property: our method loads the $m$ address qubits into the QRAM only once at the beginning (and once at the end for uncomputation), as shown in Figure 5c, whereas in a previous design from [28] the $m$ address qubits need to be loaded $2^k$ times. This is one of the major sources of gate savings in our method. We present pseudocode describing the entire query procedure in Algorithm 1. Note that, for illustration purposes, this algorithm does not include the optimization techniques introduced in Section 3.2.
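Putting the steps together, the sketch below is a classical basis-state model of a virtual QRAM query with page iteration. It is not Algorithm 1 itself: it assumes single-bit data, folds the dual-rail loading and the CX array into a single parity computation per page, and uses a hypothetical function name. The QRAM address qubits are loaded once, outside the page loop, mirroring the load-once property.

```python
def virtual_qram_query(address, data, m):
    """Classical basis-state sketch of a virtual QRAM query, steps (a)-(f).

    `data` holds N = 2^n cells; the physical QRAM holds one page of
    M = 2^m cells. The high k = n - m address bits are handled SQC-style
    by iterating over the K = 2^k pages.
    """
    M = 2 ** m
    K = len(data) // M
    high, low = divmod(address, M)              # SQC part, QRAM part
    # (a)-(b): load the m low address bits once; one-hot flag on leaf `low`.
    flags = [1 if j == low else 0 for j in range(M)]
    bus = 0
    for page in range(K):                       # (e): repeat (b)-(d) per page
        page_data = data[page * M:(page + 1) * M]
        # (b)-(c): dual-rail load, then the ancilla parity reaches the root.
        root = 0
        for f, x in zip(flags, page_data):
            root ^= f & x
        # MCX on the high address bits: copy only when they equal `page`.
        if high == page:
            bus ^= root
        # (d): the SWAPs and CX array are undone; nothing to track classically.
    # (f): address unloading mirrors (a) and needs no further load.
    return bus


data = [0, 1, 1, 0, 1, 0, 0, 1]
print([virtual_qram_query(a, data, 2) for a in range(8)])   # [0, 1, 1, 0, 1, 0, 0, 1]
```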

Implementation and Optimizations
3.2.1 Key Optimization 1: Address Qubit Recycling. As shown in Figure 4, the internal router qubits (orange nodes) are not used during steps (b)-(e). By recycling these qubits in place of data qubits, we do not need any data qubits (blue nodes) for the quantum routers internal to the tree. Figure 5 illustrates that the data retrieval stage reuses router qubits for copying data via the CX array. As such, in diagrams such as Figure 6c and Figure 7, we do not draw blue data qubits for internal quantum routers.

Key Optimization 2: Lazy Data Swapping. As shown in the data-retrieval stage in Figure 5b, classical data is loaded and unloaded sequentially for each segment of classical memory. We observe that if the classical data at the corresponding memory address on the next page equals the current one, unloading and reloading the data qubits is redundant. Instead, by computing $x'_j = x_j \oplus x_{j+2^m}$, a SWAP for the next classical data $x_{j+2^m}$ is needed only when $x'_j = 1$. At the final data-retrieval stage, each data node is unloaded with a single classical value, namely the accumulated XOR of its initially loaded value and all subsequent updates. Adopting this technique, named lazy data swapping, saves $O(2^{n-1})$ SWAP gates in the average case, since the classical data on the next page equals the current data with probability $p = 0.5$, assuming a uniform distribution of the classical data $x_j$.
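To make the SWAP savings concrete, the sketch below counts classically controlled SWAPs with and without lazy data swapping, and checks that after each page update the data nodes encode that page's data. It assumes single-bit data and that naive loading and unloading each cost one SWAP per cell with $x_j = 1$; `count_swaps` is a hypothetical helper.

```python
def count_swaps(data, m, lazy):
    """Count classically controlled SWAPs on the data nodes over all pages.

    Naive: every page loads x_j (SWAP if x_j = 1) and later unloads it
    (SWAP again if x_j = 1). Lazy: load page 0, then SWAP only where
    x'_j = x_j XOR x_{j+2^m} = 1, and unload once after the last page.
    """
    M = 2 ** m
    K = len(data) // M
    if not lazy:
        return 2 * sum(data)
    state = list(data[:M])                  # SWAP applied where x_j = 1
    swaps = sum(state)
    for p in range(1, K):
        nxt = data[p * M:(p + 1) * M]
        for j in range(M):
            if state[j] ^ nxt[j]:           # x'_j = 1: toggle this node
                state[j] ^= 1
                swaps += 1
        assert state == nxt                 # nodes now encode page p
    return swaps + sum(state)               # final unload


d = [1, 1, 0, 1, 1, 0, 0, 1]                # N = 8, M = 4, two pages
print(count_swaps(d, 2, lazy=False), count_swaps(d, 2, lazy=True))   # 10 6
```

For uniformly random data, the naive count averages $N$ SWAPs while the lazy count averages $(N + M)/2$, a saving of about $(N - M)/2$ SWAPs, i.e., on the order of $N/2$.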

Key Optimization 3: Address Pipelining. With pipelining, we can reduce the depth of address loading from $O(m^2)$ to $O(m)$. In the naive approach to address loading, address qubits are routed into the tree sequentially, with the $(\ell+1)$th address qubit waiting to be routed until the $\ell$th address qubit has reached its destination at level $\ell$ of the tree. The total routing time is thus $\sim \sum_{\ell=1}^{m} \ell = O(m^2)$. Instead, the addresses can be routed in a pipelined manner: the $(\ell+1)$th address qubit is routed into the tree immediately after the $\ell$th qubit has been routed one layer down, i.e., without waiting. Removing this waiting reduces the depth to $O(m)$. While the resulting circuit is equivalent to the parallel schedules introduced in [28] and [12], we identify the origin of such parallelism as pipelining.
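The depth reduction can be checked with a simple timing model, which is our assumption rather than the exact gate schedule: routing the $\ell$th address qubit down to level $\ell$ takes $\ell$ time steps, and with pipelining the $(\ell+1)$th qubit enters the tree one step after the $\ell$th. The function name is hypothetical.

```python
def address_loading_depth(m, pipelined):
    """Address-loading depth (m >= 1) under a simplified timing model:
    qubit lvl needs lvl steps to reach level lvl of the tree."""
    if not pipelined:
        return sum(range(1, m + 1))          # sequential: 1 + 2 + ... + m
    # Pipelined: qubit lvl starts at step lvl - 1, then travels lvl steps.
    return max((lvl - 1) + lvl for lvl in range(1, m + 1))


print(address_loading_depth(10, False), address_loading_depth(10, True))   # 55 19
```

Under this model the pipelined depth is $2m - 1$, compared with $m(m+1)/2$ for sequential loading.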

MAPPING QRAM IN 2 DIMENSIONS
4.1 Mapping and Routing: Challenges
Qubit mapping and routing are important steps in a quantum compiler to implement a quantum algorithm on hardware.Routing overhead refers to the number of operations needed to execute a gate operation on two (possibly physically distant) qubits.This overhead can cause a significant increase in the query latency of the quantum algorithm.
Mapping a QRAM of capacity $M$ is particularly challenging because it requires mapping and entangling $O(M)$ qubits. The tree-like structure of router-based QRAM means that the $\ell$th layer involves $O(2^\ell)$ qubits and multi-qubit gates. Keeping every router qubit in layer $\ell$ equidistant from its parent router in layer $\ell - 1$ and its children routers in layer $\ell + 1$ is naturally possible only in hyperbolic geometry. When a tree is embedded in 2D Euclidean space, the root ($\ell = 0$) of the tree can be far from the next layer down ($\ell = 1$) because the two subtrees are large.
Our research shows that QRAM can be embedded in a 2D nearest-neighbor grid without incurring asymptotic routing overhead. In other words, we can map the QRAM circuit on a 2D grid and route the qubits without increasing the $O(\log M)$ depth of the original circuit. This is achieved by combining a mapping strategy based on topological minor graph embedding (Sec 4.2) with a routing method based on teleportation (Sec 4.3).

Mapping QRAM via H-Tree Recursion
To map QRAM onto a 2D architecture, we need to find an embedding of a binary tree in the connectivity graph of the hardware. In addition, we require the embedding to be a topological minor graph embedding. This allows us to implement the teleportation-based routing method by ensuring that no routing qubit carries any logical information. Given a simple, undirected graph $G$, another graph $H$ is a topological minor of $G$ if some subdivision of $H$ is a subgraph of $G$; equivalently, $H$ can be obtained from $G$ by deleting edges and vertices and by contracting edges that have an endpoint of degree two.
We reduce the problem of mapping QRAM to embedding a complete binary tree in a 2D grid. The problem of embedding complete binary trees into grids has been extensively studied in classical VLSI design [16,31,38,46]. The H-tree recursion is the first efficient mapping strategy, introduced by [7]. In Figure 6a, we present the optimal embedding of $T_2$ (the binary tree of a QRAM with address width 2) into Grid(3, 3) by H-tree recursion, which is also the base case for the recursion. The embedding involves three QRAM router qubits, one unused qubit, one routing qubit, and four data qubits. Note the distinction between a router qubit (in QRAM) and a routing qubit (for teleportation). This design ensures that the root QRAM qubit can route to the border of the grid. Recursively, given an embedding of $T_m$ into Grid($s$, $s$), we can construct an embedding of $T_{m+2}$ into Grid($2s+1$, $2s+1$), as shown in Figure 6. For odd address widths, we can cut the grid in half, making it a rectangular Grid($2s+1$, $s$) that embeds a $T_{m+1}$ QRAM.
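The recursion above fixes the grid size, which the short sketch below computes. It assumes $T_m$ denotes the router tree of a QRAM with address width $m$ and follows only the even-$m$ square-grid recursion; `htree_grid_side` is a hypothetical helper.

```python
def htree_grid_side(m):
    """Side s of the square Grid(s, s) holding T_m via the H-tree recursion:
    T_2 fits in Grid(3, 3), and T_{m+2} fits in Grid(2s + 1, 2s + 1)."""
    if m < 2 or m % 2:
        raise ValueError("this recursion covers even address widths m >= 2")
    s = 3                        # base case: T_2 in Grid(3, 3)
    for _ in range((m - 2) // 2):
        s = 2 * s + 1            # one H-tree recursion step
    return s


for m in (2, 4, 6, 8):
    s = htree_grid_side(m)
    print(m, s, s * s, 2 ** m)   # address width, side, grid qubits, leaves
```

The side length is $s = 2^{m/2+1} - 1$, so the grid holds fewer than $4 \cdot 2^m$ qubits, a constant factor times the number of leaves.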

Routing via Teleportation
Our teleportation routing method is based on a technique called entanglement swapping [47], commonly used in the context of quantum repeater networks [6,56]. If the intermediate qubits between two logical qubits are unused, they can be used as ancillae (routing qubits) to perform teleportation, as shown in Figure 6d and 6e. Local Einstein-Podolsky-Rosen (EPR) pair preparations and Bell State Measurements (BSMs) are performed in parallel. As such, we can teleport a qubit over a long routing distance with a constant-depth circuit. Each QRAM operation (e.g., a remote CSWAP) can thus be implemented in $O(1)$ steps, regardless of the routing distance.
As a result, our embedding from Sec 4.2 is optimal in terms of routing latency. Since teleportation adds only $O(1)$ depth to each gate, the overall QRAM circuit depth remains $O(\log M)$. Although the H-tree is efficient enough to provide optimal query latency, further optimizations are provided by [16,38,46], which find improved versions of the H-tree recursion. With these, we can embed QRAM into a (constant-factor) denser grid.

NOISE-ROBUST IMPLEMENTATIONS
QRAM is highly susceptible to noise and errors, which can cause information loss and reduce the fidelity of the states it produces. Thus, quantum error mitigation (QEM) or quantum error correction (QEC) is essential for improving the reliability and accuracy of QRAM operations. This is critical for the success of many quantum algorithms and for the development of fault-tolerant quantum computation.
Prior work [28] revealed intrinsic noise resilience in Bucket-Brigade QRAM by carefully analyzing error propagation in QRAM circuits. We observe that virtual QRAM shares a similar intrinsic resilience to biased noise, meaning that Z errors in virtual QRAM are confined to local qubits and do not propagate through the entire circuit, even without any active quantum error correction or mitigation. To quantify the noise resilience of virtual QRAM, we define the query fidelity for a single query $|\psi_{\text{in}}\rangle$ as $F = \langle \psi'_{\text{out}} | \rho_{\text{out}} | \psi'_{\text{out}} \rangle$, where $\rho_{\text{out}}$ is the actual (noisy) output and $|\psi'_{\text{out}}\rangle$ is the expected (ideal) output.
With respect to this definition of query fidelity, we will show that in virtual QRAM the query fidelity is lower-bounded for an arbitrary $|\psi_{\text{in}}\rangle$. The infidelity scales polynomially with the address width $m$, rather than with the overall tree size $2^m$.

Biased-Noise Resilience Analysis
To construct our Z-biased noise model, we assume that each qubit is subject to the phase-flip channel $\rho \to \rho' = (1-\epsilon)\rho + \epsilon Z \rho Z$. Equivalently, a Z error is applied to each qubit with probability $\epsilon$. We show that, under this qubit-based error channel, the query fidelity of the QRAM part of virtual QRAM (excluding the SQC) is lower bounded by $1 - O(\epsilon m^2)$, where $m = \log N$ is the address width of the QRAM.

We first outline our methodology. As in the approach of [28], the locality of the noise prevents errors from propagating throughout the entire QRAM, which protects the overall query fidelity. Fig. 7 illustrates this effect: by the commutation relations of quantum gates, a Z error on the control qubit of a subsequent CX gate never propagates to the target qubit. The same property holds for quantum routers built from CSWAP gates, which ensures error locality during the address loading stage. However, virtual QRAM cannot prevent other Pauli errors, such as X and Y errors, from propagating. We will show that the fidelities under Z and X error channels differ exponentially with respect to the QRAM width m.
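The locality argument rests on how Pauli errors move through CX gates. The Python sketch below illustrates this using the standard binary (symplectic) representation of Pauli operators, with global phases ignored. It propagates a single-qubit error through one CX gate. A Z on the control stays where it is, while an X on the control spreads to the target, matching the behavior described for Fig. 7. The helper names `propagate_through_cx` and `label` are our own.

```python
# Pauli on n qubits: map qubit index -> (x bit, z bit); absent qubits are identity.
# Phases are ignored, so Y is simply (1, 1).
LABELS = {(0, 0): "I", (1, 0): "X", (0, 1): "Z", (1, 1): "Y"}


def propagate_through_cx(pauli: dict, control: int, target: int) -> dict:
    """Return P' = CX P CX (up to phase): an error P applied before the CX
    is equivalent to P' applied after it."""
    xc, zc = pauli.get(control, (0, 0))
    xt, zt = pauli.get(target, (0, 0))
    out = dict(pauli)
    out[control] = (xc, zc ^ zt)  # a Z on the target kicks back onto the control
    out[target] = (xt ^ xc, zt)   # an X on the control is copied onto the target
    return {q: v for q, v in out.items() if v != (0, 0)}


def label(pauli: dict, n: int) -> str:
    return "".join(LABELS[pauli.get(q, (0, 0))] for q in range(n))


# Qubit 0 is the control, qubit 1 the target.
cases = {"Z on control": {0: (0, 1)}, "Z on target": {1: (0, 1)},
         "X on control": {0: (1, 0)}, "X on target": {1: (1, 0)}}
for name, err in cases.items():
    print(f"{name:13s} -> {label(propagate_through_cx(err, 0, 1), 2)}")
```

The asymmetry between the first and third cases is what keeps Z errors local while letting X errors spread from controls to targets.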
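Consider the computational basis states $|0\rangle, \ldots, |2^{m+k}-1\rangle$ corresponding to different memory addresses, and write the input as $|\psi_{\text{in}}\rangle = \sum_i \alpha_i |i\rangle$. For a query $Q$, let $G \subseteq \{0, \ldots, 2^{m+k}-1\}$ be the subset of addresses $i$ on which $Q$ behaves ideally, i.e., there are no Z errors on any router along the path of the branch corresponding to $i$.

The noisy output then consists of the ideal branches $i \in G$ plus a state $|\phi\rangle$ that is a superposition of all non-ideal branches (those with Z errors). Since $|\phi\rangle$ is normalized, the query fidelity can be bounded from below in terms of the total weight $\sum_{i \in G} |\alpha_i|^2$ of the ideal branches. Thus, it suffices to show that this weight is sufficiently large in expectation, i.e., that most branches behave ideally.

Because our error model acts on every router independently of which router it is, any two index sets $S$ and $S'$ of the same size are equally likely to coincide with $G$. Hence every address belongs to $G$ with the same probability. The expected weight $\mathbb{E}\left[\sum_{i \in G} |\alpha_i|^2\right]$, where the expectation is taken over all possible errors, therefore equals the probability that a single branch behaves ideally.

Since a Z error does not propagate up the tree to destroy other branches, a branch behaves ideally whenever all the routers on its path are error-free. Since each branch contains m routers, the probability that a branch behaves ideally is $(1-\epsilon)^{m^2}$. Combining these two facts yields a query-fidelity lower bound of $1 - O(\epsilon m^2)$, where the final inequality is valid for $m \geq 1$ and $\epsilon \geq 0$.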
This proves the fidelity bound in Equation (3). Strictly speaking, if the dual-rail encoding introduced in Section 3 is adopted, the router qubits and data qubits are duplicated, which doubles the number of error locations in each router. The derivation above nevertheless remains valid, because the locality of Z errors does not depend on the choice of encoding. Using the same methodology, we obtain a bound for dual-rail-encoded virtual QRAM that differs only by a constant factor.

In contrast, the circuit has no noise resilience against X errors. Any single X error propagates to the root qubit of the QRAM, completely destroying the query state, so an ideal output requires every qubit in the QRAM to be error-free. As such, for an X error channel with error probability $\epsilon$, the query fidelity is lower bounded by $1 - 8\epsilon \cdot 2^m$. The infidelity therefore grows with the total number of qubits, i.e., exponentially in m.

As with X errors in the QRAM part, any single Pauli error in the SQC is fatal to the query fidelity. Consequently, for an SQC of width k, the query fidelity under arbitrary Pauli errors with error rate $\epsilon$ is lower bounded by $1 - O(\epsilon \cdot 2^k)$. Combining these bounds, a virtual QRAM with QRAM address width m and SQC width k has an overall query fidelity lower bounded by $1 - O(\epsilon m^2) - O(\epsilon 2^k)$ for Z errors and $1 - O(\epsilon 2^m) - O(\epsilon 2^k)$ for X errors.

Additionally, our biased-noise analysis extends readily, up to a constant factor, to a gate-based error channel in which errors are applied randomly to quantum gates via Monte Carlo sampling. Since each branch of the QRAM intersects at most $O(m)$ gates, the fidelity lower bound has the same asymptotic scaling as under the qubit-based noise model.
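A quick numerical illustration of the exponential gap between Z and X errors follows. The sketch evaluates three quantities at $\epsilon = 10^{-3}$: a per-branch success probability of $(1-\epsilon)^{m^2}$ for Z errors (assumed here), its linearized form $1 - \epsilon m^2$, and the X-error bound $1 - 8\epsilon \cdot 2^m$. Constants hidden in the $O(\cdot)$ terms and the SQC contribution are left out. The function names are illustrative only.

```python
def z_branch_success(eps: float, m: int) -> float:
    """Probability that one branch sees no Z error, assuming (1 - eps)^(m^2)."""
    return (1 - eps) ** (m * m)


def z_linear_bound(eps: float, m: int) -> float:
    """Linearized (Bernoulli) lower bound 1 - eps * m^2 on the quantity above."""
    return 1 - eps * m * m


def x_bound(eps: float, m: int) -> float:
    """X-error fidelity lower bound 1 - 8 * eps * 2^m (vacuous once negative)."""
    return 1 - 8 * eps * 2 ** m


eps = 1e-3
for m in range(2, 13, 2):
    print(f"m={m:2d}  Z: {z_branch_success(eps, m):.3f} "
          f"(>= {z_linear_bound(eps, m):.3f})  X: {max(x_bound(eps, m), 0.0):.3f}")
```

Already at m = 8, the X-error bound becomes vacuous, while the Z-error quantities stay above 0.9.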

Asymmetric Error Correction
In this section, we explore a fault-tolerant implementation of virtual QRAM on future quantum hardware. It uses the rectangular surface code to combine error-correction design with the intrinsic noise resilience of QRAM. We assume that physical qubits have unbiased X and Z error rates, whereas a logical qubit encoded in a rectangular surface code exhibits a biased error rate. By using such biased-error surface-code qubits in virtual QRAM, we can balance the fidelity loss due to X and Z errors.
Because the query-fidelity lower bounds under X and Z error channels differ, the surface code must be chosen carefully for each part of virtual QRAM. First, the logical X and Z error rates depend on the physical error rate, the surface code threshold, and the code distances $d_X$ and $d_Z$ [5]. To balance the logical error rates of the X and Z error channels, we take the fidelity bounds derived above for X and Z errors and equate the resulting X- and Z-induced infidelities. This determines how to choose the lengths $d_X$ and $d_Z$ of the rectangular surface code for each qubit in the QRAM. Since the SQC has no biased-noise resilience, we encode its k address qubits using the regular square surface code, which achieves full protection for the entire virtual QRAM.
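The balancing rule can be illustrated with a rough calculation, but every constant in it is an assumption rather than an expression from [5]:

- the logical error rate of a distance-d surface code follows the common heuristic $p_L \approx 0.1\,(p/p_{th})^{(d+1)/2}$, with threshold $p_{th} = 1\%$;
- logical X and Z error rates are weighted by infidelity factors of $8 \cdot 2^m$ and $m^2$, respectively;
- the helper names are hypothetical.

The sketch splits a target infidelity budget evenly between the two error types and returns the smallest odd distances $d_X$ and $d_Z$ that meet it.

```python
def logical_rate(p: float, d: int, p_th: float = 1e-2, prefactor: float = 0.1) -> float:
    """Heuristic logical error rate of a distance-d surface code (assumed form)."""
    return prefactor * (p / p_th) ** ((d + 1) / 2)


def min_distance(p: float, weight: float, budget: float, p_th: float = 1e-2) -> int:
    """Smallest odd distance d with weight * logical_rate(p, d) <= budget."""
    if p >= p_th:
        raise ValueError("physical error rate must be below threshold")
    d = 3
    while weight * logical_rate(p, d, p_th) > budget:
        d += 2
    return d


def rectangular_distances(m: int, p: float, budget: float):
    """Choose (d_X, d_Z) so that X- and Z-induced infidelities each stay under budget/2."""
    d_x = min_distance(p, 8 * 2 ** m, budget / 2)  # X errors: infidelity ~ 8 * 2^m * p_X
    d_z = min_distance(p, m * m, budget / 2)       # Z errors: infidelity ~ m^2 * p_Z
    return d_x, d_z


for m in (4, 8, 12, 16):
    d_x, d_z = rectangular_distances(m, p=1e-3, budget=1e-2)
    print(f"m={m:2d}  d_X={d_x}  d_Z={d_z}")
```

Under these assumptions, $d_X$ grows roughly linearly with m, while $d_Z$ grows only logarithmically. This is the asymmetry that makes a rectangular rather than square patch attractive for the QRAM part.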

EVALUATION METHODOLOGY
Baseline Architectures
We theoretically analyze the performance of the new circuit and simulate it in comparison with two baseline architectures described in Sec 2: BB (Bucket-Brigade QRAM) and SS (Select-Swap QRAM). These two architectures represent the state of the art in QRAM design. This holds both for quantum resource consumption and for circuit performance, including circuit depth and noise resilience.

Feynman Path Simulation
We use a classical simulation technique called Feynman-path simulation (FPS) to efficiently compute the results of QRAM queries. In FPS, each memory address corresponds to one path. Although the number of paths and the running time scale exponentially with the address width, FPS can still efficiently simulate and analyze larger QRAMs thanks to the following property. QRAM circuits are constructed from a small, fixed set of classical-reversible gates, none of which maps a single computational basis state to a superposition of basis states. As a consequence, the number of stored paths does not grow exponentially with circuit depth and instead remains constant. Pauli gates share this property (up to a phase), so noisy QRAM simulations run with negligible overhead. Compared with the special-purpose simulator in [28], our Feynman-path implementation is the first general-purpose QRAM simulator. It can handle arbitrary inputs (e.g., memory capacity, address state, and noise model).
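The following Python sketch illustrates the idea behind FPS; it is not our simulator. A state is stored as a dictionary from basis states to amplitudes. Every gate is a classical-reversible map that sends one basis state to one basis state and a phase. The example runs a toy sequential query built from multi-controlled X gates, then injects a Pauli Z error; the number of stored paths never changes. The gate helpers, the data vector, and the qubit layout are hypothetical.

```python
from math import isclose

# A Feynman-path state: basis state (int bitmask, bit q = qubit q) -> amplitude.


def bit(basis: int, q: int) -> int:
    return (basis >> q) & 1


def apply(state: dict, gate) -> dict:
    """Apply a classical-reversible gate, given as basis -> (new_basis, phase)."""
    out = {}
    for basis, amp in state.items():
        new_basis, phase = gate(basis)
        out[new_basis] = out.get(new_basis, 0) + amp * phase
    return out


def mcx(controls: dict, target: int):
    """Flip `target` iff every control qubit q holds the value controls[q]."""
    def gate(basis):
        if all(bit(basis, q) == v for q, v in controls.items()):
            return basis ^ (1 << target), 1
        return basis, 1
    return gate


def cswap(control: int, a: int, b: int):
    """Swap qubits a and b iff the control qubit is 1 (a quantum-router primitive)."""
    def gate(basis):
        if bit(basis, control) and bit(basis, a) != bit(basis, b):
            return basis ^ (1 << a) ^ (1 << b), 1
        return basis, 1
    return gate


def pauli_z(q: int):
    """Z is diagonal: it keeps the basis state and only contributes a sign."""
    return lambda basis: (basis, -1 if bit(basis, q) else 1)


def sequential_query(state: dict, data: list, n_addr: int, bus: int) -> dict:
    """Toy sequential query: one MCX onto the bus per address whose data bit is 1."""
    for i, x in enumerate(data):
        if x:
            state = apply(state, mcx({q: (i >> q) & 1 for q in range(n_addr)}, bus))
    return state


def fidelity(ideal: dict, noisy: dict) -> float:
    overlap = sum(amp.conjugate() * noisy.get(b, 0) for b, amp in ideal.items())
    return abs(overlap) ** 2


# Two address qubits (0, 1) in uniform superposition; bus qubit 2 starts in |0>.
psi_in = {i: 0.5 for i in range(4)}
ideal = sequential_query(psi_in, data=[1, 0, 1, 1], n_addr=2, bus=2)
noisy = apply(ideal, pauli_z(2))  # a Z error on the bus after the query
print(len(psi_in), len(ideal), len(noisy))  # -> 4 4 4
print(isclose(fidelity(ideal, ideal), 1.0), fidelity(ideal, noisy))  # -> True 0.25
```

Because no gate ever branches a path, memory stays proportional to the number of addresses regardless of circuit depth. A non-classical gate such as a Hadamard would break this.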

Experimental Setup
Noise-free and Pauli-noise circuit simulations run on a single core of a single node, and the largest simulations use 1.5 MB of RAM. For each QRAM address width m, we execute 1024 shots to estimate the average fidelity, using a gate-based error model applied via Monte Carlo sampling. We assume devices with 2D square-grid connectivity, which is common in both NISQ and FTQC architectures.
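The Monte Carlo procedure can be sketched as follows. The sketch assumes a gate-based model in which each gate independently suffers a uniformly random Pauli error (X, Y, or Z) with probability $\epsilon$. The sampler and the per-shot fidelity callback are hypothetical stand-ins for a full circuit simulation. The toy callback treats any error as fatal, so for G gates the estimate should approach $(1-\epsilon)^G$.

```python
import random
from statistics import mean


def sample_gate_errors(num_gates: int, eps: float, rng: random.Random):
    """One Monte Carlo shot: list of (gate index, Pauli) pairs that were hit."""
    return [(g, rng.choice("XYZ")) for g in range(num_gates) if rng.random() < eps]


def monte_carlo_fidelity(num_gates, eps, shot_fidelity, shots=1024, seed=0):
    """Average per-shot fidelity over `shots` independently sampled error patterns."""
    rng = random.Random(seed)
    return mean(shot_fidelity(sample_gate_errors(num_gates, eps, rng))
                for _ in range(shots))


def all_errors_fatal(errors):
    # Worst case, as for X errors in the QRAM part: any error destroys the query.
    return 0.0 if errors else 1.0


estimate = monte_carlo_fidelity(num_gates=100, eps=1e-2, shot_fidelity=all_errors_fatal)
print(f"estimate={estimate:.3f}  exact={(1 - 1e-2) ** 100:.3f}")
```

In a real run, `shot_fidelity` would simulate the noisy circuit for the sampled error pattern. With 1024 shots, the statistical error of the average is on the order of a few percent at most.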

QRAM Resource Estimation
Table 1 lists the improvements contributed by each optimization technique from Section 3. Table 2 presents a comprehensive comparison of several current QRAM architectures, including concrete parameters such as circuit depth, qubit count, and T depth. Notably, the asymptotic scaling in the table shows that all three architectures have the same qubit-count scaling. SQC+BB (Baseline B) is a load-multiple-times architecture, which suffers from an exponential $O(2^k)$ overhead in T depth and T count. SQC+SS (Baseline S) is a load-once architecture. However, its swap network is less efficient than router-based QRAM because it lacks the pipelining strategy for address loading discussed in Section 3. Consequently, the circuit depth and Clifford depth of SQC+SS are quadratically larger than those of our new QRAM architecture, by a factor of $O(m^2)$. Therefore, our virtual QRAM outperforms, or at least matches, the state-of-the-art QRAM architectures in every resource count.

Mapping and Routing Overhead
Figure 8 presents the results of our constructive mapping strategy, counting the extra operation depth induced by connectivity constraints in NISQ hardware. When applied to QRAM, conventional SWAP-based routing incurs an exponential extra SWAP-depth overhead and loses the logarithmic scaling of query depth with memory size. In contrast, teleportation-based routing consistently outperforms SWAP-based routing, with an exponential advantage in extra operation depth. Figure 8 also indicates that embedding QRAM introduces only a linear overhead in circuit depth, so the query latency remains asymptotically unchanged under qubit mapping and routing. Meanwhile, our H-tree embedding and teleportation scheme reduce the circuit depth exponentially, with only a constant-factor qubit overhead. With $m = 2\ell$ address qubits, unused qubits asymptotically make up 25% of the total: $(2^\ell - 1)^2$ out of $(2^{\ell+1} - 1)^2$. The proportion of unused qubits can be further reduced by the improved tree embeddings discussed in Sec 4.3.

Biased-noise simulation
As discussed in Section 5, the QRAM architecture is robust to Z-biased errors. We further validate this property by computing the virtual QRAM fidelity and comparing it with Baseline B, which is robust to arbitrary error channels and has the same scaling as the new virtual QRAM for Z errors. As expected, Baseline S exhibits no noise resilience. Figure 9 shows how the fidelity scales under Pauli X and Z errors for the different architectures, with error rate $\epsilon = 10^{-3}$. Moreover, our simulations in Figure 10 demonstrate the fidelity gap between Z-biased and X-biased noise, with far better performance under the former. Figure 11 illustrates the trade-off between the QRAM width m and the SQC width k under the single-qubit Z and X error models. The plot indicates that the fidelity decays exponentially faster when increasing the SQC width k than when increasing the QRAM width m.
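A back-of-the-envelope model helps reason about the m versus k trade-off. For illustration only, the sketch assumes a Z-error infidelity of roughly $\epsilon(m^2 + 2^k)$: polynomial in the QRAM width m, exponential in the SQC width k, with constant factors dropped. It then finds the split of a fixed total address width $n = m + k$ that minimizes this infidelity. The model ignores qubit cost (the QRAM part needs on the order of $2^m$ qubits), and the function names are hypothetical. It is not fitted to the data of Figure 11.

```python
def z_infidelity(eps: float, m: int, k: int) -> float:
    """Illustrative model: polynomial cost in m plus exponential cost in k."""
    return eps * (m * m + 2 ** k)


def best_split(n: int, eps: float = 1e-3):
    """Split a total address width n into (m, k) minimizing the modeled infidelity."""
    return min(((n - k, k) for k in range(n + 1)),
               key=lambda mk: z_infidelity(eps, *mk))


for n in (8, 16, 24, 32):
    m, k = best_split(n)
    print(f"n={n:2d}  best m={m:2d}, k={k:2d}  "
          f"infidelity~{z_infidelity(1e-3, m, k):.3f}")
```

Under this model, the optimal SQC width grows only logarithmically with n. Fidelity alone therefore pushes most address bits into the QRAM part, while the qubit budget pushes the other way.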

RELATED WORKS
Recently, Chen et al. [10] designed a similar "load-once" QRAM architecture that can query memory with more than 1 bit in each memory cell, i.e., with a data width greater than 1. Our work differs in that we target larger address widths, rather than repeating the data retrieval multiple times on the same address to fill more bus qubits. Notably, some previous works, e.g., Hann et al. [28], mentioned similar strategies for generalizing to larger data widths. Our virtual QRAM is compatible with data widths larger than 1, either by repeatedly querying memory cells one bit at a time or by incorporating the parallel retrieval of [10].
During the preparation of this work, Jaques and Rattew made similar claims about qubit mapping in QRAM in a recent paper [35]. One major limitation of QRAM highlighted in both our work and [35] is the signal latency of communicating qubits within the QRAM. [35] assumes a quantum bus line for communication, whose latency is linear in distance, whereas our work proposes a novel teleportation-based routing scheme to overcome this bottleneck. Via teleportation, we can propagate information across QRAM qubits faster (limited only by classical communication, i.e., the speed of light), with negligible quantum delay. This is an important step toward achieving quantum advantage with QRAM at relatively small query delay.
There have been significant advances in hardware development toward QRAM using superconducting devices and cold-atom arrays. For example, deterministic CSWAP operations between superconducting cavities, a key element of QRAM, have been demonstrated [17,58]. CSWAP operations using superconducting transmon devices have also been demonstrated [42]. Toffoli and CSWAP gates for QRAM have also been investigated using the Rydberg blockade.

CONCLUSION
The use of QRAM is ubiquitous in quantum algorithms. A successful implementation of practical QRAMs could unlock the full potential of quantum computing and bring us closer to realizing practical applications such as optimization, machine learning, and cryptography. Our proposed QRAM architecture addresses the challenges of memory capacity, query latency, and fault tolerance through innovations in virtualizing QRAM, latency-free mapping to 2D grid architectures, and leveraging the intrinsic biased-noise resilience of the circuits. We have shown an end-to-end systems architecture for performing high-fidelity queries of large memories. We have also identified key technology advances (such as reduced gate error rates or larger error correction code distances) needed to scale up QRAM and quantum computing platforms as a whole.

Figure 1: Requirement map for the quantum processing unit (QPU) and quantum random access memory (QRAM). The systems architecture for QRAM needs rethinking because QRAM presents a distinct set of architectural constraints. A larger radial coordinate indicates a relatively more stringent requirement; our QRAM design alleviates requirements in multiple dimensions.

Figure 2: Different quantum query architectures and their elementary units. (a) Quantum query example with address width 3. (b) Multiple-Controlled-X (MCX) gate as the unit of circuit-based query architectures. (c) Quantum router as the unit of router-based query architectures. (d) Sequential query circuit consisting of sequential MCX gates. (e) Parallel Bucket-Brigade QRAM consisting of recursive quantum routers.

Figure 4: The step-by-step procedure of a quantum query to virtual QRAM (k = 1, m = 2). (a) Load the m address qubits (orange) into the QRAM and prepare the specific data qubit (blue) for data retrieval. (b) Coherently write the classical data into the data qubits. (c) Retrieve data by copying it to the root qubit via CX gates; data is copied to the bus qubit conditioned on the state of the k address qubits (dark gray). (d) Uncompute the data retrieval and swap in the next memory segment. (e) Repeat the data retrieval in (c). (f) Uncompute the data retrieval.

Figure 5: Circuit for the architecture design of virtual QRAM, with SQC width k = 1 and QRAM width m = 2. (a) Circuit for the address loading stage. (b) Circuit for the data retrieval stage. (c) Outline of a full virtual QRAM circuit. (d) Classically controlled SWAP gates with dual-rail encoding. (e) Qubit states during query state preparation and data retrieval.

Figure 7: (a) and (b): Commutation relations for Z and CNOT gates. (c) and (d): Z-error behavior in QRAM. Because of the direction of the CX gates, a Z error only propagates to the subtree highlighted in red.

Figure 8: Additional operations after mapping onto 2D nearest-neighbor architectures. SWAP-based communication scales exponentially worse than teleportation-based communication.

Figure 9 :Figure 10 :
Figure 9: Fidelity comparison for different QRAM architectures. Fidelity decays polynomially under Z errors in both virtual QRAM and BB QRAM, but under X errors it does so only in BB QRAM.

Table 2: Resource overhead comparison between different implementations of virtual QRAM. All costs are in Big-O.