QuComm: Optimizing Collective Communication for Distributed Quantum Computing

Distributed quantum computing (DQC) is a scalable way to build a large-scale quantum computing system. Previous compilers for DQC focus on either qubit-to-qubit inter-node gates or qubit-to-node nonlocal circuit blocks, missing opportunities of optimizing collective communication, which consists of nonlocal gates over multiple nodes. In this paper, we observe that by utilizing patterns of collective communication, we can greatly reduce the amount of inter-node communication required to implement a group of nonlocal gates. We propose QuComm, the first compiler framework which unveils and analyzes collective communication patterns hidden in distributed quantum programs and efficiently routes inter-node gates on any DQC architecture based on discovered patterns, cutting down the overall communication cost of the target program. We also provide the first formalization of the communication buffer concept in DQC compiling. The communication buffer utilizes data qubits to store remote entanglement so that we can ensure enough communication resources on any DQC architecture to support the proposed optimizations for collective communication. Experimental results show that, compared to the state-of-the-art baseline, QuComm reduces the amount of inter-node communication by 54.9% on average, over various distributed quantum programs and DQC hardware configurations.

CCS CONCEPTS: • Computer systems organization → Quantum computing; • Software and its engineering → Compilers.


Figure 1: (a) The common distributed quantum computing architecture [36]. Nodes form a quantum network. Communication qubits emit photons, which are transferred through optical fibers, to establish remote EPR entanglement. Data qubits are used to store program information. (b) An exemplar distributed circuit of the decomposed CCZ gate.

INTRODUCTION
Quantum computing is promising and can be used to solve classically intractable problems [4,44]. One critical problem that hinders the practical application of quantum computing is the limited qubit resource of quantum computers. Distributed quantum computing (DQC) provides a promising way to scale up quantum computing and has been demonstrated in recent experiments [13,21,34]. Research attention on DQC is emerging both in hardware design [13,21,34,36] and program compilation [2,7,11,12,14,15,18,20,27,42]. Distributed quantum computing, as shown in Figure 1(a), integrates many independently fabricated quantum processors (aka compute nodes) to run quantum programs. DQC relies on inter-node quantum communication to share or move quantum data between compute nodes so that nonlocal gates (e.g., the CX gates in Figure 1(b)) become executable. In the common DQC model [2,7,11,12,14,15,18,20,27,42], each invocation of inter-node communication consumes one remote EPR pair, while the generation of remote EPR pairs between two different nodes is far more error-prone and time-consuming than applying local quantum gates [34]. To mitigate the infidelity caused by inter-node communication, researchers have investigated many strategies, e.g., designing more reliable communication hardware [36], employing quantum error correction (QEC) [8], and using EPR entanglement purification [31]. A much cheaper strategy to mitigate infidelity, however, is to reduce the amount of inter-node communication needed in a distributed quantum program through a DQC compiler. This strategy is always useful, regardless of whether we have built better hardware or enabled QEC/entanglement purification. This paper focuses on designing a compiler for reducing inter-node communication.
Unfortunately, existing compilers for DQC lack deep analysis of distributed quantum programs and are limited to either qubit-to-qubit communication or qubit-to-node communication. Most DQC compilers focus on either optimizing the qubit layout [2,7,11,12,14,27] to reduce nonlocal gates or shortening the communication footprint of applying each nonlocal gate [15,18]. Those works do not inspect the intrinsic communication patterns in distributed quantum programs. State-of-the-art DQC compilers such as [42] identify the burst communication pattern between one qubit and one node and propose executing a group of nonlocal gates together by 1-2 invocations of inter-node communication. Though significantly better than previous works, their work still does not consider the communication among multiple nodes, similar to other DQC compilers. Considering the example circuit in Figure 1(b), where there exists a group of nonlocal gates across three nodes (i.e., collective communication), existing DQC compilers [15,18,42] would require 5 invocations of inter-node communication for the 5 inter-node gates. In contrast, if we implement those inter-node gates by moving both q1 and q2 to q3's node and keeping them there, only 2 invocations of inter-node communication are needed.
Therefore, considering collective communication would enable a wider scope for optimizing distributed quantum programs and present optimization opportunities invisible from the low level. In this paper, we formally define a collective communication block as a group of inter-node gates whose inter-node qubit interaction forms a connected graph over multiple nodes. We require each collective communication block to absorb quantum gates as much as possible, as long as the overall communication cost is reduced. Moreover, we identify three key challenges for utilizing collective communication patterns to reduce inter-node communication. Firstly, at the program level, collective communication is usually not directly accessible. For many quantum circuits, e.g., those decomposed to the Clifford+T basis [30] or the CX+U3 basis [3], collective communication is hidden in the details of scattered inter-node CX gates. Secondly, for an arbitrary DQC network topology, it is unclear how to route collective communication based on its patterns, e.g., which node to place all involved qubits into so as to incur the least inter-node communication. No such problem exists in routing a single two-qubit gate. Finally, at the lowest level of implementation, the DQC system may not have enough resources to directly support a collective communication block, e.g., the block may require one node to hold several remote EPR pairs simultaneously. This is harsh considering the potentially limited number of communication qubits per node [21,34].
The identified challenges require three consecutive compiler optimizations. To this end, we developed the first compiler framework, named QuComm. As shown in Figure 2, QuComm consists of three key stages for collective communication optimization, which is unexplored by existing DQC compilers. The first stage is communication fusion, which inspects program information, aiming to unveil collective communication blocks from low-level circuit details. The insight is that a collective communication block should require less inter-node communication for execution, compared to implementing each nonlocal gate independently. The second stage is communication routing, which further incorporates DQC network topology information. The insight is that the overall communication footprint can be reduced by adapting the data transfer path of target collective communication to the underlying DQC architecture. The final stage is communication system design, which concerns the lowest-level implementation of routed collective communication blocks in view of intra-node communication resources. We formalize the concept of the communication buffer, which utilizes data qubits for buffering remote EPR pairs so that large collective communication blocks are executable. Evaluation shows that QuComm reduces the amount of inter-node communication by 54.9% on average, over various distributed quantum programs and DQC hardware configurations, compared to the state-of-the-art baseline [42].

BACKGROUND
In this section, we only introduce the essential background knowledge of distributed quantum computing (DQC). We refer the readers to [30] for the fundamentals of quantum computing. The qubit discussed in this section can be a logical qubit protected by QEC [8] or just a physical qubit.

Quantum entanglement and DQC: Inter-node quantum communication relies on remote EPR (Einstein-Podolsky-Rosen) entanglement. A remote EPR pair holds an entangled two-qubit state in a pair of qubits that belong to different quantum nodes. A remote EPR pair can be regarded as a quantum communication channel between nodes.
The DQC architecture is based on remote EPR entanglement. DQC nodes form a quantum network that could be of any topology by configuring remote EPR entanglement. For nodes without established EPR entanglement, we would use intermediate nodes that have prepared remote EPR entanglement to relay the transfer of quantum data. In a DQC node, not all physical qubits can serve to establish remote EPR entanglement [2]. Qubits able to construct remote EPR pairs are called communication qubits [10]. The compute node also contains data qubits, which are designed to store program information. For example, in Figure 3, q0 and q0' are communication qubits while q1 and q1' are data qubits.
Remote EPR entanglement and inter-node communication are the most error-prone parts of DQC. Besides improving communication hardware [36], researchers also propose two major software schemes to mitigate their infidelity. The first one [8] is to encode a group of physical qubits into a logical qubit by a QEC code and perform error correction after transferring the logical qubit by inter-node communication. The other one is entanglement purification [31], which performs error distillation on several established remote EPR pairs to generate a high-fidelity EPR pair. Physicists have demonstrated that these two schemes are equivalent [16]. This paper instead focuses on reducing inter-node communication through compiler optimization, which is compatible with the two mentioned software schemes.

Quantum communication protocols:
The quantum no-cloning theorem [30] makes quantum data not replicable between compute nodes. To overcome this limitation, two EPR-entanglement-based quantum communication protocols emerge. The first one is named Cat-Comm, which, as shown in Figure 3(a), utilizes the cat-entangler and cat-disentangler to first share the state of q1 to q0', perform the target controlled-unitary block, and then revoke the sharing. The second one, named TP-Comm, exploits quantum teleportation [30] to move qubits between compute nodes. As shown in Figure 3(b), TP-Comm first moves q1 to q0' and then performs the target unitary block. Cat-Comm is shown to be more efficient [17] if just read-only operations are performed on the qubit being shared, e.g., using the qubit as the control line of inter-node operations. In contrast, TP-Comm is more efficient if we need to perform both read and write on the qubit moved, i.e., the qubit may be used as the control line of one CX and the target line of another CX. Besides the difference in supported inter-node operations, another critical difference between Cat-Comm and TP-Comm is that Cat-Comm does not change the qubit layout while TP-Comm does. For both TP-Comm and Cat-Comm, one invocation transfers only one qubit's data and consumes one remote EPR pair.

Figure 4 caption (fragment): ... circuits in [41] with the OEE qubit-node mapping [32]. A dot (x, y%) on a curve means y% of multi-qubit gates involve at least x nodes. Results are averaged over circuits in [41].
In a distributed quantum program, both TP-Comm and Cat-Comm may appear in (be a part of) a general collective communication. To avoid ambiguity, in the following sections, when we say sharing one qubit to another node, we refer to Cat-Comm; when we say moving or teleporting one qubit to another node, we refer to TP-Comm.
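To make the Cat-Comm/TP-Comm trade-off concrete, the following is a minimal Python sketch (our own illustration, not QuComm's code) that picks a protocol for transferring a qubit based on whether the qubit is only read (used as a control line) or also written (used as a target line) by the remote gates; the RemoteGate representation and function name are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RemoteGate:
    control: str   # qubit acting as control line
    target: str    # qubit acting as target line

def pick_protocol(qubit: str, remote_gates: list[RemoteGate]) -> str:
    """Choose an inter-node transfer protocol for `qubit`.

    Cat-Comm shares a read-only copy (the qubit only appears as a control),
    while TP-Comm teleports the qubit when it is also written (appears as a
    target). Either way, one invocation consumes one remote EPR pair.
    """
    used_as_target = any(g.target == qubit for g in remote_gates)
    return "TP-Comm" if used_as_target else "Cat-Comm"

# Example: q1 only controls two remote CX gates, so a shared copy suffices.
gates = [RemoteGate(control="q1", target="q2"), RemoteGate(control="q1", target="q3")]
print(pick_protocol("q1", gates))  # Cat-Comm
```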

PROBLEM AND MOTIVATION
In this section, we study the collective communication hidden in distributed quantum programs and discuss the opportunities and challenges of collective communication optimization.

Collective Communication in DQC
Being essential, inter-node communication greatly degrades the fidelity of distributed quantum programs [22]. The main goal of this paper is to reduce the amount of inter-node communication in distributed quantum programs through efficient compilation. The insight of the compiler optimizations in this paper originates from our analysis of collective communication. In this paper, we formally define a collective communication block as a group of inter-node gates which has a connected inter-node interaction graph on qubits over multiple nodes. In the interaction graph, we draw an edge between any two qubits if they are involved in the same inter-node gate. We require the interaction graph to be connected since, in coherent collective communication, nonlocal gates should depend on each other (in terms of qubits). The definition of collective communication in this paper is more general and flexible than the one in [20], containing but not limited to broadcast, reduce, etc.
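As an illustration of this definition, the Python sketch below (our simplification, which ignores gate ordering and commutation constraints) builds the inter-node interaction graph of a gate sequence and returns its connected components as candidate collective communication blocks; the gate representation and qubit-to-node mapping format are assumptions.

```python
from collections import defaultdict

def collective_blocks(gates, node_of):
    """Group inter-node gates into candidate collective communication blocks.

    gates:   list of tuples of qubit names, e.g. [("q1", "q3"), ("q2", "q3")]
    node_of: dict mapping each qubit to its compute node
    Returns a list of blocks; each block is a list of inter-node gates whose
    qubit interaction graph is connected.
    """
    inter_node = [g for g in gates if len({node_of[q] for q in g}) > 1]

    # Union-find over qubits touched by inter-node gates.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    for g in inter_node:
        for q in g[1:]:
            union(g[0], q)

    blocks = defaultdict(list)
    for g in inter_node:
        blocks[find(g[0])].append(g)
    return list(blocks.values())

# Example: the decomposed CCZ circuit of Figure 1(b) spread over three nodes
# yields a single block, since its interaction graph is a connected triangle.
mapping = {"q1": "A", "q2": "B", "q3": "C"}
print(collective_blocks([("q1", "q3"), ("q2", "q3"), ("q1", "q2")], mapping))
```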
We observe that distributed quantum programs have abundant collective communication. By collecting statistics on various quantum circuits (arithmetic functions, encoding circuits, etc.) from the widely studied quantum benchmark suite [41] (with circuits of up to 143 qubits), we observe that on average 77.6%, 20.2%, and 10.6% of quantum gates involve more than 3, 6, and 9 qubits, respectively. This demonstrates the potential existence of abundant collective communication when executing these circuits on DQC hardware. Figure 4 shows the statistics of collective communication when the qubits of various quantum circuits in [41] are mapped to DQC systems with 4, 6, and 8 nodes (blue, orange, and red curves), with each node holding '# circuit qubits / # nodes' qubits. The qubit-node mapping uses the widely adopted OEE algorithm, which tries to maximally reduce inter-node quantum gates. As shown in Figure 4, about 28.4% of the multi-qubit gates require communication among more than 3 nodes when running circuits on 4 compute nodes. The percentage grows when we run these circuits on 8 compute nodes, where 49.8% of multi-qubit gates involve computation on more than 3 nodes.
In summary, we observe that the efficient implementation of collective communication in distributed quantum programs is critical to promoting DQC's computational potential.

Opportunities and Challenges
First, we can greatly reduce the amount of inter-node communication through pattern analysis of collective communication. Let us revisit the distributed quantum circuit in Figure 1(b). Realizing that the circuit forms a collective communication block over three nodes, we can decrease the amount of inter-node communication from 5 (if using existing DQC compilers [42]) to 2 (by placing all three qubits into the same node). This example demonstrates the importance of implementing inter-node gates collectively, i.e., analyzing the pattern of a group of inter-node gates as a coherent whole (i.e., collective communication) and finding the most communication-efficient way to implement them from a higher level. Second, another optimization opportunity emerges from the underlying system design. There are cases where the underlying DQC system may not have enough communication resources to support collective communication. For example, for the collective communication block in Figure 1(b), if each node has only one communication qubit and does not use data qubits to store the generated remote EPR pairs, then each node accommodates at most one EPR pair at any time, making it impossible to simultaneously move both q1 and q2 to the node holding q3. However, if we use two data qubits to buffer the EPR pairs generated by the communication qubit, it becomes possible for one node to hold two EPR pairs at the same time, making the collective communication in Figure 1(b) directly executable. Overall, an EPR pair buffer is critical to enable collective communication, especially for DQC nodes with a limited number of communication qubits.
While being promising for reducing DQC's communication overhead, the identified optimization opportunities also impose difficulties for the compiler design: 1) Collective communication is usually not directly accessible. For circuits decomposed to basic gates, collective communication is hidden in the details of scattered inter-node gates. Collective communication is also affected by qubit placement, gate ordering, etc.
2) Given a DQC network topology, it is unclear how to efficiently utilize collective communication patterns to route nonlocal operations.Existing DQC compilers lack a higher-level routing that considers both architecture information and communication patterns.
3) The DQC system design directly affects the efficiency of executing collective communication.This design needs high-level information like the collective communication blocks one node needs to accommodate, which is hard to extract without deep program analysis.

FRAMEWORK DESIGN
In this section, we introduce the compiler designs that tackle the identified challenges and enable efficient utilization of high-level information in collective communication to reduce inter-node communication in distributed quantum programs. The qubit in this section can be a logical qubit or just a physical qubit.
QuComm includes three stages: the communication fusion which is used to unveil collective communication from circuit details, the communication routing which utilizes identified collective communication patterns to route inter-node gates onto the underlying DQC architecture, and the communication system design that improves the efficiency of collective communication by buffering EPR pairs.

Communication Fusion
The availability of collective communication in distributed quantum programs is affected by various factors, e.g., whether the distributed program is decomposed or not, the qubit mapping onto each node, and the gate ordering. The insight for identifying collective communication is that inter-node gates forming collective communication should require much less inter-node communication when implemented collectively, compared to implementing each of them independently. Based on this insight, we adopt a greedy strategy to construct collective communication blocks, as shown in Algorithm 1.
To enable node-aware search while remaining general for any network topology among nodes, this stage only requires information about the maximum number of EPR pairs each node can accommodate at the same time, i.e., the EPR capacity of each node. The EPR capacity of a node also indicates the maximum number of external qubits the node can hold simultaneously. For an ideal DQC architecture where each node has infinite communication qubits, the EPR capacity per node is ∞. The output of Algorithm 1 is a series of collective communication blocks. There are two important steps in Algorithm 1: 1) Aggregation: This step maximizes collective communication opportunities through circuit rewriting (with rules in [29]), regardless of how scattered the input circuit is.
2) Fusion: This step is based on the insight that efficiently constructed collective communication blocks should always lead to less inter-node communication.
Let C(N_k) be the EPR capacity of node N_k; Q(B_0) be the number of qubits involved in block B_0; and Q(B_0 − N_k) be the number of qubits involved in B_0 but not residing in node N_k. We then define the cost of implementing the fused block B_0 + B_1 on a node N_k as in Equation (1). That is to say, if Q(B_0 + B_1) ≤ C(N_k), we can simply transfer all qubits involved in B_0 and B_1 to N_k; otherwise, for each remaining qubit, we perform an inter-node SWAP gate (by two TP-Comm or three Cat-Comm invocations) to exchange it into N_k (ref. the first summand in Equation (1)). Equation (1) is only an estimation of the implementation cost but serves as a good metric for identifying profitable fusion.
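Since the exact form of Equation (1) is not reproduced here, the following Python sketch gives one hedged reading of the cost model as described above: transfer each external qubit once while it fits within the EPR capacity, and fall back to inter-node SWAPs (assumed here to cost two TP-Comm invocations each) for the overflow. The function name and the exact accounting are our assumptions.

```python
def fusion_cost(block_qubits: set[str], node_qubits: set[str], epr_capacity: int) -> int:
    """Estimate inter-node communication for executing a fused block on one node.

    block_qubits: all qubits touched by the fused block B0 + B1
    node_qubits:  qubits already resident on the candidate node N_k
    epr_capacity: C(N_k), max EPR pairs (external qubits) N_k can hold at once
    """
    external = block_qubits - node_qubits          # Q(B0 + B1 - N_k)
    if len(external) <= epr_capacity:
        # Every external qubit can be shared/moved in with one invocation each.
        return len(external)
    # Fill the capacity first, then exchange each remaining qubit via an
    # inter-node SWAP, costing two TP-Comm invocations per qubit.
    overflow = len(external) - epr_capacity
    return epr_capacity + 2 * overflow

# Example: a fused block touching 4 external qubits on a node with capacity 2.
print(fusion_cost({"q1", "q2", "q3", "q4"}, set(), epr_capacity=2))  # 2 + 2*2 = 6
```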
To demonstrate Algorithm 1, let us consider the example circuit in Figure 6(a). The finalized circuit after the communication fusion stage is shown in Figure 6(b). The first collective communication block ① starts from the gates between node A and node B. The gates between node B and node C are merged into block ① since three invocations of inter-node communication are reduced when implementing them together, according to Equation (1). Block ① is further enlarged by incorporating the gates between node C and node D (note that the gate CX q_d1, q_c1 is aggregated into block ① by circuit rewriting). Block ① also contains the local gate CX q_b1, q_b2 since it does not incur extra communication. Unfortunately, the two gates between node A and node E cannot be merged into block ① as no communication reduction is observed. The gates between node A and node E then form block ②. Block ② is also a collective communication block if node A and node E are not directly connected, in which case we need intermediate nodes to relay the quantum data transfer.
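A hedged sketch of the overall greedy loop (our simplification; Algorithm 1 itself is not reproduced here) could look as follows: repeatedly try to merge the next candidate group of gates into the current block and keep the merge only when the estimated cost from the fusion_cost sketch above decreases. Block ordering and the rewriting rules of [29] are abstracted away.

```python
def greedy_fuse(blocks, qubits_of, node_qubits, epr_capacity):
    """Greedily fuse adjacent candidate blocks when the cost estimate improves.

    blocks:       ordered list of candidate inter-node gate groups (lists of gates)
    qubits_of:    function mapping a block to the set of qubits it touches
    node_qubits:  qubits resident on the candidate aggregation node
    epr_capacity: EPR capacity of that node
    """
    fused = [blocks[0]]
    for nxt in blocks[1:]:
        cur = fused[-1]
        separate = (fusion_cost(qubits_of(cur), node_qubits, epr_capacity)
                    + fusion_cost(qubits_of(nxt), node_qubits, epr_capacity))
        together = fusion_cost(qubits_of(cur) | qubits_of(nxt),
                               node_qubits, epr_capacity)
        if together < separate:
            fused[-1] = cur + nxt        # merge the two gate lists
        else:
            fused.append(nxt)            # start a new collective block
    return fused
```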

Communication Routing
With the identified collective communication blocks, we then need to route these blocks onto the underlying DQC network topology. The insight of this stage is to match the data transfer path of collective communication with the underlying DQC architecture so that the inter-node communication induced by routing overhead can be reduced. In this stage, we examine the pattern of collective communication and identify efficient routing optimizations correspondingly.

Routing an individual collective communication block:
The main idea is still to move/share all involved qubits to the same node, while we can transform the data transfer path to fit the underlying DQC network. The length of a data path p in the DQC network is computed as len(p) = Σ_{e∈p} w(e), where w(e) is the weight of inter-node link e along the path. The weight can be the distilled/raw EPR fidelity of each inter-node link, or just 1 for uniform DQC hardware. Without loss of generality, here we assume all link weights are 1. Overall, we identify three routing optimizations for data transfer paths. Firstly, we should select the node for qubit aggregation based on the underlying network topology information along with the EPR capacity of each node. For example, for the collective communication block ① in Figure 6(b), if the underlying DQC network is fully connected, it makes no difference whether we transfer all qubits to node B or node C. However, for the nearest-neighbor DQC architecture in Figure 7, transferring all qubits to node C is cheaper (ref. Figure 7(b)). With the node for qubit aggregation selected, we pick data transfer schemes as suggested by [42]: it is more efficient to use Cat-Comm for read-only data transfer and TP-Comm for writable data transfer. For example, q_a1 is transferred by Cat-Comm, while writable qubits are transferred by TP-Comm.
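To illustrate topology-aware selection of the aggregation node, a minimal sketch (assuming unit link weights and ignoring EPR capacity and early execution) can score each candidate node by the summed shortest-path distance from every involved qubit's home node, using plain BFS; the function names and the adjacency-list format are illustrative.

```python
from collections import deque

def hop_distance(adj, src, dst):
    """Shortest hop count between two nodes of the DQC network (BFS)."""
    if src == dst:
        return 0
    seen, frontier, hops = {src}, deque([src]), 0
    while frontier:
        hops += 1
        for _ in range(len(frontier)):
            for nxt in adj[frontier.popleft()]:
                if nxt == dst:
                    return hops
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    raise ValueError("disconnected network")

def pick_aggregation_node(adj, home_node):
    """Pick the node minimizing the total transfer distance of the block's qubits."""
    return min(adj.keys(),
               key=lambda n: sum(hop_distance(adj, home_node[q], n)
                                 for q in home_node))

# Example: a line topology A-B-C-D with one involved qubit per node.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(pick_aggregation_node(adj, {"qa": "A", "qb": "B", "qc": "C", "qd": "D"}))  # 'B' (ties with 'C')
```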
Secondly, enabling early execution along the data path can eliminate unnecessary data transfer. With the node for qubit aggregation determined (node C in Figure 7), the next step is to determine the data transfer path for each involved qubit. Figure 8 shows two different shortest data paths that share q_a1 to node C. Compared to the path in Figure 8(a), the path in Figure 8(b) enables the execution of the two CX gates between q_a1 and q_b1, q_b2 in node B. Since q_a1 is not involved in later inter-node operations, we can stop further transferring q_a1 to node C, saving one invocation of inter-node communication. Thus, we should consider not only the data path length but also the early-execution opportunities along the path when scheduling the transfer of a qubit.
Thirdly, the early execution strategy can be further extended to the parity computation process of large multi-controlled blocks to reduce the amount of inter-node communication. An n-qubit generalized Toffoli gate can be seen as computing the parity of its (n−1) control lines, and this parity computation can be decomposed by separating the control lines into different groups [6]. For example, CCCCX q0, q1, q2, q3, q4 can be decomposed into gates CCX q0, q1, q5; CCX q2, q3, q6; CCX q5, q6, q4; CCX q0, q1, q5; CCX q2, q3, q6, where the parity results of groups {q0, q1} and {q2, q3} are stored in q5 and q6, respectively. In this stage, we implement a group of multi-controlled gates collectively and adaptively on the underlying DQC architecture, which may not be fully connected, for the first time. As an example, given the collective communication block ③ in Figure 6(b), we would analyze the overall parity propagation and apply routing optimizations correspondingly: 1) Merge parity along the propagation path (i.e., early execution). As in Figure 9(a), when transferring the parity computed in node E to the node for qubit aggregation (i.e., node C), the parity data would pass through node B. Since node B also has parity data, we can combine the parity data from nodes E and B into one, as shown in Figure 9(b). One remote communication is thus reduced due to the reduced parity data. 2) Share parity across multi-controlled gates. For example, in Figure 9(b), the multi-controlled gates in block ③ both depend on the parity from the same group of control qubits; thus we can avoid recomputing and resending that parity data, saving one inter-node communication.
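The group-wise parity decomposition can be made concrete with a small sketch; below is a hedged, framework-free Python illustration that emits the CCX sequence described above for a generalized Toffoli whose controls are split into two groups of two, using one ancilla per group (gate names are plain strings, not tied to any particular toolkit).

```python
def decompose_cccc_x(controls, target, ancillas):
    """Decompose a 4-controlled X by computing group parities into ancillas.

    controls: four control qubits, split into two groups of two
    target:   target qubit
    ancillas: two ancilla qubits, one per control group
    Returns a list of (gate, qubits) tuples: compute the group parities,
    apply the central Toffoli on the ancillas, then uncompute the parities.
    """
    half = len(controls) // 2
    g0, g1 = controls[:half], controls[half:]
    a0, a1 = ancillas
    compute = [("CCX", (*g0, a0)), ("CCX", (*g1, a1))]
    central = [("CCX", (a0, a1, target))]
    uncompute = list(compute)  # re-applying the same gates restores the ancillas
    return compute + central + uncompute

# CCCCX q0..q3 -> q4 with ancillas q5, q6, matching the five-gate sequence above.
for gate in decompose_cccc_x(["q0", "q1", "q2", "q3"], "q4", ["q5", "q6"]):
    print(gate)
```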
After the collective communication block is executed, the node for qubit aggregation may be occupied by external qubits and cannot accommodate more EPR pairs or other external qubits. To release the occupation, we inspect future collective communication blocks and transfer those external qubits to the positions where they are needed by future multi-qubit gates. This process may also help reduce the transition overhead between collective communication blocks, as discussed below.

Routing transition between collective communication blocks:
When transitioning the routing from one collective communication block to the next, we can use data transfers that happened in the former block to reduce the routing overhead of the current block. For example, in Figure 10(a), after routing block ①, q_a1 is coherently present in both node A and node B. Therefore, to execute the CX in block ②, we can use Cat-Comm to share q_a1 from node B directly, saving one inter-node communication, as shown in Figure 10(b). Thus, when transitioning between two collective communication blocks, we should first inspect the qubit layout change (e.g., one qubit may coherently exist in multiple nodes) caused by the former block and then shorten the data transfer paths of the next block accordingly.
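One way to read this transition rule as code is sketched below (our own assumption, not QuComm's implementation): when a qubit has live coherent copies in several nodes after the previous block, the next block's transfer should originate from the copy closest to the new aggregation node, reusing the hop_distance helper from the earlier sketch.

```python
def best_source_copy(adj, live_copies, dest_node):
    """Pick the node holding a coherent copy of a qubit that is closest to dest_node.

    live_copies: nodes where the qubit currently (coherently) resides,
                 e.g. {"A", "B"} for q_a1 after block 1 in Figure 10(a)
    dest_node:   node that needs the qubit for the next block
    """
    return min(live_copies, key=lambda n: hop_distance(adj, n, dest_node))

# Example: q_a1 lives in nodes A and B; node E needs it and is adjacent to B.
adj = {"A": ["B"], "B": ["A", "C", "E"], "C": ["B"], "E": ["B"]}
print(best_source_copy(adj, {"A", "B"}, "E"))  # 'B' (one hop instead of two)
```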

Communication Buffer Design
As stated in Sec. 3.2, the number of communication qubits on each node may limit the efficient execution of collective communication. Our insight to overcome this limit is to use data qubits to buffer (through local SWAP gates) the remote EPR pairs generated by the communication qubits so that these data qubits can accommodate the data of external qubits. We say those data qubits form a communication buffer. This is the first formalization of the communication buffer concept in DQC compilers. The communication buffer essentially provides an abstraction, or intermediate layer, that approximates the ideal DQC hardware (the one with infinite communication qubits). As long as the communication buffer is large enough, we can implement collective communication without inter-node qubit swapping. The size of the communication buffer requires careful design. Using too many data qubits in the communication buffer would leave each node fewer qubits to store program information, thus requiring more nodes to support the same program. Intuitively, for a given program, using more nodes would induce more inter-node communication. However, if we use only a few data qubits in the communication buffer, the communication reduction from optimizing collective communication would be small as well. To balance these two effects, we propose a program-adaptive communication buffer design so that the communication buffer in each node is just large enough to support the collective communication in the program.
In this stage, we first run the communication fusion stage assuming each node has a large number of communication qubits, obtaining the collective communication blocks related to each node. We then configure the communication buffer of each node, starting from the node associated with the most inter-node gates, with the following steps (a sketch of this loop is shown after the list): 1) For a node N_i, find the qubit q0 in N_i which incurs the least increment (say, Δ(q0)) of inter-node communication when it is moved to another node (say, N_j) with idle data qubits. Δ(q0) can be easily computed by counting the multi-qubit gates that involve q0 in N_i and N_j.
2) According to Equation (1), re-inspect the collective communication blocks associated with N_i to compute the overall inter-node communication reduction obtained by adding one data qubit to the communication buffer of N_i. Denote the reduction by R(N_i).
3) If Δ(q0) ≤ R(N_i), we place q0 into N_j and add one data qubit to the communication buffer of N_i. Repeat this process until there is no further improvement.
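The following Python sketch captures one hedged reading of this configuration loop; the cost-delta and reduction estimators are passed in as callables because Equation (1) and the exact gate-counting procedure are only summarized above, and the names delta_of_evicting and reduction_with_buffer are our own.

```python
def configure_buffer(data_qubits, delta_of_evicting, reduction_with_buffer):
    """Grow a node's communication buffer while eviction cost <= communication saving.

    data_qubits:              set of program qubits currently mapped to the node
    delta_of_evicting(q):     extra inter-node communication if qubit q is moved away
    reduction_with_buffer(k): communication saved with k buffer qubits on the node
    Returns (buffer_size, evicted_qubits).
    """
    buffer_size, evicted = 0, []
    while data_qubits:
        # Step 1: cheapest qubit to evict to a node with idle data qubits.
        q0 = min(data_qubits, key=delta_of_evicting)
        # Step 2: saving from one more buffer qubit, per Equation (1).
        gain = reduction_with_buffer(buffer_size + 1) - reduction_with_buffer(buffer_size)
        # Step 3: stop once eviction costs more than the buffer saves.
        if delta_of_evicting(q0) > gain:
            break
        data_qubits.remove(q0)
        evicted.append(q0)
        buffer_size += 1
    return buffer_size, evicted
```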
The proposed communication buffer design increases the size of a communication buffer if and only if the amount of inter-node communication is reduced. The whole configuration process is computationally cheap and iterates over the nodes linearly. The iteration cost on each node is bounded by the size of the related collective communication blocks, which is often a constant factor. Table 1 shows an example of configuring the communication buffer on node C. The overall inter-node communication is reduced when moving q_c4 to another node. However, we may not gain any benefit by moving q_c3. The configuration process thus terminates.

EVALUATION
In this section, we compare the performance of QuComm to the baseline [42] and analyze the effect of optimizations proposed in QuComm.

Experiment Setup
DQC hardware model. For evaluation, we adopt the mesh-grid network [24] for DQC. In the DQC architecture, we assume 8 compute nodes and 40 data qubits per node. We assume each data qubit is a logical qubit protected by QEC codes [8,19]. Thousands of physical qubits may be required to build one logical qubit. We also assume each compute node has an independent magic state distillation unit [9] to enable local logical T gates. Further, we assume each node can only establish communication with neighboring nodes. We consider configurations of 1, 3, or 5 logical communication qubits per node to evaluate the performance of QuComm on DQC systems with limited or abundant communication resources. Since our work focuses on communication optimization and only concerns the amount of inter-node communication, we do not make any assumption about the logical qubit topology inside each node. We consider more DQC architecture options in Section 5.3 to evaluate the benefit of QuComm.

Benchmark programs. The fault-tolerant benchmark programs used in the evaluation are obtained from [41] and summarized in Table 2. Programs in Table 2 include the quantum XOR gate, the quantum ripple carry adder (RCA), the quantum Fourier transformation (QFT) algorithm, and Grover's algorithm. For Grover, we consider the secret string with all ones and repeat the iteration 1000 times. All programs are decomposed into the Clifford+T basis [30], except for the raw XOR gate (XORR), which is specifically decomposed toward the DQC architecture according to [37]. We further consider near-term applications in Section 5.3 to evaluate the impact of QuComm on the NISQ (Noisy Intermediate Scale Quantum) era [35].

Baseline. As the baseline, we implement the DQC compiler AutoComm [42]. AutoComm groups a series of remote CX gates between one qubit and one node into a burst communication block and implements the burst communication block (all remote CX gates in it) by at most two invocations of communication protocols. AutoComm represents the state-of-the-art effort in optimizing quantum communication overhead in distributing quantum programs, as far as we know. However, without the communication buffer, AutoComm cannot optimize more general collective communication that may involve more than two compute nodes. We adopt the same circuit partition algorithm, the OEE algorithm [32], which maximally reduces inter-node gates induced by partition, for both QuComm and AutoComm, in order to eliminate the difference caused by the circuit partition.

Metric. We use the number of invocations of inter-node communication protocols (Cat-Comm or TP-Comm), i.e., the amount of inter-node communication based on logical qubits, to characterize the communication overhead of the compiled distributed quantum circuits. The amount of inter-node communication in a quantum program is equivalent to the number of EPR pairs (on logical qubits) required to execute the program on DQC hardware.

Notations. Before diving into results, we first introduce some notations and abbreviations. The communication reduction by QuComm refers to '1 − (# comm by QuComm / # comm by baseline)', with 'comm' meaning communication. '# comm lqb/node' means the number of logical communication qubits per node. For simplicity, we use L1, L2, and L3 to denote QuComm's stages: communication fusion, communication routing, and communication buffer design.

Compared to Baseline
In this section, we analyze the relative communication reduction of QuComm compared to the baseline and discuss the effect of the designs in QuComm. Tables 2 and 3 summarize the results of QuComm. Overall, QuComm significantly reduces the amount of inter-node communication across all benchmarks and device configurations tested, compared to the baseline. QuComm on average reduces the amount of inter-node communication by 60.8%, 52.6%, and 51.3% on the configurations of 1, 3, and 5 communication qubits per node, respectively.

The effect of program patterns and L1 optimization. QuComm behaves differently on programs with distinct patterns. Collective communication in programs can be classified according to the extent of qubit correlation. For strongly correlated distributed quantum programs, e.g., XOR and Grover in Table 2, each collective communication block has an inter-node interaction graph (on logical qubits) close to the complete graph. For example, for a decomposed three-qubit XOR gate (i.e., the Toffoli gate) distributed over three nodes, its inter-node interaction graph is a triangle since there are inter-node gates on each pair of the three involved logical qubits. In contrast, for loosely correlated distributed quantum programs, e.g., RCA and QFT in Table 2, the interaction graph of each collective communication block has few edges compared to the vertex count.
Strongly correlated distributed quantum programs gain more benefits from the L1 pass of QuComm. With abundant communication resources (e.g., 3 or 5 communication qubits per node), the communication reduction by 'QuComm L1' on strongly correlated distributed programs is on average 59.8% higher than on loosely correlated distributed programs. The reason for the discrepancy is that strongly correlated collective communication benefits more from the implementation that aggregates qubits to the same node. For loosely correlated distributed programs, the communication block discovered by QuComm L1 is almost identical to the qubit-to-node burst communication [42] and cannot benefit much from the L1 stage. Overall, L1 offers QuComm a significant communication reduction of 48.3% (averaged over various program sizes and '# comm lqb/node') on strongly correlated distributed programs (which have abundant collective communication), compared to the baseline, which can only handle burst communication.

The effect of L2 optimization. The L2 optimization also provides a great reduction in inter-node communication, especially on loosely correlated distributed programs. With abundant communication resources (3 or 5 logical communication qubits per node), the communication reduction for loosely correlated distributed programs by L2 is on average 43.4%, far surpassing the 3.5% for strongly correlated distributed programs. On the other hand, compared to the baseline, which lacks communication-aware routing, L2 offers QuComm an average 28.9% communication reduction on loosely correlated programs, even though the communication fusion of QuComm on loosely correlated programs is similar to the baseline's.
Results in Table 3 validate the designs of L2 by comparing 'QuComm L1+L2' to 'QuComm L1'. QuComm L2 moves logical qubits to the next inter-node communication position by examining current and future communication patterns. For RCA, this prevents the frequent back-and-forth TP-Comm caused by the baseline, leading to a 50.0% communication reduction in the RCA benchmark (see Table 3, Columns 6-7).
Moreover, for QFT, where q_i controls all q_j (j > i), QuComm L2 utilizes existing read-only copies of the control line to shorten the qubit transfer path, while the baseline does not have such a capability. This optimization on average leads to a 61.0% communication reduction on the 300-qubit QFT program (see Table 3, Columns 6-7). Further, XORR evaluates the effect of early execution in L2. The early execution strategy leads to a 34.2% communication reduction in the XORR benchmark, on average (see Table 3, Columns 6-7). Finally, for the XOR and Grover benchmarks, the benefit of L2 lies in selecting the proper node for qubit aggregation. For 3 or 5 logical communication qubits per node, the topology-aware node selection in L2 on average leads to an 8.3% communication reduction on the 100-qubit XOR and Grover programs.

The effect of L3 optimization. The communication buffer is more important for DQC systems where limited communication qubits are available. With the buffer, 'QuComm L1+L2+L3' significantly reduces the communication overhead compared to 'QuComm L1+L2' (by 21.5% on average) and AutoComm (by 54.9% on average) over all device configurations and programs tested, as shown in Table 3. This is because the routing and implementation of both burst and collective communication are severely affected when communication (buffer) qubits are lacking. Further, we show the size of the designed communication buffer in Figure 11.
As shown in Table 3, the importance of the communication buffer increases as the number of communication qubits decreases. Compared to 'QuComm L1+L2', 'QuComm L1+L2+L3' further reduces the amount of inter-node communication by 1.6%, 3.3%, and 50.6% on average for the configurations of 5, 3, and 1 communication qubits per node, respectively. For the DQC system with only one logical communication qubit per node, the communication buffer not only facilitates the direct execution of collective communication blocks but also provides resources to relay quantum data transfer. Further, from Figure 11, we can draw two observations: a) the size of the communication buffer grows as the program size increases; b) strongly correlated distributed programs require more help from the communication buffer than loosely correlated distributed programs.
Finally, we claim that our communication buffer design will not hurt scalability. Firstly, our design features an input-adaptive buffering module which is configured to favor using idle data qubits on each node instead of requesting extra nodes. Secondly, as shown in Figure 11, a small buffer that uses less than 2% of the data qubits generally suffices for reducing communication at a large scale. Lastly, even without the communication buffer, 'QuComm L1+L2' still greatly reduces the communication overhead compared to the baseline, as shown in Table 3.

Additional Studies
In this section, we further evaluate the performance of QuComm on heterogeneous DQC architectures and discuss the impact of QuComm on the NISQ era.

The effect of heterogeneity for QuComm. In Figure 12(a), we consider a heterogeneous DQC system consisting of eight compute nodes, with 20, 20, 30, 30, 50, 50, 60, and 60 logical qubits, respectively. For comparison, the homogeneous DQC system comes with eight nodes, all with 40 logical qubits. We assume one logical communication qubit per node for both the heterogeneous and homogeneous DQC systems. As shown in Figure 12(a), compared to the baseline, QuComm significantly reduces inter-node communication in both heterogeneous and homogeneous DQC systems. This is because collective communication is widely available in distributed programs mapped to DQC systems, and node heterogeneity may not hurt collective communication optimization.
We further evaluate the performance of QuComm on diverse DQC networks where the connectivity of nodes may be heterogeneous. We consider the fully connected and star-like network topologies in addition to the mesh-grid topology. For all network topologies, we assume 8 compute nodes, 40 logical qubits per node, and 1 logical communication qubit per node. As shown in Figure 12(b), compared to the baseline, QuComm achieves significant communication reduction on all considered DQC architectures. We can see that the sparser a DQC network is, the more benefit QuComm can provide. QuComm achieves the largest communication reduction on the mesh-grid topology, whose average node-to-node distance is 2.0, longer than the 1.86 and 1.0 of the star-like and fully connected topologies, respectively.

The effect of QuComm on NISQ. We further evaluate QuComm on near-term applications and devices. In this evaluation, we assume the data qubits of compute nodes are physical qubits. Inter-node communication protocols are directly executed on physical qubits. Near-term programs are decomposed into the CX+U3 basis [3]. For both Figure 13(a) and (b), we assume the mesh-grid network topology with 8 compute nodes, 40 data qubits per node, and 1 communication qubit per node. Figure 13(a) shows the required error rate of inter-node communication to achieve a specified overall communication fidelity for the 300-qubit QAOA (Quantum Approximate Optimization Algorithm) program. The result in the figure indicates that, to ensure the same level of overall communication fidelity, QuComm can admit an on average 174.1% higher inter-node communication error rate compared to the baseline. Further, with QuComm's communication buffer design, we can equip each DQC node with fewer communication qubits. Thus, with QuComm, it may be possible to demonstrate DQC in the near term.
Figure 13(b) demonstrates the results of QuComm on the IBM heavy-hexagon architecture [3], compared to the baseline. Gate latency and fidelity are derived from [42]. Other experiment settings follow those in Figure 13(a). For test programs, we consider the 300-qubit BV (Bernstein-Vazirani algorithm) and QAOA. As shown in Figure 13(b), QuComm does not necessarily induce more local SWAP gates than the baseline. On the one hand, the communication buffer may increase the SWAP overhead of performing CX between same-node data qubits, since qubits in the buffer may block the SWAP path (see BV300 in Figure 13(b)). On the other hand, when executing inter-node CX gates, we need to move data qubits closer to the communication buffer. The SWAP overhead between data qubits and buffer qubits is less than that between data qubits and the communication qubit, since the buffer spans the device area used for inter-node communication. This reduction of local SWAP overhead outweighs the overhead of swapping EPR pairs from the communication qubit into buffer qubits if many inter-node CX gates are executed (see QAOA300 in Figure 13(b)).
Furthermore, as shown in Figure 13(b), the local SWAP overhead change induced by QuComm is minor (<0.05%) because of the small average size of the communication buffers (<2 per node, as shown in Figure 11). QuComm always tries to use a small communication buffer since a large communication buffer may instead lead to more inter-node communication, as discussed in Section 4.3. This observation, on the other hand, indicates that the program latency reduction and fidelity improvement (in Figure 13(b)) mainly stem from inter-node communication reduction.

Comparing QuComm to more DQC compilers. We further compare QuComm to two more recent DQC compilers, called GP-CAT [2,18] and GP-SWAP [5,18] for simplicity. GP-CAT executes each inter-node CX gate by solely using Cat-Comm. GP-SWAP executes remote CX gates by swapping qubits between nodes to make the remote CX local. The experiment setting is the same as that for Table 3. As shown in Figure 14, compared to GP-CAT and GP-SWAP, QuComm significantly reduces inter-node communication, on average by 4.88x. Specifically, compared to GP-CAT, the communication reduction by QuComm scales with the inter-node gate count in each program. For programs with denser inter-node communication, e.g., QFT, QuComm can build larger collective communication blocks for more aggressive communication reduction. Compared to GP-SWAP, the benefit of QuComm comes from its communication-aware routing, which avoids repeated and unnecessary quantum data movement between quantum nodes by utilizing the higher-level program information provided by the uncovered collective communication blocks.

Sensitivity analysis of QuComm. Finally, we study the sensitivity of QuComm to the number of nodes and the number of data qubits per node, with results shown in Figure 15. In the figure, we consider XOR and QFT as test programs, which represent strongly correlated and loosely correlated distributed programs, respectively. We also consider the star DQC architecture since it has a deterministic shape when the node size changes and has similar performance to the mesh-grid architecture, as shown in Figure 12(b). One logical communication qubit per node is assumed, while the other settings follow those for Table 3.
As shown in Figure 15, the communication reduction by QuComm is stable for the strongly correlated distributed program. This is because the collective communication uncovered by QuComm is consistently better than the burst communication of [42]. For example, for the decomposed CCZ gate in Figure 1, our collective communication optimization needs 60% fewer communication invocations than the burst communication optimization of [42], no matter how '# nodes' and '# data qubits per node' change. In contrast, for loosely correlated programs, the advantage of QuComm mainly comes from its communication-aware routing. For QFT, the amount of inter-node communication increases quadratically with respect to the number of nodes, outpacing the communication reduction by QuComm's routing. Thus, as shown in Figure 15, the benefit of QuComm decreases as the number of nodes increases. On the other hand, keeping the number of nodes fixed, the benefit of QuComm is stable as the number of data qubits per node increases, as shown in Figure 15. This is as expected, since '# data qubits per node' does not affect QuComm's routing. Overall, for various combinations of '# nodes' and '# data qubits per node', QuComm is always better than [42].

RELATED WORK
Compilers for single-chip quantum computing. A variety of quantum compiler optimizations [1,3,23,25,28,38,45] have been developed for the single-node setting over the last few years. They investigate different inter-qubit features (e.g., qubit topology, two-qubit gate fidelity between qubits) to optimize the mapping and routing when compiling quantum programs to a single-node quantum computer. Those compilers may be extended to DQC by using SWAP gates to make inter-node gates local [5,18]. Unfortunately, such an extension is not efficient for DQC [42]. Further, unlike this paper, these compilers ignore collective communication.
Compilers for DQC. Compilers for DQC can be mainly divided into two categories. Works in the first category [2,5,7,11,12,14,27] ignore the low-level quantum communication protocols and perform program optimizations at the logical level. These works focus on qubit placement and do not consider advanced transformation and routing strategies for reducing inter-node communication. These works are orthogonal to our work. Works in the second category [15,18,20,42] consider communication optimization when routing programs to physical DQC hardware. These works only consider qubit-to-qubit or qubit-to-node communication and lack collective communication optimization.

Complexity of QuComm. For each node, L3 estimates the communication reduction of relocating a qubit according to Equation (1), whose cost is bounded by the size of the related collective communication blocks. In the worst case, the buffer design process iterates over every node and, for each node, over its data qubits; in the best case, the buffer design for each node ends in 1-2 iterations. L3's space overhead only comes from tracking the assignment of communication buffer qubits and the remote gates of each node. Overall, QuComm's running time grows polynomially with the program's gate count; note that the qubit count of a program is typically far smaller than its gate count.

Impact of QuComm on QEC
Further, we study the effect of the communication reduction by QuComm on the QEC requirement. As a case study, we use the surface code as the underlying QEC facility, where p_L ≈ 0.03 (p/p_th)^{d/2}. Here p_L is the error rate of the logical qubit; p_th is the error threshold of the surface code and can be set to 0.01; p is the physical error rate; and d is the code distance. In the following reasoning, we assume the distributed program has g local gates, r remote gates, and c invocations of inter-node communication protocols under QuComm.
First, QuComm enables a smaller code distance, or tolerates a higher physical error rate, for remote communication while preserving the overall program fidelity. We assume the overall communication fidelity we want to achieve is 1 − ε. Then QuComm reduces the code distance for remote communication by (1 − log(ε/(0.03c)) / log(ε/(0.03r))) × 100%. For ε = 0.01 and the programs in Table 2, the code distance reduction by QuComm is on average 29.5%, up to 47.1%. On the other hand, QuComm can tolerate a ((r/c)^{2/d} − 1) × 100% higher physical error rate for remote communication than the unoptimized case. For d = 5 and the programs in Table 2, the (tolerated) physical error rate upper bound of communication by QuComm is on average 238.6% higher, up to 1488.1%.
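A short derivation of these two expressions, under the stated surface-code model and the assumption that the error budget ε is split evenly across the communication invocations, is sketched below.

```latex
\[
  0.03\left(\frac{p}{p_{\mathrm{th}}}\right)^{d/2} \le \frac{\varepsilon}{x},
  \qquad x \in \{r, c\}
  \;\;\Longrightarrow\;\;
  d_x \ge \frac{2\log\!\bigl(\varepsilon/(0.03x)\bigr)}{\log(p/p_{\mathrm{th}})} .
\]
\[
  \text{Code-distance reduction}
  = 1 - \frac{d_c}{d_r}
  = 1 - \frac{\log\!\bigl(\varepsilon/(0.03c)\bigr)}
             {\log\!\bigl(\varepsilon/(0.03r)\bigr)},
  \qquad
  \text{tolerated-}p\text{ ratio at fixed } d = \left(\frac{r}{c}\right)^{2/d}.
\]
```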
Finally, QuComm reduces the need for EPR pair generation by at least r − c pairs. For the same overall EPR pair fidelity of 0.99, the same 2-to-1 entanglement purification protocol [43], and the programs in Table 2, QuComm can tolerate an on average 419.5% (up to 3059.4%) higher error rate for EPR generation.

Trade-off and Future Work
QuComm is designed to be scalable for inter-node communication optimization. Although our framework significantly surpasses existing works, there remain many trade-offs in the compiler design, leaving space for future work.

More fine-grained optimizations. QuComm's optimizations are based on the heuristic function in Equation (1), which estimates the implementation cost of a collective communication block. It is possible to provide better estimates and optimizations with more complex methods, e.g., SMT solvers or neural networks, when scalability is not a critical concern.

Co-designing with QEC. To ensure the generality of our framework, we make few assumptions about the underlying QEC structure of the DQC architecture. For a specific QEC code, we can also optimize the communication cost of primitive QEC operations, e.g., logical gates and stabilizer measurements. Also, for communication routing, it is interesting to determine the best locations to place magic state distillation units.

DQC hardware-software co-design. For wide applicability, our framework assumes a general DQC architecture. We can use QuComm to guide DQC network design: modifying the network topology and inspecting the communication cost of distributed programs compiled by QuComm on the modified topology.

CONCLUSION
We present the first DQC compiler for optimizing collective communication in distributed quantum programs. We propose two stages (communication fusion and routing) to unveil and utilize patterns of collective communication to reduce the amount of inter-node communication. We propose the communication buffer to further improve the efficiency of collective communication. Experimental results show that QuComm reduces the amount of inter-node communication by 54.9% on average, over various distributed programs and DQC hardware configurations.

Figure 2: The compilation flow of quantum circuits on the DQC architecture and the overview of QuComm. The flow includes optional quantum circuit transformations, e.g., gate unrolling, gate cancellation, unitary synthesis, burst communication, etc.

Figure 3: (a) The Cat-Comm protocol. (b) The TP-Comm protocol. For simplicity, we use a block to denote a group of gates.

Figure 5: Two examples of routing the collective communication block in Figure 1(b) onto the nearest-neighbor architecture. (a) Data path: move q1 and q2 to node C. (b) Data path: move q1 and q3 to node B.

Figure 6: (a) An example distributed circuit for illustrating Algorithm 1. Qubits q_a*, q_b*, q_c*, q_d*, and q_e* are in nodes A, B, C, D, and E, respectively. We assume each node's EPR capacity is 5. (b) The circuit after communication fusion.

Figure 7: Examples of selecting the node for qubit aggregation on a nearest-neighbor DQC architecture, targeting block ① in Figure 6(b). Gray curves represent data paths. (a) When node B is selected. (b) When node C is selected.

Figure 8: Two shortest paths for sharing q_a1 to node C. The path in (b) enables early execution of CX gates involving q_a1 on the way to node C. Gray arrows between data qubits mean CX gates. Data transfer paths for other qubits are omitted.


Figure 10: (a) The qubit layout after executing collective communication block ①. (b) It is shorter to transfer the read-only copy of q_a1 (by Cat-Comm) from node B to node E rather than from node A to node E.

Figure 11: The communication buffer design results of QuComm. 'MAX', 'AVG', and 'TOT' denote the maximum size, average size, and total size of communication buffers in the DQC system.

Figure 12: Communication reduction by QuComm on heterogeneous DQC systems, compared to the baseline [42]. (a) The effect of heterogeneous node size. (b) The effect of diverse node connectivity.

Figure 13: (a) The effect of nonlocal communication on the near-term application. (b) The effect of local SWAPs on DQC nodes with the IBM architecture [3]. Results are obtained by comparing QuComm to the baseline [42].

Table 1: The configuration process on node C, assuming each node has only 3 communication qubits, i.e., the EPR capacity is 3. Blocks ① and ③ are from Figure 6(b), where each node's EPR capacity is 5. '# inter-node comm' is derived from Sec. 4.2.

Table 2: Fault-tolerant benchmarks and results by QuComm.