Optimized compiler for Distributed Quantum Computing

Practical distributed quantum computing requires efficient compilers, able to make quantum circuits compatible with given hardware constraints. This problem is known to be hard, even for local computing. Here, we address it on distributed architectures. As generally assumed in this scenario, telegates represent the fundamental remote (inter-processor) operations. Each telegate consists of several tasks: i) entanglement generation and distribution, ii) local operations, and iii) classical communications. Entanglement generation and distribution is an expensive resource, as it is time-consuming and fault-prone. To mitigate its impact, we model an optimization problem that combines running-time minimization with the usage of that resource. Specifically, we provide a parametric ILP formulation, where the parameter denotes a time horizon (or time availability); the objective function counts the number of used resources. To minimize the time, a binary search solves the subject ILP by iterating over the parameter. Ultimately, to enlarge the solution space, we extend the formulation by introducing a predicate that manipulates the input circuit and parallelizes the telegates' tasks.

Even though quantum processors are already available, distributed architectures are at an early stage and must be discussed from several perspectives to grasp what we need. A promising forecast of what such architectures will look like is based on telegates as fundamental inter-processor operations. Each telegate relies on several tasks: (i) the generation and distribution of entangled states among different processors, (ii) local operations and (iii) classical communication. Such tasks make the telegate an expensive resource, especially in terms of running time. As a consequence, they have a critical impact on the performance of the overall computation. In contrast to this limit, telegates offer remarkable opportunities for parallelization. In fact, much circuit manipulation is possible to keep computation independent from the telegates' tasks. Therefore, we aim to model an optimization problem that embeds such opportunities.

A. Contribution
We give a deep analysis of what can be done to mitigate the overhead caused by telegates, which are the main bottleneck to computation on distributed architectures.
Figure 1 gives a step-by-step overview of our work, with particular attention to the problem modeling. As is reasonable, we begin with some minimal assumptions (Section II). Namely, as computation model we consider quantum circuits with a universal operator set available. The set is based on local operations and, as said above, telegates as the fundamental inter-processor operations. Here, we optimize telegates to efficiently scale with connectivity restrictions.
We move on by rigorously defining the problem (Section IV). To come up with our formulation we rely on a wide literature from the Operations Research field dealing with network scenarios. Specifically, we noticed several analogies between our problem and those on dynamic networks, especially the group of multi-commodity flow problems [37], [38], [39], [40], [41], [42], [43], [44], [45], [46]. The resulting formulation is particularly remarkable, as it models an interest in minimizing the running time, while keeping the minimization of resource usage as a side objective. In this step, the formulation is abstract, as it relies on binary relations that are not fully characterized. We believe that this enhances the modularity of the work and its readability. In fact, exploring the solution space is a combination of resource-availability checks and circuit manipulation; the latter is a hard task on its own and deserves a dedicated discussion. For this reason, the further step is the characterization of those binary relations (Section V). Namely, through these relations, it is possible to discriminate between feasible and unfeasible manipulations. First, we relate all the operations to follow the logic induced by the order of occurrence. After that, we also relate operations to discriminate whether they can run in parallel or not. This work leads us to generalize the concept of parallelism to a new proposal of ours: quasi-parallelism. This relation is based on (automated) circuit manipulation which aims to gather telegates within the same time step. The final outcome is a full characterization of the solution space. We evaluate the quality of the solutions available with quasi-parallelism against those without it, observing up to an exponential improvement in the objective function. In conclusion, we notice that the final objective value of a calculated solution is a metric for the running time of the computation. This makes our model particularly interesting from a practical perspective.

II. DISTRIBUTED QUANTUM COMPUTING ESSENTIALS
In this section we describe the main elements featuring in a distributed quantum architecture.
One can encode a quantum processor as a set of qubits and a set of sparse tuneable couplings among qubits. If two qubits are coupled, it means that they can interact. We will refer to such couplings as local couplings, to emphasize that they belong to the same node in distributed architectures, as opposed to entanglement links, which are couplings between qubits in different processors. As detailed in the next sub-section, two remote qubits coupled through an entanglement link cannot be used for computation: consequently, it is useful to classify qubits as either computation qubits or communication qubits, respectively. While computation qubits process information during the computation, communication qubits couple distinct processors through entanglement. Figure 2 shows a toy architecture. The purple lines represent the couplings among distributed processors. We refer to such lines as entanglement links, as detailed in the next sub-section.

A. The entanglement link
To couple two processors, a communication protocol is necessary, known as entanglement generation and distribution [1], [47], [2]. We describe it here as three main steps: 1) generating a two-qubit maximally entangled state; 2) splitting the state between distributed processors; 3) storing the partial states in the communication qubits. When the protocol succeeds, the distributed qubits are correlated and can be exploited to perform non-local operations. For this reason we consider this correlation as a virtual link, which we refer to as an entanglement link. Entanglement links extend the possible interactions to any distributed computation qubits. Specifically, since the communication qubits are locally coupled with computation qubits, with entanglement links one can perform operations between distributed computation qubits, referred to as telegates. More details on the functioning of telegates are reported in Section III-B. However, it is important to keep in mind that, to perform a remote operation, one has to measure the states stored in the communication qubits. As a consequence, an entanglement link is a depletable resource, assigned to a single remote operation. After the measurement, a new round of entanglement generation and distribution takes place.
We now give a mathematical description of a distributed architecture, in order to formally describe the functioning of telegates.

Fig. 2: Toy distributed quantum architecture with 3 processors.

B. Mathematical description
So far, we presented the main elements occurring in a distributed quantum architecture, which we can now represent mathematically. Formally, let N = (V, P, E) be a network triple representing the architecture. V = Q ∪ C is a set of nodes describing qubits; therefore it is the union of computation qubits Q = {q_1, q_2, ..., q_|Q|} and communication qubits C = {c_1, c_2, ..., c_|C|}. We can represent n processors by partitioning V into P = {P_1, P_2, ..., P_n}. Therefore, a subset P_i characterizes a processor as its set of qubits/nodes. E = L ∪ R is a set of undirected edges. L represents the local couplings; therefore L ⊆ ⋃_i (P_i × P_i). Notice that there is no particular assumption on connectivity nor cardinality within processors. This keeps the treatment hardware-independent and allows for heterogeneous architectures.
R represents entanglement links. Since entanglement links connect only communication qubits, we introduce, for each processor, a set of those qubits only, i.e., C_i = C ∩ P_i. Therefore, we have R ⊆ ⋃_{i≠j} (C_i × C_j). Figure 2 shows an exemplary architecture, with three processors in P, six computation qubits in Q, six communication qubits in C, three entanglement links in R and ten local couplings in L.
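As a concrete illustration, the triple N = (V, P, E) and the derived sets C_i = C ∩ P_i can be encoded in a few lines. This is a minimal sketch in Python; the `Network` class, the qubit names and the toy couplings are our own illustrative choices, not taken from the paper's figures.

```python
from dataclasses import dataclass

# Minimal encoding of the network triple N = (V, P, E).
# Qubits are strings like "q1" (computation) or "c1" (communication);
# processors are sets of qubit names; edges are frozenset pairs.

@dataclass
class Network:
    computation: set    # Q
    communication: set  # C
    processors: list    # P, a partition of V = Q ∪ C
    local: set          # L, local couplings
    remote: set         # R, entanglement links

    def comm_qubits_of(self, i):
        """C_i = C ∩ P_i: the communication qubits of processor P_i."""
        return self.communication & self.processors[i]

# Toy three-processor instance (couplings are illustrative only).
net = Network(
    computation={"q1", "q2", "q3", "q4", "q5", "q6"},
    communication={"c1", "c2", "c3", "c4", "c5", "c6"},
    processors=[{"q1", "q2", "c1", "c2"},
                {"q3", "q4", "c3", "c4"},
                {"q5", "q6", "c5", "c6"}],
    local={frozenset(p) for p in [("q1", "q2"), ("q1", "c1"), ("q2", "c2"),
                                  ("q3", "c3"), ("q4", "c4"), ("q5", "c5")]},
    remote={frozenset(p) for p in [("c2", "c3"), ("c4", "c5"), ("c6", "c1")]},
)
print(sorted(net.comm_qubits_of(0)))  # ['c1', 'c2']
```

Note how the remote edges only touch communication qubits, matching the constraint R ⊆ ⋃_{i≠j} (C_i × C_j).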
As regards minimal assumptions, we only care about architectures actually able to perform any operation. This translates into a simple connectivity assumption.

III. OPERATORS
In the following, the gate model architecture of quantum computers is considered. There, a circuit describes a time-ordered quantum evolution as a sequence of quantum gates consisting of unitary operators. The set of available operators depends on the physical implementation.

A. Computation operators
In order to achieve universal quantum computing, one may rely on a universal set of quantum logic gates capable of approximating any possible unitary operator. In the following, we consider a representative universal set of quantum gates, without loss of generality. A sufficient and compact assumption for local universal quantum computing consists of the three-operator set {H, T, CX}, where H is the Hadamard operator and T is the π/4-phase shift. With a polynomial number of repetitions of H and T one can approximate any unitary operator with arbitrary precision [49], [50]. Other choices of universal sets are possible, such as those based on trapped ions in a cavity [51], suitable for quantum interfaces where the photonic state is transferred to the cavity mode, and then to the electronic state of the ion via laser pulses [52].
To sum up, the universal operator set for local quantum computing we consider is {H, T, CX}. Whenever an operator occurs with a subscript, we are giving information about the qubits it is operating on; e.g., CX_{u,v} is a CX operator with control qubit q_u and target qubit q_v.

B. Universal set
To extend universality to distributed architectures, we need at least one remote operator. Since the CX is the only operator involving more than one qubit, we just need to implement an operator equivalent to CX, but applying to distributed computation qubits. We call this operator RCX. Clearly, CX and RCX are logically equivalent, but with their different nomenclature we highlight their physical difference. Specifically, while CX represents a local gate, RCX represents a sequence of operations that involves distant qubits. Therefore, in general, implementations of CX and RCX come with different fidelity, latency and required resources.
Specifically to the RCX functioning, this is based on several fundamental steps, which we describe, in turn, by using operators. The first operator models the entanglement link creation; we refer to it as E or, more explicitly, as E_{w,r}. It sets qubits c_w and c_r to the maximally entangled state |Φ⁺⟩. The second operator models a measurement of a communication qubit c_w over the computational basis. Namely, the measurement outputs a classical binary variable b_w ∈ {0, 1}. We refer to it as M_w. Figure 3 shows a possible realization of a generic RCX_{u,v}. Here, there are two qubits q_u, c_w ∈ P_i and two qubits q_v, c_r ∈ P_j. Let us separate the protocol into three different steps. The first one is the creation of the entanglement link between c_w and c_r, i.e., applying E_{w,r}. After that, the second step is the pre-processing: a few local operations occur and then qubits c_w, c_r are measured, getting b_w and b_r respectively. The final step is the post-processing. The binary variables are used to assert whether further operations are required. Specifically, if b_r = 1, a Pauli Z operator applies to q_u and, if b_w = 1, a Pauli X operator applies to q_v. This phase can be compactly referred to with the Z^{b_r}_u, X^{b_w}_v operators. Notice that b_w is local to processor P_i and b_r is local to P_j, but P_i uses b_r and P_j uses b_w. In other words, a cross classical communication occurs between P_i and P_j.
Let us now look at some exemplary applications of RCX_{u,v} over the toy architecture of Figure 2.
Fig. 3: Protocol performing an RCX_{u,v}. From an operator point of view, this is equivalent to performing CX_{u,v}. However, u and v belong to different processors, which is why we use the notation RCX.
Fig. 4: Entanglement swap protocol. This scenario has three processors P_i, P_k, P_j. P_k has an entanglement link both with P_i and with P_j, created respectively by E_{u,v} and E_{w,r}. At the end of the protocol c_u and c_r are in the maximally entangled state |Φ⁺⟩. From an operator point of view, this is equivalent to performing E_{u,r}.
Example 2. Now assume one wants to run RCX_{1,3}. In this case we can still use the entanglement link between c_2 and c_4. However, qubit q_1 is not coupled with c_2. To use that link we need to swap the states stored in q_1 and q_2 before and after running CX.
What happens if one wants to run, say, RCX_{1,4}? In such a case, the qubits belong to two processors having no entanglement link coupling them. There is a really efficient protocol to overcome this problem: it is called entanglement swap and we describe it in the next section.

C. The entanglement swap
As pointed out before, it might be the case that one wants to run an RCX operator between a couple of qubits belonging to processors with no entanglement link. Formally, let P_i and P_j be such processors, with R ∩ (C_i × C_j) = ∅. In the basic scenario, there exists an intermediate processor P_k which has an entanglement link with both P_i and P_j, say via four communication qubits such that c_u ∈ C_i, c_v, c_w ∈ C_k and c_r ∈ C_j. As Figure 4 shows, we exploit P_k to entangle c_u and c_r.
The entanglement swap protocol can be generalized to an arbitrary sequence of intermediate processors. To this aim we introduce the concept of entanglement path.
1) The entanglement path: Coherently with the standard definition of a path in a graph, an entanglement path is a sequence of entanglement links connecting two processors. Formally, an entanglement path is a sequence {P_{i_1}, P_{i_2}, ..., P_{i_m}} of m processors such that, ∀j ∈ [m − 1], there is an entanglement link between P_{i_j} and P_{i_{j+1}}.
We can therefore entangle two communication qubits c_u ∈ P_{i_1} and c_r ∈ P_{i_m} by applying a generalization of the entanglement swap, shown in Appendix A, to {P_{i_1}, P_{i_2}, ..., P_{i_m}}.
Since at the end of the protocol c_u and c_r are in the entangled state |Φ⁺⟩, an entanglement path is a generalization of an entanglement link.
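Since an entanglement path is just a path in the graph whose nodes are processors and whose edges are (classes of) entanglement links, finding a shortest one reduces to a breadth-first search. The sketch below is illustrative; the `entanglement_path` helper and the adjacency encoding are our own assumptions.

```python
from collections import deque

def entanglement_path(links, start, goal):
    """BFS over processors.  `links` maps a processor index to the set of
    processors it shares at least one entanglement link with.  Returns the
    shortest sequence [P_i1, ..., P_im] of processor indices, or None when
    the two processors are not connected by entanglement links."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in links.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy topology: P0 - P1 - P2, no direct P0-P2 link (the RCX_{1,4} case).
links = {0: {1}, 1: {0, 2}, 2: {1}}
print(entanglement_path(links, 0, 2))  # [0, 1, 2]
```

On the returned sequence, the generalized entanglement swap of Appendix A would then be applied hop by hop.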
2) RCX with entanglement path: In our scenario, the purpose of applying the entanglement swap is to perform RCX. For this reason it is interesting to note that we can combine the entanglement swap protocol with the protocol for RCX. The result is shown in Figure 5. This result generalizes to every path, no matter the length; see Appendix A. We discuss in the next section the latency implications coming from this result.

IV. PROBLEM
Usually, in the literature dealing with compiler design [15], [16], [13], [9], a circuit is encoded as a set of layers. Formally, a layer is a set ℓ of independent operators, meaning that each operator in ℓ acts on a different collection of qubits. A circuit is an enumeration of layers L = {ℓ_1, ℓ_2, ..., ℓ_|L|}, where the cardinality |L| is also commonly referred to as the circuit depth.
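The layer encoding can be obtained greedily from a gate sequence: each gate is placed in the first layer after the last layer touching any of its qubits. A minimal sketch, where the `layers_of` helper and the gate encoding are our own illustrative choices:

```python
def layers_of(circuit):
    """Greedy left-alignment of a gate list into layers.  Each gate is a
    (name, qubits) pair; a gate lands in the first layer following the
    last layer that acted on any of its qubits, so every layer contains
    only independent operators."""
    last_layer = {}  # qubit -> index of the last layer acting on it
    layers = []
    for name, qubits in circuit:
        depth = max((last_layer.get(q, -1) for q in qubits), default=-1) + 1
        if depth == len(layers):
            layers.append([])
        layers[depth].append((name, qubits))
        for q in qubits:
            last_layer[q] = depth
    return layers

circ = [("H", (0,)), ("T", (1,)), ("CX", (0, 1)), ("CX", (2, 3)), ("H", (0,))]
for i, layer in enumerate(layers_of(circ), 1):
    print(i, layer)
```

The resulting |L| is the circuit depth; here CX on qubits (2, 3) joins the first layer because its qubits are untouched so far.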
Usually, a quantum programmer writes a logical circuit, abstracting from the real architecture and assuming that qubits are fully connected, i.e., any couple of qubits can perform a CX operation directly. Such an abstraction also holds when stepping to distributed architectures.
However, NISQ architectures do not provide full coupling. As a consequence, there must be a software interface, namely a compiler, able to map an abstract circuit to an equivalent one that meets the real coupling. In general, such a mapping implies overhead in terms of circuit depth. Therefore, finding a mapping with minimum depth overhead is an optimization problem. We refer to it as the quantum circuit compilation problem (QCC), which is proved to be NP-hard [10]. Its version on distributed architectures, which we refer to as the distributed quantum circuit compilation problem (DQCC), is likely to be at least as hard as QCC. In fact, while in QCC we deal with local connectivity restrictions, in DQCC local connectivity stands alongside remote connectivity, i.e., the entanglement links, which is less dense than the local one. Furthermore, performing a remote operation is much more time-consuming than a local operation. Just consider that a remote operation relies on communication of both quantum and classical information.
The above reasons make telegates the bottleneck in distributed computing. Therefore, they are worthy of dedicated analysis to minimize their impact.

A. Objective function
To optimize a circuit, the first thing we need to do is choose an objective function to rate the expected performance of a circuit. A common approach is to evaluate only those operators which are somehow a bottleneck to computation. For example, in fault-tolerant quantum computing [53], this is the T operator [54], [55]; in experiments on current technologies, this is the CX operator. One can count the total number of occurrences of the subject operator O, which is the O-count; alternatively, one can count the number of layers where at least one operator is O, which is the O-depth. To rate a compiled circuit on distributed architectures, we do something along the lines of this latter approach. Specifically, the main bottleneck is the RCX operator and each RCX implies one occurrence of E. Therefore, we will rate a circuit by its E-depth.
As a simple example of E-depth, consider an instance of the problem: a logical circuit where some RCX operators occur. Figure 6 shows an exemplary one. Let us consider the worst-case scenario, i.e., all four qubits map to different processors. Consequently, all the two-qubit operators are RCX. Without considering the tasks which RCX relies on, there is not much optimization to do and the E-depth is 5.
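Both metrics can be computed directly on the layer encoding. Below, `o_count` and `o_depth` are illustrative helpers of ours, and the layer list mirrors the worst-case reading of the Figure 6 example, where every two-qubit operator is an RCX (and hence implies one E).

```python
def o_count(layers, op):
    """O-count: total occurrences of operator `op` in the circuit."""
    return sum(1 for layer in layers for name, _ in layer if name == op)

def o_depth(layers, op):
    """O-depth: number of layers with at least one occurrence of `op`."""
    return sum(1 for layer in layers if any(name == op for name, _ in layer))

# Worst case of the text: five RCX, no two sharing a layer, so the
# RCX-depth (and hence the E-depth, one E per RCX) is 5.
layers = [[("RCX", (0, 1))],
          [("RCX", (2, 3)), ("H", (0,))],
          [("RCX", (0, 2))],
          [("RCX", (1, 3))],
          [("RCX", (0, 3))]]
print(o_count(layers, "RCX"), o_depth(layers, "RCX"))  # 5 5
```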

B. Modeling the time domain
It should be clear that E is of central interest in our treatment. In fact, we are also going to model time by scanning it as E occurs. Specifically, notice that link generations among different couples of qubits are independent. For this reason we assume that all the possible links are generated simultaneously and, as soon as all the states are measured, a new round of simultaneous generations begins. Clearly, after a measurement M generates a boolean b, there is at least one post-processing
(Footnote: Assigning logical qubits to physical ones is another critical step for compilation and deserves dedicated analysis [56], [57]; it is out of the scope of this work.)

Fig. 6: Exemplary logical circuit, assuming universal gate set {H, T, CX, RCX}. In the DQCC, part of the instance is a logical circuit like this one. Depending on the physical-logical qubit assignment, some of the two-qubit operators will be CX, others will be RCX.
operator that needs to wait for that boolean to arrive. Generally speaking, the longer the path, the more time b takes to reach its destination. We need to account for that with a proper model.
To this aim, we do some observations.
Remark. Consider a generic single-qubit unitary operator U. The time required to perform U^b is largely dominated by the travel time of b, whilst the actual time taken by U can be neglected. Furthermore, the travel of b is independent from the computation. Hence, we can compactly refer to the post-processing waiting time as ∆_{U^b}. A second observation is that the travel of b is also independent from entanglement link creations, which we assume to take time ∆_E. It is also logical to assume ∆_{U^b} ≤ ∆_E for the following reasoning: even if b needs to cover a longer distance than the one covered by E, b relies on classical technologies, which are way more efficient than entanglement generation and distribution protocols. For this reason, in our treatment we neglect ∆_{U^b}, since it happens in parallel with ∆_E.
Stemming from this, we can model the time domain as a discrete set of steps τ ∈ {1, 2, ..., d}, where d is an unknown time horizon, which is also the E-depth. At the beginning of each time step τ, the whole set of entanglement links is available for telegates. Notice that most of the local operators are expected to run during the creation of the links, because we relate them by the following inequality: ∆_O ≤ ∆_E, where, for a generic operator O, ∆_O is the time to run O. Therefore, since E is independent from local operators, we can always attempt to run these while E is running, and also while the classical bits b are traveling, as explained in Section III-C2.

C. Modeling the distributed architecture
In light of the above observations, it is reasonable and convenient to consider the whole processor as a network node, and define a function c that provides the number of available links between two processors. Formally, we re-formulate the network graph N = (V, P, E) introduced in subsection II-B into a more compact encoding, which highlights the bottlenecks of a distributed quantum architecture. Specifically, we consider a quotient graph of N. To define it, we make use of the equivalence class formalism (it will be useful also later on). Let ⋆ be an equivalence relation defined on the entanglement links in R as follows: e ⋆ e′ ⟺ ∃ i, j such that e, e′ ∈ C_i × C_j. The above statement characterizes the set of inter-partition edges, such that when two edges reach common processors, they belong to the same equivalence class, i.e., [e]_⋆ = {e′ ∈ R : e′ ⋆ e}.
(Footnote: The design of a distributed quantum architecture can easily adapt to satisfy requirements coming from assumptions on classical technologies, since these are very advanced.)
Fig. 7: Quotient graph derived from toy network of Figure 2. The processors become the nodes, the entanglement links between a couple of processors are "compressed" into one edge, with capacity equal to the number of original links.
Consequently, the edge set of the quotient graph will be R/⋆ = {[e]_⋆ : e ∈ R}. The quotient graph is Q = (P, R/⋆). We also associate, to the edges of Q, a capacity function c : R/⋆ → ℕ, defined as c([e]_⋆) = |[e]_⋆|. It tells us how many entanglement links are available between two processors.
Ultimately, we define the equivalence class on computation qubits induced by the partition P, i.e., [q_u]_P = P_i such that q_u ∈ P_i. This is useful to recognize the processor, i.e., the partition, which a generic qubit q_u belongs to, namely [q_u]_P.
In Figure 7 we show the quotient graph of the toy architecture of Figure 2.
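The construction of Q and of the capacity function c amounts to counting, for every entanglement link, the (unordered) pair of processors it connects; each equivalence class of ⋆ becomes one quotient edge whose capacity is the class size. The `quotient_graph` helper and the toy instance below are illustrative assumptions.

```python
from collections import Counter

def quotient_graph(processors, remote):
    """Collapse processors to nodes.  Every entanglement link is mapped
    to the pair of processor indices it connects; links between the same
    pair form one equivalence class [e]_*, returned as a single edge with
    capacity c([e]_*) = |[e]_*|."""
    proc_of = {q: i for i, P in enumerate(processors) for q in P}
    capacity = Counter()
    for link in remote:
        u, v = sorted(proc_of[q] for q in link)
        capacity[(u, v)] += 1  # one more link in the class of (u, v)
    return capacity            # edge set R/* with c as multiplicity

processors = [{"q1", "c1", "c2"}, {"q2", "c3", "c4"}, {"q3", "c5"}]
remote = [("c1", "c3"), ("c2", "c4"), ("c4", "c5")]
print(dict(quotient_graph(processors, remote)))  # {(0, 1): 2, (1, 2): 1}
```

Two parallel links between P_0 and P_1 collapse into one edge of capacity 2, exactly the "compression" of Figure 7.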

D. Single layer formulation
Consider a basic circuit expressed as the singleton L = {ℓ}. Assume that in ℓ there occur k RCX operators. From a logical perspective, all the k operators can run in parallel, by definition of layer. In other words, if the architecture connectivity had infinite capacity, i.e., c(e) = ∞, ∀e ∈ R/⋆, we could run L with E-depth 1, which is optimal. As the capacity values decrease, the optimal E-depth value grows, up to E-depth k in the worst case. For instance, consider a quotient graph like the one in Figure 8; it represents a generic 2-processor architecture. In such a simple case, there is not much margin for optimization. Namely, the E-depth depends on the capacity c(P_1, P_2) and the circuit. Take k operations occurring in the same layer. Whenever c(P_1, P_2) ≥ k, the E-depth is 1. As the capacity goes below k, the depth increases, up to k for c(P_1, P_2) = 1.
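In this single-layer, two-processor case the optimum admits a closed form consistent with the two extremes above: each time step serves at most c(P_1, P_2) operations, so the E-depth of k same-layer RCX is ⌈k/c⌉. A minimal sketch (the function name is ours):

```python
import math

def two_processor_e_depth(k, c):
    """E-depth needed to run k same-layer RCX over a single quotient
    edge of capacity c: each time step serves at most c operations, so
    ceil(k / c) steps are both sufficient and necessary in this
    single-layer, two-processor setting."""
    return math.ceil(k / c)

for c in (5, 3, 1):
    print(c, two_processor_e_depth(5, c))  # capacities 5, 3, 1 -> depths 1, 2, 5
```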
Let us now formulate an optimization problem for a single-layer, multi-processor architecture; we will introduce a generalization to any circuit in subsection IV-E. Specifically, the quickest multi-commodity flow [37] wraps this basic scenario.
In brief, the goal is to find a flow over time which satisfies the constraints imposed by a set of so-called commodities, which represent the RCX operations of a quantum circuit. The less time the flow takes, the better. To formalize this problem one can directly model an objective function that evaluates a flow by the time it takes. This is the approach employed in [58], but for a single commodity. Alternatively, the authors of [37] propose to start from a formulation of the multi-commodity flow problem over time, MCF_d, where d is a given time horizon, namely a maximal number of time steps

Fig. 8: Generic quotient graph Q = ({P_1, P_2}, {(P_1, P_2)}) for any two-processor architecture.
in which the flow is constrained. We prefer this latter way because dynamic flows like MCF_d have been deeply studied for a long time [38], [39]. Furthermore, this approach has a main drawback, explained at the end of this sub-section, but it does not apply to our scenario.
To formulate MCF_d, first, we enumerate the occurrences of RCX in L as a set of commodities [k] = {1, 2, ..., k}. A set of source-sink node couples is associated to the commodities. To do that, let P^C = {P^C_1, P^C_2, ..., P^C_k} and P^T = {P^T_1, P^T_2, ..., P^T_k} be two sets of processors such that P^C_i (P^T_i) is the processor where the control (target) qubit of operation i occurs.
The decision variables of the optimization problem are the time-dependent functions f_{e,i}(τ) ∈ {0, 1}, indicating the flow on edge e ∈ R/⋆ dedicated to operation i ∈ [k] at time τ. The function has a binary co-domain because an operation i uses at most one entanglement link in e.
Remark. When dealing with flows over time, usually a travel time is associated to each edge. Instead, we can assume null travel times, i.e., a flow leaving a source at time τ reaches the sink at the same time τ. The fact that we can model a time-dependent problem in this way is due to the quantum nature of the links, because there is a non-local correlation between linked processors. This is quite remarkable and may lead to new interest in a group of flow problems which are dynamic-static hybrids.
First, let us introduce the flow conservation constraint. Formally, ∀i ∈ [k], ∀τ ∈ [d] and ∀P_j ∈ P \ {P^C_i, P^T_i} the following holds:

∑_{e ∈ δ−(P_j)} f_{e,i}(τ) = ∑_{e ∈ δ+(P_j)} f_{e,i}(τ),

where δ−, δ+ : P → R/⋆ are the standard functions outputting the set of entering and exiting edges of the input node, respectively. Since a flow f_{e,i}(τ) = 1 identifies the usage of an entanglement link in e to perform i, we need to guarantee that the flow going through intermediate links of a path does not stop there. Conversely, whenever an end point of the path occurs in the control or target processor, i.e., P^C_i or P^T_i, the operation demand (or commodity demand) constraint holds instead of the conservation constraint. Namely, ∀i ∈ [k], this can be written as:

∑_{τ ∈ [d]} ∑_{e ∈ δ−(P^T_i)} f_{e,i}(τ) − ∑_{τ ∈ [d]} ∑_{e ∈ δ+(P^T_i)} f_{e,i}(τ) = 1,   (5)

∑_{τ ∈ [d]} ∑_{e ∈ δ+(P^C_i)} f_{e,i}(τ) − ∑_{τ ∈ [d]} ∑_{e ∈ δ−(P^C_i)} f_{e,i}(τ) = 1.

The above constraints explicitly request that a flow dedicated to i reaches its target P^T_i without exiting and, symmetrically, leaves its control processor P^C_i without returning. Notice that, combined with the null travel times, constraint (5) forces the operation demand to be satisfied within a single time step.
The last constraint ensures that, at any time step, the number of operations does not exceed the entanglement resources. Hence, ∀e ∈ R/⋆ and ∀τ ∈ [d], we introduce a capacity bound:

∑_{i ∈ [k]} f_{e,i}(τ) ≤ c(e).

Ultimately, the objective function is the total flow ∑_{e,i,τ} f_{e,i}(τ). By gathering the above equations, we obtain the Integer Linear Programming formulation (3), which models MCF_d. A flow f perfectly matches a set of entanglement paths used by the telegates.
Notice that solutions with cycles are in general feasible, but are senseless in our scenario. By expressing the problem as a minimization of f, a solver will avoid any cycle and will try to use as few entanglement links as possible.
Once a solver for MCF_d is defined, we just need to use it as proposed in [37]: the solver occurs as a sub-routine within a binary search for the minimum time horizon where a feasible solution exists. Since the search space is over time, the algorithm is, in general, pseudo-logarithmic. Specifically to our case, we already know that the worst solution is the one where all the operations run in sequence, i.e., E-depth equal to k. Therefore, the time horizon is upper-bounded by k and the binary search makes log k calls to the sub-routine. Algorithm 1 shows the steps.
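The outer loop of Algorithm 1 can be sketched as a plain binary search on d. Here the MCF_d solver is replaced by a generic monotone feasibility predicate, so the `feasible` oracle below is a stand-in for the ILP sub-routine, not the actual solver.

```python
def min_e_depth(k, feasible):
    """Binary search on the time horizon d.  `feasible(d)` stands in for
    "MCF_d has a feasible solution" and must be monotone in d.  Since the
    worst case runs the k telegates in sequence, d = k is always feasible
    and upper-bounds the search, giving O(log k) oracle calls."""
    lo, hi = 1, k
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid       # a horizon of mid suffices: search lower
        else:
            lo = mid + 1   # infeasible: the optimum is above mid
    return lo

# Toy stand-in oracle: 5 same-layer telegates over one edge of capacity 2,
# so d steps serve 2*d operations.
print(min_e_depth(5, lambda d: 2 * d >= 5))  # 3
```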
Unfortunately, standard MCF_d cannot capture all the features of DQCC. When considering any L = {ℓ_1, ℓ_2, ..., ℓ_|L|}, we need to consider that the operations in [k] are somehow related to each other by a logic determined by L. Hence, in the following sub-section we model such relations by introducing extra constraints.

E. Any layer formulation
As mentioned, the formulation we just gave is not enough to model the DQCC problem for any L = {ℓ_1, ℓ_2, ..., ℓ_|L|}, because a circuit generally follows a logic related to the order of occurrence given by L. Therefore, even if it might happen that two operations could run in any order, in general this is not true. One needs to define an order relation which is consistent with the logic of the circuit. From an optimization point of view, a critical matter is to choose an order relation which either wraps most of the good solutions or is prone to optimization algorithms. For this reason and for the sake of clarity, we here refer to a generic, irreflexive order relation ≺ defined over [k], without giving it a unique definition. Formally, for any i, j ∈ [k], i ≺ j means that, to run j, we need to ensure that i has already run.
Starting from ≺, we can define a constraint to add to formulation (3). Namely, ∀i ∈ [k], ∀e ∈ δ−(P^T_i), ∀τ ∈ [d] and ∀j ≺ i, the following holds:

f_{e,i}(τ) ≤ ∑_{τ′ < τ} ∑_{e′ ∈ δ−(P^T_j)} f_{e′,j}(τ′).   (9)

Notice that the right part of the inequality is a value in {0, 1}, so that f_{e,i}(τ) can be positive only if all the operations logically preceding i have already run.
The formulation now models DQCC. However, in the next sub-section, we refine inequality (9) to obtain a better solution space.

Fig. 9: RCX operations in logical conflict, as both i and j operate on the second qubit.

F. Quasi-parallelism
As before, from an optimization point of view, we are interested in considering as many good solutions as possible. To this aim, we propose an approach which should enlarge the space of good solutions. Specifically, we notice that even if two operations i, j ∈ [k] are such that i ≺ j, this does not necessarily mean that they must run at different time steps. They may indeed run at the same time step while still respecting the logic imposed by ≺.
Consider the example from Figure 9. Since operations i and j operate on a common qubit, they are in logical conflict. Hence, it is reasonable to think that i ≺ j should hold. However, when considering i and j in their extended form, i.e., where communication qubits are explicit, we notice that their logical conflict does not map over all the operations involved. As Figure 10 shows, the left part of the equivalence is a naive implementation of i followed by j, where the extended form completely inherits the logical conflict. Instead, the right part of the equivalence is way more efficient and it is still an implementation of the circuit of Figure 9. As a consequence, even if i and j are in logical conflict, they can run at the same time step. We refer to this property as quasi-parallelism. For this reason we introduce a new binary relation between operations in [k], which we denote with the intuitive symbol ∥. As before, we do not give here a unique definition of ∥. Specifically, for any i, j ∈ [k], we write i ∥ j to mean that operations i and j can run at the same time step, but we do not fix a criterion to establish when ∥ holds. Clearly, operations i, j ∈ [k] which can run in full parallelism are a special case of quasi-parallelism, and i ∥ j holds.

Fig. 10: Example of how to achieve quasi-parallelism for two RCX in logical conflict.
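A sketch of the two predicates: logical conflict is detected as a shared computation qubit (the Figure 9 situation), while quasi-parallelism additionally asks whether a manipulation in the spirit of Figure 10 applies. The `manipulable` flag stands in for the hardware and coherence-time check discussed in Section V, so both helpers are illustrative assumptions.

```python
def in_logical_conflict(op_i, op_j):
    """Two RCX, each given as a tuple of computation qubits, are in
    logical conflict when they act on a common qubit."""
    return bool(set(op_i) & set(op_j))

def quasi_parallel(op_i, op_j, manipulable=True):
    """Sketch of the quasi-parallelism predicate: non-conflicting
    operations are trivially (fully) parallel; conflicting ones may
    still share a time step when the circuit manipulation applies.
    `manipulable` is a placeholder for that feasibility check."""
    if not in_logical_conflict(op_i, op_j):
        return True
    return manipulable

print(quasi_parallel((0, 1), (1, 2)))         # conflict, but manipulable
print(quasi_parallel((0, 1), (1, 2), False))  # conflict, not manipulable
```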
We can now split constraint (9), discriminating between operations which can run in quasi-parallelism and those which cannot. Formally, ∀i ∈ [k], ∀e ∈ δ−(P^T_i) and ∀τ ∈ [d], we introduce two new constraints:

f_{e,i}(τ) ≤ ∑_{τ′ ≤ τ} ∑_{e′ ∈ δ−(P^T_j)} f_{e′,j}(τ′),   ∀j ≺ i with i ∥ j,

f_{e,i}(τ) ≤ ∑_{τ′ < τ} ∑_{e′ ∈ δ−(P^T_j)} f_{e′,j}(τ′),   ∀j ≺ i with i ∦ j.

To sum up, we propose (4) as the Integer Linear Programming formulation of the DQCC problem. In the next section we fix ≺ and ∥.

V. CHARACTERIZING THE BINARY RELATIONS
So far we modeled the problem without completely characterizing the relations ≺ and ∥. This leaves a lot of freedom in the way one can tackle the problem, because it makes it easier to explore different relations, which in turn would lead to different solution spaces.
For practical matters, it is appropriate to keep the definitions of ≺ and ∥ static, meaning that we do not want the relations to change while the solver is running.
Let us start from ≺. We want to make this relation coherent with the order of the layers. More formally, assume i, j ∈ [k] and let ℓn, ℓm ∈ L be such that i ∈ ℓn and j ∈ ℓm. The following holds: i ≺ j whenever n < m. With such a requirement, whenever j occurs after i in L, to run j at time τ we need to run i at a time τ′ ≤ τ when i ∥ j, and τ′ < τ when i ∦ j. We can now show how impactful our proposal can be.
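Under this characterization, ≺ can be precomputed once from the layer sequence. A toy sketch (identifiers are ours, not the paper's):

```python
def precedes(i, j, layer_of):
    """i ≺ j iff i's layer strictly precedes j's in L (the static
    characterization above). `layer_of` maps each remote operation
    to the index of its layer."""
    return layer_of[i] < layer_of[j]

layer_of = {0: 0, 1: 0, 2: 1}    # ops 0 and 1 share layer 0; op 2 comes later
print(precedes(0, 2, layer_of))  # True: layer 0 precedes layer 1
print(precedes(0, 1, layer_of))  # False: same layer, no precedence
```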

Remark. Consider the following scenario: each layer of L has at most one remote operation of [k], and all the operations in [k] are in logical conflict with one another. This is interesting from an analytical perspective since, if we ignored the optimization opportunity offered by quasi-parallelism 13, the feasible solution space collapses to a singleton. Namely, the only solution is a sequence where, at each time step, only one remote operation occurs, in accordance with ≺. The E-depth is then exactly k, which is the worst achievable. Instead, by introducing quasi-parallelism, as we did in formulation (4), the space of feasible solutions expands. Now consider the case where all the operations of [k] can run in quasi-parallelism - i.e. ∀i, j ∈ [k], i ∥ j. This means that, up to connectivity availability, a solver may find a solution of E-depth 1.
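The two extremes of the remark can be reproduced with a toy greedy scheduler (our construction, not the paper's solver): with no quasi-parallelism every remote operation gets its own time step (E-depth k), while with full quasi-parallelism everything fits in a single step.

```python
def e_depth(ops, can_share):
    """Greedy schedule: place each remote operation (taken in ≺ order)
    in the earliest time step all of whose members it can share a step
    with; otherwise open a new step. Returns the number of steps used
    (the E-depth). Connectivity limits are ignored in this toy model."""
    steps = []
    for op in ops:
        for step in steps:
            if all(can_share(op, other) for other in step):
                step.append(op)
                break
        else:
            steps.append([op])
    return len(steps)

ops = list(range(5))
print(e_depth(ops, lambda a, b: False))  # 5: worst case, one op per step
print(e_depth(ops, lambda a, b: True))   # 1: full quasi-parallelism
```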
Notice that the above reasoning is independent from how one characterizes ∥. Hence, the goal now is to model ∥ so as to capture as many solutions as possible, while keeping them feasible for the hardware. With this in mind, we propose the following criterion: given any i, j, the relation i ∥ j holds whenever i and j can run within a certain "small enough" time lapse. Specifically, the time lapse depends on the coherence time of communication qubits (encoded by C), which are assumed to be much more affected by noise than computing qubits (encoded by Q).
Notice that, when two operations i, j run in quasi-parallelism, the life-time of the employed communication qubits might grow. Therefore, we need to ensure that it does not exceed the coherence time of the entanglement. Formally, let ∆c be the coherence time of the entanglement - hence, it spans from the moment E ends up to the beginning of the measurements M.
A complication arises from the fact that ∥ is, in general, an intransitive relation. To understand why, consider the circuit in Figure 11. In such a scenario we are faced with multiple choices: 1) running all of i, j, k at different time steps; 2) running all of i, j, k at the same time step; 3) running i, j together and k afterwards; 4) running i only, followed by j, k together. Case 1) is not of interest, because it is the worst solution and no optimization applies. Case 2) is the best solution, but it is not necessarily feasible. In fact, for ∆c small enough, we are forced to split the operations, as in one of the cases 3) and 4). This explains the non-transitivity, since i ∥ j and j ∥ k, but i ∦ k.
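A toy numeric illustration of this intransitivity (the lifetime model and the constant are placeholders of ours, not the paper's):

```python
DELTA_C = 1.5   # assumed coherence budget, in arbitrary time units

def lifetime(i, j, pos):
    """Extra time a communication qubit stays alive when telegates i
    and j are merged into one time step (toy model: proportional to
    their distance in the layer sequence)."""
    return abs(pos[i] - pos[j])

def quasi_parallel(i, j, pos):
    """i ∥ j iff the merged life-time fits within the coherence budget."""
    return lifetime(i, j, pos) <= DELTA_C

pos = {"i": 0, "j": 1, "k": 2}
print(quasi_parallel("i", "j", pos))  # True
print(quasi_parallel("j", "k", pos))  # True
print(quasi_parallel("i", "k", pos))  # False: the relation is intransitive
```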
We still need to characterize ∥; hence, we introduce a predicate method which aims to bring RCXs closer to each other, so that quasi-parallelism is achievable.

A. Achieving quasi-parallelism, a recursive predicate
As said above, we are now going to introduce a method which verifies whether any two telegates can run in quasi-parallelism. Therefore, this method, say A(i, j, ∆c), is a predicate, which is true whenever the operations in input can run in quasi-parallelism. We can finally characterize ∥: for any i, j ∈ [k], i ∥ j holds if and only if A(i, j, ∆c) is true. A works in a recursive fashion, with three different scenarios as base cases. Base case (i): given two operations i, j, if they belong to the same layer, clearly they can run in full parallelism; therefore A(i, j, ∆c) is true.
Base case (ii): similarly to (i), if i, j belong to different layers and they are completely independent 14, A(i, j, ∆c) is true. The circuit of Figure 12 gives an example with i, j in contiguous layers.
Base case (iii): assume i, j contiguous - i.e. in contiguous layers - and both operating on at least one common qubit. We want to introduce, with this base case, the possibility that multiple operators may run simultaneously, as exemplified in Figure 10. For this reason, algorithm A considers all the operators involved to perform an RCX - recall the protocol from Figure 3. Namely, A pushes forward the post-processing of i - i.e. the Pauli operations Z_b and X_b - after the pre-processing of j - i.e. the CX operations. One can do that by using the following well-known transformation rules:
• CX(X_b ⊗ I) ≡ (X_b ⊗ X_b) CX
• CX(I ⊗ Z_b) ≡ (Z_b ⊗ Z_b) CX
• CX(Z_b ⊗ I) ≡ (Z_b ⊗ I) CX
• CX(I ⊗ X_b) ≡ (I ⊗ X_b) CX
After the application of these rules, some post-processing operation might have been propagated also to communication qubits. Specifically, it may happen that a measurement is preceded by an operation X_b. One can always reduce the depth of the circuit by sending b to the target(s) of the measurement. This is indeed what happens in our first example - Figure 10 - where, instead of performing X_{b1} on the communication qubit, we opt to put it in combination with X_{b3}, achieving a single operation X_{b1⊕b3} - see also Figure 13 for a circuit representation. At the end of the circuit manipulation, the life-time of the communication qubits may have risen. If it does not exceed ∆c, then A(i, j, ∆c) is true; otherwise, A(i, j, ∆c) is false.
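The standard Pauli-through-CX propagation rules used here can be checked numerically. The sketch below verifies, with plain integer matrices, that a Pauli applied before a CX equals the rewritten Pauli applied after it (matrix products read right-to-left):

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def kron(a, b):
    """Kronecker product of two square matrices."""
    m = len(b)
    return [[a[i // m][j // m] * b[i % m][j % m]
             for j in range(len(a) * m)] for i in range(len(a) * m)]

I = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
CX = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

# X on the control pushes through CX as X on both qubits:
assert matmul(CX, kron(X, I)) == matmul(kron(X, X), CX)
# Z on the target pushes through CX as Z on both qubits:
assert matmul(CX, kron(I, Z)) == matmul(kron(Z, Z), CX)
# Z on the control and X on the target commute with CX unchanged:
assert matmul(CX, kron(Z, I)) == matmul(kron(Z, I), CX)
assert matmul(CX, kron(I, X)) == matmul(kron(I, X), CX)
print("propagation rules verified")
```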
Recursion: consider now the case where i and j are separated by a sequence of local operations O1, . . ., On 15, assumed to be confined to the universal set {H, T, CX}. In this case, A applies, recursively, transformations for both i and j. Specifically, as long as possible, it pushes forward the post-processing of i by using the former rules together with:
• TZ_b ≡ Z_b T
• HX_b ≡ Z_b H
Ultimately, as long as possible, A pushes backward the pre-processing of j by using the following standard rules:
• CX(T ⊗ I) ≡ (T ⊗ I) CX
If A manages to make i's post-processing and j's pre-processing contiguous, the validity check reduces to the base-case scenario. Otherwise, A(i, j, ∆c) is false.

Fig. 14: An expansion, obtained by applying rules from A. In this example scenario, RCXs are interspersed with single-qubit local operators. Notice that the boolean variables travel simultaneously. Hence, the assumption we made in Section IV-B - i.e. ∆_{U_b} ≪ ∆_E - holds also for complex evaluations such as Z_{b1⊕b4} and X_{b6}Z_{b3}.
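Putting the base cases and the push rules together, the shape of A can be sketched as follows, with the circuit-manipulation machinery abstracted into two hypothetical callbacks (`conflict` and `lifetime_after_merge` are our stand-ins, not the paper's API):

```python
def A(i, j, delta_c, layer, conflict, lifetime_after_merge):
    """Predicate sketch: True iff telegates i and j can run in
    quasi-parallelism within the coherence budget delta_c.
    `conflict(i, j)` says whether the telegates share a qubit;
    `lifetime_after_merge(i, j)` returns the communication-qubit
    life-time after pushing i's post-processing past j's
    pre-processing with the rewrite rules."""
    if layer[i] == layer[j]:
        return True                  # base case (i): same layer
    if not conflict(i, j):
        return True                  # base case (ii): fully independent
    # base case (iii): merge the telegates and check the budget
    return lifetime_after_merge(i, j) <= delta_c

layer = {"i": 0, "j": 1}
in_conflict = lambda a, b: True      # toy: i and j share a qubit
merged_lifetime = lambda a, b: 0.8   # toy life-time after the rewrite
print(A("i", "j", 1.0, layer, in_conflict, merged_lifetime))  # True
print(A("i", "j", 0.5, layer, in_conflict, merged_lifetime))  # False
```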
So far, we defined A only for i, j without any other remote operation in between. Before generalizing the method to any i and j, we prove that our definition of A can be implemented in polynomial time. We need this requirement to keep things tractable.
Theorem 1.A is polynomial.
Proof. Assume there occur n local operations, say O1, . . ., On, between i and j. If A manages to push i forward past O1, it means that its post-processing runs after O1 and may only propagate vertically, over different qubits - by construction of the rule set. As a consequence, the depth of the circuit has not increased. Furthermore, the post-processing is still composed of Pauli operations of the kind Z_b and X_b. Hence, this holds for each O_m with 1 ≤ m ≤ n, and the recursion is upper-bounded by O(n).
Symmetrically, if A manages to push j backward past On, it means that the pre-processing can run before On. Also in this case, the depth has not increased and the pre-processing is still composed of two independent CX operations - again, by construction of the rule set. Hence, this holds for each O_m with 1 ≤ m ≤ n, and the recursion is upper-bounded by O(n).
We can now move on to the general case. Formally, between i and j a remote operation k may occur, which is also in logical conflict with both. For such a scenario, we just add a recursive rule. Namely, A(i, j, ∆c) holds iff there exists ε ∈ (0, 1) such that both A(i, k, ε · ∆c) and A(k, j, (1 − ε) · ∆c) hold. Take a moment to appreciate why this kind of recursion is feasible. Specifically, one might think that the validity of A(i, k, ε · ∆c) and A(k, j, (1 − ε) · ∆c) are not independent, because they both operate on k. However, in the former call, A evaluates the pre-processing of k, while, in the latter, it evaluates its post-processing. Therefore they can be evaluated independently.

Theorem 2. Generalized A is polynomial.
Proof. Assume there occur k1, . . ., km between i and j. For the purpose of the proof, let m be a power of 2. A(i, j, ∆c) can choose any of the k1, . . ., km operations for the recursion. To keep symmetry, let A(i, k_{m/2}, ε · ∆c) and A(k_{m/2}, j, (1 − ε) · ∆c) be the recursive calls. Notice that the operations considered by A(i, k_{m/2}, ε · ∆c) are m/2, as are the ones considered by A(k_{m/2}, j, (1 − ε) · ∆c). The result is a recursive binary tree of height log m and, therefore, O(m) calls to A. The leaves correspond to the base case of the recursion, which is proved to be tractable in Theorem 1.

Figure 14 shows an example scenario where we used rules as in A - in addition to the first one of Figure 10. Clearly, our modular architecture is open to modifications or extensions of A, should future research highlight more refined requirements.
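The O(m) bound of Theorem 2 can be sanity-checked by counting calls under the symmetric split (a toy recurrence of ours: T(m) = 2T(m/2) + 1, with the empty interval as leaf):

```python
def count_calls(m):
    """Calls made by the generalized A when m remote operations lie
    between i and j, always splitting at the middle one (m a power
    of two). The leaf is Theorem 1's base procedure."""
    if m == 0:
        return 1
    return 1 + 2 * count_calls(m // 2)

for m in (1, 2, 4, 8, 16):
    assert count_calls(m) <= 4 * m  # linear in m, as Theorem 2 claims
print([count_calls(m) for m in (1, 2, 4, 8)])  # [3, 7, 15, 31]
```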
Remark. Notice that we managed to define A to be independent of the connectivity of Q. This was possible thanks to the way we modeled telegates via efficient entanglement paths (Section III-C2). In other words, A(i, j, ∆c) works for any solver and regardless of the path this chooses to perform i and j. As a consequence, the characterization of A - and therefore also of ∥ - is static and depends only on the logical circuit and global factors, i.e. ∆c. Furthermore, we can relate coherence time and entanglement link creation as ∆E + ∆c ≈ ∆E. As a consequence, whatever ∆c is, A does not significantly affect the duration of each time step. This makes the E-depth a particularly good index for the running time of the overall computation.

A. Summary evaluation
In what follows we evaluate our model for DQCC. (i) By expressing the problem as a quickest flow problem, we could give a formulation corresponding to a multicommodity flow problem over fixed time. This approach fits our goals particularly well, because a quickest flow expresses the need to run a circuit as fast as possible, while a flow over fixed time brings a side interest in the minimization of resource usage, which is clearly a desideratum, but still secondary to the overall running-time.
(ii) Constraints (10) and (11) give the possibility to consider more interesting solutions. In fact, through efficient circuit manipulation - see predicate A - we managed to gather logically sequenced telegates within the same time step, achieving quasi-parallelism.
(iii) We built our model step by step, each step rigorously explained. The result is a highly modular work. For example, if one considers only circuits where all operations commute with each other, formulation (3) is enough and approximation bounds are available. Instead, when considering any circuit, one can easily shape the extra constraints of formulation (4). Consider, for example, the quasi-parallelism relation ∥: we characterized it as the predicate A. By just extending the way A works, the space of good solutions gets larger.
(iv) Since we modeled the problem as a network flow problem, one can also exploit the huge related literature for inspiration in tackling the problem. In the next subsection, we discuss this direction extensively.

B. Tackling the problem
Formulation (4) is a particular case of MCF_d, as it slightly departs from the standard formulation. As expected, the problem is still intractable. To see this, consider a simple scenario: an instance [k] with k = 2 such that 1 ∥ 2. We can restate the problem as follows: assert whether there exists a solution at the first time step. If not, just put operation 2 at the second time step. Unfortunately, asserting whether such a solution exists is NP-hard. Indeed, in [59], the authors proved the hardness of such a decision problem, even for single-capacity edges. Therefore, it is reasonable to look for approximations of DQCC.
To this aim, we think a good line of research would be to follow a common technique for tackling MCF_d: the time-expansion [38]. Namely, a re-definition of the instance graph, from Q to a new graph Q_d. Such a technique is useful because, instead of tackling MCF_d over Q, one can tackle its static version MCF over Q_d. Let us introduce it formally for our scenario.
A time-expansion of Q is a graph Q_d = (P_d, R_d/⋆). According to this criterion, an edge (P_i, P_j) ∈ R/⋆ taking discrete travel time θ would translate into directed edges (P_i(τ), P_j(τ + θ)), (P_j(τ), P_i(τ + θ)) ∈ R_d/⋆, with a shared constraint on the capacity. Nevertheless, edges in Q are assumed to have null travel time. Hence, a time-expansion of Q is particularly efficient, since one just needs to introduce a repetition of Q for each time step τ, which we refer to as Q(τ) = (P(τ), R(τ)/⋆). As a consequence, time-dependent sets P_C(τ) and P_T(τ) replace P_C and P_T. We keep using P_C and P_T as the nodes encoding the commodities, non-localized in time. For each i and τ, we introduce edges (P_i^C, P_i^C(τ)) and (P_i^T(τ), P_i^T), both with unit capacity. Since only integral flows are allowed and the demand is exactly 1, for any operation i, only one of the edges {(P_i^C, P_i^C(τ))}_τ - as well as only one in {(P_i^T(τ), P_i^T)}_τ - will have a non-zero flow.
Now that we have given a first intuitive way to encode the sources of the problem, let us optimize it. Notice that operation 1 can always run at time 1, and it is a waste of time and space to consider other options. As a consequence, for operation 1, we only introduce (P_1^C, P_1^C(1)) and (P_1^T(1), P_1^T). This extends to any operation i, which can always run at a time between 1 and min{i, d}, assuming that a solution exists with time horizon d. Therefore, for each operation i, we introduce the sets of edges {(P_i^C, P_i^C(τ)) : 1 ≤ τ ≤ min{i, d}} and {(P_i^T(τ), P_i^T) : 1 ≤ τ ≤ min{i, d}}. Figure 15 shows the final graph for instance [k] with k = 3, time horizon d = 2, on an architecture with 4 processors.
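The construction can be sketched as plain edge lists (node names such as `PC{i}`, `src`, and `sink` are our stand-ins for P_i^C and the commodity endpoints, not the paper's notation):

```python
def time_expand(edges, k, d):
    """Build a toy time-expansion Q_d as two edge lists. `edges` are
    the undirected processor links of Q (zero travel time), `k` the
    number of remote operations, `d` the time horizon. Source/sink
    hook-ups follow the pruning in the text: operation i may only
    start at a step tau with 1 <= tau <= min(i, d)."""
    expanded = []
    for tau in range(1, d + 1):                  # one copy Q(tau) per step
        for (u, v) in edges:
            expanded.append(((u, tau), (v, tau)))
    hooks = []
    for i in range(1, k + 1):
        for tau in range(1, min(i, d) + 1):      # pruned commodity edges
            hooks.append((("src", i), (f"PC{i}", tau)))
            hooks.append(((f"PT{i}", tau), ("sink", i)))
    return expanded, hooks

exp, hooks = time_expand([("P1", "P2"), ("P2", "P3")], k=3, d=2)
print(len(exp))    # 4: two links replicated over two time steps
print(len(hooks))  # 10: op 1 gets 2 edges, ops 2 and 3 get 4 each
```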
As said, the time-expansion Q_d is a common way to tackle MCF_d as a static flow problem, and it is particularly efficient in our scenario. Indeed, we already pointed out in Section IV-D how formulations (3) and (4) belong to a group of flow problems which are dynamic-static hybrids. For this reason we could model Q_d by simply introducing d repetitions of Q.
To the best of our knowledge, even if approximation algorithms for MCF [40], [41] and variants [42], [43], [44], [45] have been extensively studied, there seems to be no proposal relatable to ours, modeling DQCC. More formally, no efficient reduction seems possible from our problem to standard formulations, while approximation algorithms proposed in the literature usually rely on LP-relaxation, or on greedy criteria that do not fit with constraints (10) and (11). Hence, further studies along this line would be useful to (i) place the problem within its most proper complexity class and (ii) guarantee approximation ratios.

VII. CONCLUSION
We addressed the compilation problem on distributed architectures. In line with the literature, we assumed telegates as the fundamental inter-processor operations. To mitigate their impact on the computation, we modeled a minimization problem with the running-time as objective function. Even if the main interest is in minimizing the running-time, this highly depends on (i) resource usage and (ii) circuit manipulation. Hence, finding a rigorous model that efficiently considers all these aspects can be really tricky. To overcome this, we exploited the wide literature on dynamic flows. Specifically, as done for quickest multi-commodity flows, we embedded an ILP solver within a binary search over time. Hence, even if the primary goal is to find short-time solutions, this passes through a solver addressing (i) and (ii). More in detail, the objective function of the ILP formulation is the resource usage (i), while the constraints are based on circuit manipulation (ii). In other words, we embedded an evaluation of equivalent circuits through (automated) circuit manipulation. Specifically, the evaluation considers gathering telegates within the same time step. As expected, integrating circuit manipulation into the formulation improved the quality of the solution space, in terms of running-time. In fact, our proposal - see predicate A introduced in Section V-A - introduces many better solutions (in terms of running-time) than without it. To quantify this, we showed a group of circuits that, without A, are forced to run in the worst possible running-time - i.e. as many time steps as the number of remote operations - while A achieves the best possible solutions - i.e. E-depth 1.

A. Entanglement swap generalization
In this section we show how to efficiently implement an entanglement path. In Section III-C, we introduced the entanglement swap as a circuit of depth 5. We also claimed that such a depth is fixed when generalizing the entanglement swap to the entanglement path. To this aim, we give an inductive proof of such a statement, starting from the base case of an entanglement path of length 2.

Theorem 3. An entanglement path {P_{i1}, P_{i2}, . . ., P_{im}} has an implementation with depth 5.
Proof. Consider, as base case, that we want to create a path of length 2. Clearly, we could do that by just putting two entanglement swaps in strict sequence, as shown in Figure 16. The colored operators are the only ones we are going to optimize, since the others are independent and no optimization can be applied. The optimization is shown in Figure 17. Specifically, the circuit on the right of the equation has post-processing composed of Z_b on the first qubit and X_b on the last qubit. Furthermore, the measurements are now independent from the other operations.
By assuming that such a shape is preserved in the inductive step, we show that this transformation can be applied to any length - see Figure 18. This proves that we can always consider an entanglement path {P_{i1}, P_{i2}, . . ., P_{im}} to have circuit depth 5.
We just showed an efficient implementation for the entanglement path. Now we take one last step to exploit such a result and perform a generalized remote operation efficiently. Theorem 3 allows us to assume that, to perform a remote operation using a path of length m, the computing qubits interact only with two communication qubits and depend only on the Pauli operations Z_{b1⊕b3⊕···⊕b2m−1} and X_{b2⊕b4⊕···⊕b2m}. We can propagate such operations as in the equivalence of Figure 19. In this way the measurements are independent and the depth of the circuit has not increased.
Fig. 16: Naive implementation of a path with length 2 as two entanglement swaps in sequence.
Fig. 19: Final equivalence for the generalized remote operation.

Fig. 1: Manuscript overview. Blue blocks denote the steps in the problem modeling, scanned by blue arrows. Red blocks are the main ingredients to the entry blue blocks.

Fig. 13: Propagation of X_b. The first wire no longer needs the information of b1. The second wire needs the information given by b1 ⊕ b3. Notice that the measured b is not the same value in the two cases.