Lifetime-Based Optimization for Simulating Quantum Circuits on a New Sunway Supercomputer

High-performance classical simulation of quantum circuits, in particular the tensor network contraction (TNC) algorithm, has become an important tool for validating noisy quantum computing. To address memory limitations, the slicing technique is used to reduce tensor dimensions, but it can also introduce additional computational overhead that greatly slows down overall performance. This paper proposes novel lifetime-based methods to reduce the slicing overhead and improve computing efficiency, including: an interpretation method for analyzing slicing overhead, an in-place slicing strategy to find the smallest slicing set, and an adaptive tensor network contraction path refiner customized for the Sunway architecture. Experiments show that in most cases the slicing overhead with our in-place slicing strategy is lower than that of Cotengra, the most widely used contraction path optimization software at present. Finally, the resulting simulation time is reduced to 96.1 s for the Sycamore quantum processor RQC, with a sustained single-precision performance of 308.6 Pflops using over 41M cores to generate 1M correlated samples, which is more than a 5x performance improvement over the 60.4 Pflops of the 2021 Gordon Bell Prize work.


Introduction
Quantum computers have the potential to provide exponential speedups over their classical counterparts on specific tasks. The declaration of "quantum advantage" refers to tasks that can only be solved in reasonable time by quantum computers [1][2][3]. However, despite such advantages in computing capability, low fidelity is still the major challenge for quantum computers [4]. Classical simulators are therefore important for validating quantum computer designs [5]. Furthermore, scientists and researchers in areas that heavily rely on reliable computing resources, such as quantum algorithms, quantum programming languages, and quantum compilers, can work on classical simulators and obtain comparable performance [6]. Considering both the exponential complexity and the urgent demand, improving the efficiency of state-of-the-art classical simulators is important.
Directly storing an arbitrary quantum state on a classical computer requires an amount of memory exponential in the number of qubits, so simulating quantum circuits with the traditional state-vector method is currently limited to fewer than 50 qubits [7]. In recent years, the tensor network contraction (TNC) [8] algorithm has demonstrated promising potential, especially in simulating random quantum circuits (RQCs). When combined with the slicing technique, the TNC algorithm can efficiently simulate larger quantum circuits using only a limited amount of memory.
In the TNC algorithm, qubits and quantum gates are represented as tensors, and the whole quantum circuit is treated as a tensor network [9]. The problem of computing amplitudes of the output quantum state is transformed into contracting the corresponding tensor network. The performance of TNC is mainly determined by the contraction path. In a network with hundreds of tensors, a good contraction path can easily reduce the complexity by several orders of magnitude compared to a bad one. Meanwhile, huge intermediate tensors can appear during the contraction process. The slicing technique can reduce the memory requirement [10][11], at the price of some computational overhead. This work aims to reduce the computational overhead caused by slicing.
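As a toy illustration of how much the contraction path matters, consider a three-tensor chain contracted with NumPy in two different orders (a minimal sketch; the dimension sizes are made up):

```python
import numpy as np

# Chain A[i,j] - B[j,k] - C[k,l]; contracting in different orders gives the
# same result but very different intermediate sizes and FLOP counts.
di, dj, dk, dl = 2, 64, 64, 64
rng = np.random.default_rng(0)
A = rng.standard_normal((di, dj))
B = rng.standard_normal((dj, dk))
C = rng.standard_normal((dk, dl))

cost_ab_first = di * dj * dk + di * dk * dl   # (A@B)@C: small intermediate
cost_bc_first = dj * dk * dl + di * dj * dl   # A@(B@C): huge intermediate

assert np.allclose((A @ B) @ C, A @ (B @ C))  # same tensor either way
print(cost_ab_first, cost_bc_first)           # 16384 vs 270336, ~16x apart
```

Even on three tensors the gap is over an order of magnitude; on networks with hundreds of tensors the gap between paths grows to many orders of magnitude.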
Unlike previous efforts that predominantly use heuristic algorithms, this work introduces a new concept, lifetime, to provide better interpretability for slicing optimization in tensor network contraction. Lifetime describes how an edge affects the TNC process by analyzing all the tensors and contractions it is involved in. Using lifetime, one can transform the computation into an equivalent form with a much smaller memory requirement. Slicing works between every two adjacent, manually controllable levels of a multi-level storage system. For the Sunway architecture (one of the most advanced supercomputing systems in China), both the data exchange from hard disk to main memory and from main memory to local data memory (LDM), i.e., the process level and the thread level, can be optimized by slicing. At the process level, we slice tensors for distributed storage and parallelization, and apply a lifetime-based slice finder to reduce the overhead of process division. At the thread level, slicing helps design a fused algorithm that reduces memory access and, in some cases, transforms memory-intensive kernels into computationally intensive ones, thus raising the optimization ceiling.

Major contributions of this work include:
• A concept, the lifetime of graph edges (tensor dimensions), is proposed to explain the influence of each dimension, how slicing eliminates this influence, where the slicing overhead originates, and how to avoid it.
• Guided by lifetime, a dynamic slicing method with a new target function is proposed, which helps find a smaller slicing set with less slicing overhead.
• Indicated by lifetime, a high-performance algorithm for TNC that fits well with the Sunway architecture is presented.

Background
2.1 Tensor Network Contraction and Slicing

2.1.1 Notation. A tensor network can be treated as an undirected graph, where tensors and dimensions are denoted by vertices and edges, respectively. We define notation similar to that of cotengra [12]: we denote a graph by G = (V, E), where V is the vertex set and E is the edge set. The incident set s_e of an edge e ∈ E contains the two vertices the edge connects. The map w : E → N denotes the edge weight, i.e., the size of each dimension. In classical simulation in particular, w(e) is a power of 2 for all e ∈ E. The incidence set s_v of a vertex v is defined as s_v = {e : e ∈ E, v ∈ s_e}. Vertex contraction merges two vertices into one by removing their shared edges and keeping the rest. In the generated graph G', the contracted vertices v_0 and v_1 are replaced by a new vertex v_2, and the edges between v_0 and v_1 are removed from the new edge set. The uncontracted edges in s_{v_0} and s_{v_1} are inherited by s_{v_2}.
If we perform pairwise contractions sequentially until only one vertex is left, the order forms a contraction path. In fact, an equivalence class can be made by reordering independent steps, and all equivalent paths can be uniquely described by a tree structure. A contraction tree T is defined as T = (V_T, E_T), with each edge denoted by a vertex of the original graph or an intermediate vertex generated by contraction. In particular, the edges connecting to leaf nodes are denoted by vertices in V. Nodes, expressed as triplets t = (v_1, v_2, v_3), refer to contractions. A leaf node such as (1, v, v) can be treated as a multiplication with the scalar 1, so leaf nodes represent the reading process. A specific contraction path determines the direction of each contraction. By adding the final contraction as the root, as shown in Fig. 1, a rooted binary tree is acquired.
For a given contraction tree, the time and space complexity of the corresponding contraction process can be quantitatively evaluated. The space cost is the size of the biggest intermediate tensor, i.e.,

max_{v ∈ V_T} ∏_{e ∈ s_v} w(e),

and the time complexity is

C = Σ_{t = (v_1, v_2, v_3) ∈ T} ∏_{e ∈ s_{v_1} ∪ s_{v_2}} w(e).

To minimize the memory and time cost, the two expressions above are usually treated as the target functions.

Figure 1. Tensor network and its contraction tree. When a certain contraction path is formed, an additional contraction for the final scalar should be added.
For large tensor networks, memory limitation is always a serious issue, so after a path is found, it is necessary to decide whether slicing is needed to reduce the dimensions of tensors. Slicing a tensor along a dimension of size d (an edge in the TN) transforms a rank-k tensor into d different rank-(k-1) sub-tensors. Slicing temporarily removes the correlation between these sub-tensors and provides parallelism. Fig. 2 shows a simple example of slicing. In this work, slicing is performed on the original G, in which w(e) = 2 for all e ∈ E. As a result, s sliced edges make up 2^s independent tasks, and the results are accumulated after their individual calculations. The total time complexity may increase after slicing under some circumstances; the slicing overhead is then defined as

overhead(S) = C_S / C,

where S denotes the set of sliced edges, C is the time complexity before slicing, and C_S is the total time complexity summed over all 2^{|S|} subtasks. Slicing overhead comes from redundant calculations due to the splitting of tasks. From a higher perspective, slicing provides a distributed storage strategy and a natural stream-like design, since large tensors are sliced to be stored and calculated in multiple processes. Compared to traditional methods based on heavy communication, slicing shows its advantage through independent subtasks. However, as the price, the slicing overhead determines the effect, so one of the primary targets of this work is to explain and reduce such overhead. In some quantum advantage circuits, there are tensors with more than 60 dimensions [12]. They can occupy up to 1000 PB, which exceeds most storage systems. Slicing helps reduce the memory demand to the level of TBs or even GBs to make simulation feasible.
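The accumulation over subtasks can be checked directly. A minimal NumPy sketch of slicing one shared edge of size 2 (the shapes are illustrative):

```python
import numpy as np

# Contract T[i,j] = sum_e A[e,i] * B[e,j], unsliced vs. sliced over edge e.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 8))
B = rng.standard_normal((2, 8))

full = np.einsum('ei,ej->ij', A, B)
# Slicing edge e yields 2^1 independent subtasks whose results are summed.
sliced = sum(np.outer(A[s], B[s]) for s in range(2))
assert np.allclose(full, sliced)
```

Each subtask here touches only half of A and B, which is exactly how slicing trades memory for (potentially redundant) computation on larger networks.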
Cotengra [12] integrates several anytime methods and claims over a 10000x speedup compared to the estimation made by Google [1]. It introduced a series of effective heuristic algorithms to search for contraction paths, such as community detection [13] and graph partitioning [14]. After the pre-processing implemented in [15], rank-1 and rank-2 tensors are absorbed, and the tensor network is largely simplified. A greedy slicing strategy is built into cotengra: it repeatedly chooses the dimension that leads to the smallest overhead to slice, until the memory demand is satisfied. Like most greedy methods, this slicing strategy can get stuck in local minima. In our work, cotengra is applied to find a proper contraction path.
Alibaba proposed a simulator based on an observed structure called the "stem" [16]. A stem is a computationally intensive region in the contraction tree, including most of the high-rank tensors. The inexpensive parts are called branches. Within a single stem, a bigger tensor sequentially absorbs smaller ones, and about 99% of the computation cost happens during these contractions. For slicing, with a similar greedy strategy, they perform local tuning of the contraction tree between two steps of slice picking. This dynamic design greatly reduces the inherent slicing overhead of a contraction tree. However, if the condition for local tuning is not satisfied, this strategy may not be able to find an optimal slicing set. This work achieves a complexity of 10^18.8, a slicing overhead of 4, and a 14.7% FLOPS efficiency. Due to its high efficiency, the dynamic design has been adopted or extended by many other software packages, such as cotengra [12].
As for the new Sunway system, several efforts study high-performance algorithms for TNC. SW_Qsim [17] provides a series of methods for matrix multiplication and tensor permutation, especially for narrow tensors. The work that won the 2021 ACM Gordon Bell Prize [6] implemented the Transpose-Transpose-GEMM-Transpose (TTGT) algorithm on the Sunway architecture, a fused design of tensor permutation and matrix multiplication intended to reduce memory access. These efforts significantly improve FLOPS efficiency, but all concentrate on one step of the whole path instead of taking a holistic view; as a bandwidth-constrained problem, such optimization soon reaches the ceiling predicted by the Roofline model [18]. For extremely deep circuits, other tensor-based methods such as the matrix product state have also been implemented on the Sunway architecture [19]. In this work, we focus on hardware-motivated, highly entangled circuits, which have a clear 2D geometry and are relatively shallow.

Sunway 26010Pro Processor
As a heavily time- and memory-consuming task, performing the actual contraction generally requires support from sophisticated supercomputers. A new Sunway supercomputer is selected for this work.
The new-generation Sunway supercomputer has a similar architecture to Sunway TaihuLight [20], with the major computing capability provided by core groups (CGs). A heterogeneous many-core processor, the SW26010pro, is designed for this supercomputer. Fig. 3 shows the structure of the chip. Each processor chip has 6 CGs. The computing processing elements (CPEs) of each CG are arranged in an 8-by-8 grid.
Each CG contains 16 GB of main memory, and each CPE contains a 256 KB local data memory (LDM). Direct memory access (DMA) with a bandwidth of 51.2 GB/s is provided between the LDM and main memory. Because the processor's arithmetic capability far outstrips this bandwidth, memory access often becomes the vital bottleneck for optimization. Remote memory access (RMA) with a peak bandwidth of more than 800 GB/s is designed for data exchange between CPEs within one CG.

Definition
Here is a simple example showing where the slicing overhead comes from. Fig. 4 shows a contraction process on a 4 x 2 tensor network, with the target rank limited to 3. The ideal case is that, after slicing s edges, the time complexity of every contraction in one subtask becomes exactly 2^{-s} of the original, so that the total complexity is unchanged. However, as the right part shows, the first contraction and the last two contractions in a subtask are unchanged after slicing edge e, so they are calculated twice. These redundant calculations make up the overhead. In fact, the sliced edge e determines which contractions will be redundantly performed. Considering the contractions before slicing, edge e is involved in four contractions, which are exactly those inside the dotted box in Fig. 4 and will not be redundantly computed.
Based on the analysis above, this work proposes a new concept, lifetime, to describe the scope of influence of a sliced dimension. The notation comes from Section 2.1.1.

Definition 1. Given a tensor network G = (V, E) and a contraction tree T = (V_T, E_T), the lifetime of an edge e ∈ E is the set of tensors L(e) = {v ∈ V_T : e ∈ s_v}, i.e., all tensors in the contraction tree that carry index e.

With the definition of lifetime, the scope of influence of a sliced edge is clearly represented. We can conclude that, after slicing an edge e from a tensor network, the size of the tensors in the lifetime of e is halved, while the size of the others does not change. The time complexity of the contractions corresponding to these tensors is unchanged, while the time complexity of the other contractions is doubled.
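Definition 1 is straightforward to realize in code. A small sketch on a hypothetical contraction tree (node names and index labels are made up for illustration):

```python
# Each tree node (leaf or intermediate tensor) carries a set of edge indices.
tree_nodes = {
    "A":   {"a", "b"},
    "B":   {"b", "c"},
    "AB":  {"a", "c"},    # A and B contracted over b
    "C":   {"c", "d"},
    "ABC": {"a", "d"},    # AB and C contracted over c
}

def lifetime(e, nodes):
    """All tensors in the contraction tree that carry index e."""
    return {name for name, idx in nodes.items() if e in idx}

assert lifetime("b", tree_nodes) == {"A", "B"}        # ends at contraction A*B
assert lifetime("c", tree_nodes) == {"B", "AB", "C"}  # spans two contractions
```

Slicing "c" would halve B, AB, and C while doubling the work of contractions that do not touch them, matching the statement above.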
Lifetime need not be defined only on the whole contraction tree: it can be defined recursively on each sub-tree or intermediately generated tree during the contraction process, by simply replacing the contraction tree in the definition. The advantage is that we can then focus on the region with the most intensive computation.

Lifetime and Slicing Overhead
Dozens of edges need to be sliced to reach an acceptable size in large tensor networks from big circuits like Sycamore [1], which makes multi-edge slicing a critical problem for complexity analysis. Since the lifetimes of different edges are independent, we can analyze the lifetime of each sliced edge sequentially. Fig. 5 provides an example of slicing two edges with indices a and b. From left to right, the contraction tree is divided into 5 parts; the lifetime of index a goes through parts 2 and 3, while that of b goes through parts 3 and 4. Index a and index b double the time complexity of parts 1, 4, 5 and parts 1, 2, 5, respectively, making the overall multiple of redundant calculation shown in Fig. 5 a piecewise product. Generalizing, if a tensor contains k of the s sliced indices in total, it is divided into 2^k smaller tensors, and across all 2^s subtasks there are 2^{s-k} redundant copies of its data, so the corresponding time complexity is multiplied by 2^{s-k}. As a result, the density of lifetimes plays a decisive role in the memory and computation complexity of a region.
More sliced edges tend to lead to higher overhead. In the simple case where a slicing set is given and we pick additional edges to slice, the overhead grows unless the lifetimes of the added edges span the whole contraction tree, which rarely happens. However, for arbitrary slicing sets with s edges and s + 1 edges, can we find a partial ordering relation between their overheads? We can prove the theorem below for a given target size. It interprets redundant slicing more profoundly and lays the theoretical cornerstone for our slicing strategy: a valid smaller slicing set indicates a lower slicing overhead under most conditions. Fig. 6 compares the whole time complexity before and after slicing on a common contraction tree of the Sycamore circuit [1]. From the figure, we can conclude that the key to low overhead is keeping the time complexity of the main computation-intensive part unchanged, which means the big tensors should be contained in the lifetimes of as many sliced edges as possible. Ideally, a region's time complexity and its slicing-induced multiplier are negatively correlated.

Route of Slicing Optimization
Because of the memory bound, we have to slice and accept some overhead. However, on a multi-level storage system, under some circumstances, the computation overhead can be translated into data movement.
For TNC, we introduce two main methods to deal with the overhead in the following sections: reducing it by searching for better slicing sets, or replacing it with data movement via stacking. Stacking is the inverse operation of slicing, and is feasible when the capacity of the lower storage level or the distributed memory is enough for the whole memory requirement. For example, we can store a rank-53 tensor on the hard disk, fetch a rank-30 slice for a contraction step by I/O, and then put it back after the calculation. This process naturally performs slicing on the get and stacking on the put. If we stack immediately after the lifetime of a sliced edge ends, the overhead of this edge is eliminated and the memory demand is resolved. Nevertheless, there are vast costs of data exchange during data access (for low-level storage) or communication (for distributed memory). Searching for a better slicing set avoids memory access and communication, but introduces overhead from redundant calculation. The choice depends on the specific values of bandwidth and overhead: with low bandwidth and low overhead, slicing optimization performs better, while stacking is more suitable with high bandwidth and high overhead.
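The trade-off above can be phrased as a simple time comparison (an illustrative model with made-up numbers, not the paper's actual decision procedure):

```python
# Prefer stacking when moving the data costs less time than redoing the
# redundant computation; prefer slicing otherwise.
def prefer_stacking(moved_bytes, bandwidth_bps, extra_flops, flops_rate):
    move_time = moved_bytes / bandwidth_bps      # cost of stack/unstack I/O
    redo_time = extra_flops / flops_rate         # cost of slicing overhead
    return move_time < redo_time

# Disk-like level: low bandwidth, modest overhead -> keep slicing.
assert not prefer_stacking(1e12, 1e9, 1e12, 1e15)
# LDM-like level: high bandwidth, large overhead -> stack instead.
assert prefer_stacking(1e9, 5e10, 1e15, 1e15)
```

The two assertions mirror the two regimes named in the text: hard-disk I/O favors slicing optimization, while the fast LDM path favors stacking.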
Fig. 7 shows a typical overhead distribution for different target sizes. Data access costs are translated into equivalent overhead via the arithmetic intensities (AI) of the different levels. The line of equal overhead and the storage capacities divide the graph into several parts. The most common manually controllable storage levels include the hard disk, the main memory, and the LDM (for the Sunway architecture). For memory access, BW_disk << BW_memory << BW_LDM. Hence, moving from hard disk toward LDM, stacking becomes increasingly attractive, while in the opposite direction slicing optimization should be applied.
These methods are not customized for a particular architecture; all we need is a multi-level storage system.

Overview of Slicing Optimization
The low bandwidth of I/O restricts stacking, and the slicing overhead directly influences the parallel scalability. As shown in Theorem 1, the optimization of slicing can be divided into two steps: searching for a smaller slicing set, and tuning the selected slicing set to achieve a lower overhead. The following two subsections discuss these two steps separately. Minimizing the overhead while fitting the memory bound is a constrained optimization problem. For a contraction tree T = (V_T, E_T), the target function can be described by the total time complexity after slicing.
C_S = Σ_{t ∈ T} 2^{|S \ s_t|} ∏_{e ∈ s_t} w(e),

where S is the set of sliced edges and s_t is the set of edges involved in contraction t in the TN. The target size, which must not exceed the memory capacity, is denoted as 2^m; then for every v ∈ V_T, |S ∩ s_v| ≥ |s_v| - m. The target function and this constraint together determine the optimization problem.
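A direct sketch of this objective and constraint (edge weights fixed to 2, as in the paper; the toy tree below is made up):

```python
# contractions: the edge set s_t of each contraction step t;
# tensors: the index set s_v of each tensor v in the tree.
def total_sliced_cost(contractions, S):
    # each step's original cost 2^{|s_t|} is multiplied by 2^{|S \ s_t|}
    return sum(2 ** (len(s_t) + len(S - s_t)) for s_t in contractions)

def fits_memory(tensors, S, m):
    # every tensor may keep at most m unsliced indices: |S ∩ s_v| >= |s_v| - m
    return all(len(s_v - S) <= m for s_v in tensors)

contractions = [{"a", "b"}, {"a", "c", "d"}]
tensors = [{"a", "b"}, {"a", "c", "d"}, {"c", "d"}]
assert total_sliced_cost(contractions, set()) == 2**2 + 2**3
assert total_sliced_cost(contractions, {"a"}) == 2**2 + 2**3   # 'a' spans both steps
assert total_sliced_cost(contractions, {"b"}) == 2**2 + 2**4   # step 2 doubled
assert fits_memory(tensors, {"a"}, 2) and not fits_memory(tensors, set(), 2)
```

Note how slicing "a", whose lifetime covers both contractions, adds no overhead, while slicing "b" doubles the step it does not touch.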

Slicing Strategy
Here we define lifetime on the stem [16], though our definition of stem differs somewhat: in this work, the stem is the most computationally intensive path on the contraction tree. By graph theory, on a path, each contraction step has a data dependence on its next step. Since most branches have nothing to do with the memory constraints, they are pre-contracted, and only the stem remains for optimization. After this preconditioning, the stem becomes a new contraction tree, and we can detect the lifetime of every edge on it.
Assume the optimal size of the slicing set is n, and the stem is separated into two continuous parts: P = P_1 + P_2, with the optimal slicing size of P_1 being n_1. For P_1, a number of candidate sets share the same size n_1. According to the discussion above, the more tensors of P_2 the lifetimes of the P_1 slices pass through, the fewer edges need to be sliced in the updated P_2.
Strictly speaking, a longer lifetime does not necessarily lead to a lower slicing overhead; however, the containment relationship between lifetimes does, and the same holds for the effect of reducing memory. Although the above strategy does not always provide the slicing set with the lowest overhead, sometimes even worse than the one found by cotengra, it satisfies all the prerequisites of the theorem above: it finds a slicing set as small as possible for a given contraction tree. Combined with the dynamic process for path searching, the overhead may drop further. The refiner will then find a better slicing set to reduce the overhead.

Slice Refiner based on Simulated Annealing
Our slice finder cannot guarantee an optimal slicing set; it only looks for one as small as possible. So the main task of this section is to find the best slicing set given a certain number of sliced edges. Since the size of the slicing set is fixed, the transformation from one set to another can be described as edge replacement.
After slicing, some tensors have dimensions exactly equal to the target dimension; we call them "critical tensors". If the lifetime of a sliced edge contains none of the critical tensors, that edge does not contribute to memory reduction and can be removed from the slicing set. From another angle, to replace a sliced edge with index a, a direct way is to find another unsliced index b whose lifetime contains all the critical tensors in the lifetime of a.
Simply replacing indices greedily sometimes leads to a local minimum. For instance, suppose the optimal set contains indices a and b while the one we found contains a and c; replacing c with b may temporarily increase the overhead or fail to meet the memory bound. Exhaustive search or dynamic programming would find the optimal solution but suffers from a combinatorial explosion of the search space. A stochastic algorithm thus becomes the choice to balance search time and quality. Simulated annealing tolerates temporary complexity increases to escape local minima and provides practical speed, as Algorithm 2 shows.
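A minimal simulated-annealing sketch of such a refiner (a simplified stand-in for Algorithm 2, which is not reproduced here; `cost` is any caller-supplied overhead estimator, and the neighborhood move is single-edge replacement):

```python
import math
import random

def refine(S0, candidates, cost, steps=2000, t0=1.0):
    """Refine a fixed-size slicing set by random single-edge replacement."""
    S = set(S0)
    best, best_cost = set(S), cost(S)
    for i in range(steps):
        temp = t0 * (1 - i / steps) + 1e-9          # linear cooling schedule
        out_e = random.choice(sorted(S))
        in_e = random.choice(sorted(candidates - S))
        S_new = (S - {out_e}) | {in_e}
        delta = cost(S_new) - cost(S)
        # accept improvements always, worsenings with Boltzmann probability
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            S = S_new
            if cost(S) < best_cost:
                best, best_cost = set(S), cost(S)
    return best

random.seed(0)
# Toy overhead model: pretend each edge label is its overhead contribution.
found = refine({8, 9}, set(range(10)), cost=sum)
assert sum(found) <= sum({8, 9})   # never worse than the initial set
```

Because the best set seen so far is tracked separately, the refiner never returns a set worse than its input, while the acceptance rule lets it climb out of local minima.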

Difficulty for Thread Level Optimization
At the thread level, organizing the whole path on the CPEs would incur huge overhead given the 256 KB LDM. As a result, previous works optimize the thread level step by step [17][6] to avoid it, trading vast redundant calculation for memory access. However, frequent memory access also severely limits FLOPS efficiency.
Contraction between tensors is implemented by matrix multiplication [17]. With this strategy, more than 70% of the peak performance can be obtained for square-like matrices. However, for narrow matrix multiplication, in particular when two of M, N, K are less than 16, which dominates many simulation tasks [1], we can easily deduce Θ(MNK) ≈ Θ(MN + NK + MK), so GEMM changes from a compute-bound problem into a bandwidth-bound one. Another constraint comes from the 256 KB LDM on each CPE, which can only hold a rank-13 tensor for multiplication. Then, calculating a contraction with a rank-31 tensor needs at least 2^12 rounds of low-efficiency GEMM for a CG.
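The deduction can be made concrete with the arithmetic-intensity ratio (flops over elements touched; the shapes below are illustrative):

```python
# AI of an (M,N,K) GEMM: M*N*K multiply-adds over the M*N + N*K + M*K
# elements read and written (constant factors omitted).
def gemm_ai(m, n, k):
    return (m * n * k) / (m * n + n * k + m * k)

narrow = gemm_ai(4096, 4, 4)     # two dims < 16: roughly 2 flops/element
square = gemm_ai(512, 512, 512)  # square: roughly 170 flops/element
assert narrow < 16 and square > 100
```

For the narrow shape, the flop count is of the same order as the data volume, which is exactly the Θ(MNK) ≈ Θ(MN + NK + MK) regime where GEMM becomes bandwidth-bound.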
To take full advantage of the 512-bit single-instruction multiple-data (SIMD) units, a 4 x 4 complex GEMM kernel is applied. When M and N are both less than 4 in one thread, the required data length causes a conflict between the high-performance kernel and a 2D data distribution. In contrast, a 1D distribution avoids most padding and slices the GEMM process into simply parallelized subtasks. Most importantly, the implicit slicing embedded in the data distribution, as a method to overcome the memory limitation of LDM, faces a similar problem to slicing at the process level.
We propose a more efficient slice-stack strategy guided by lifetime, which heavily reduces the cost of data movement. Specific to the thread level, it is named secondary slicing (the first slicing happens at the process level). Secondary slicing between the LDM and the main memory provides a fused operator that organizes parts of the contraction path on the CPEs, greatly saving memory access cost.

Slice-stack design
If we slice on the whole contraction path, a subtask can be performed on a process without intermediate communication or I/O, at the cost of computation overhead. In contrast, if we accept the communication or memory access, the computation overhead can be avoided.
Such a strategy does not work at the process level due to the low bandwidth of I/O and the low computation overhead there, but it is viable between the LDM and the main memory. Since the LDM on the new Sunway supercomputer can only hold a rank-13 tensor, performing the contraction step by step requires a DMA-get and a DMA-put between every two steps. Given the arithmetic intensity above, memory access would dominate. To reduce this frequent data exchange between the LDMs and the main memory, we perform n steps of the contraction path successively in one computation kernel: at the beginning of those steps, we do one DMA-get to fetch the bigger tensor, and after the n steps of calculation, the result is written back by one DMA-put. Thus, n - 1 DMA-gets and DMA-puts are eliminated.
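The saving is simple to quantify (a sketch; real kernels also overlap DMA with computation):

```python
# Step-by-step contraction does a DMA-get and a DMA-put around every step;
# the fused kernel does one get at the start and one put at the end.
def dma_transfers(n_steps, fused):
    return 2 if fused else 2 * n_steps

n = 10
saved = dma_transfers(n, fused=False) - dma_transfers(n, fused=True)
assert saved == 2 * (n - 1)   # n-1 get/put pairs eliminated
```

With the average of about 10 fused steps reported later, this removes roughly 90% of the main-memory round trips on the fused sub-path.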
As the top part of Fig. 8 shows, each time the CPEs can obtain a lower-rank tensor from the one in the main memory by dropping some dimensions. In a particular contraction, typically some dimensions are absorbed and the others are held. Based on this observation, at the beginning of the computation, we choose all the dimensions that will be absorbed in the following n steps to form a tensor, and fetch it into the LDMs by strided DMA; then each CPE can do its calculation independently, as the bottom part of Fig. 8 shows. As for the memory bound, we tune the parameter n to determine the rank of the formed tensor.
In this strategy, the chosen indices remain and the other indices are sliced. Different CPEs run the different subtasks generated by slicing, which are embarrassingly parallel. The main question is how to choose the sliced indices and find a proper n. The fundamental prerequisite for the slicing set is that the sliced indices are not contracted within the n steps, and fortunately, this condition is exactly captured by the definition of lifetime. So, from a particular start position on the path, we slice the indices with the longest lifetimes and traverse the stem until the lifetime of any one of the sliced indices ends. If the size of the slicing set plus 13 (the capacity of the LDM) is no less than the rank of each tensor on the sub-path, the fused calculation is feasible. With this strategy there is no slicing overhead, and the last DMA-put plays the role of stacking with no extra communication.
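The index selection can be sketched as follows (a simplified model in which each fused step is described only by the set of indices it absorbs; names are made up):

```python
def secondary_sliced_indices(start_indices, absorbed_per_step):
    """Indices alive through all fused steps, i.e. never absorbed: these are
    safe to slice across CPEs because their lifetimes span the sub-path."""
    sliced = set(start_indices)
    for absorbed in absorbed_per_step:
        sliced -= set(absorbed)   # an absorbed index's lifetime ends here
    return sliced

# A rank-5 start tensor; three fused steps absorb b, then c, then d.
start = {"a", "b", "c", "d", "e"}
steps = [{"b"}, {"c"}, {"d"}]
assert secondary_sliced_indices(start, steps) == {"a", "e"}
```

Here "a" and "e" survive all three steps, so distributing them over CPEs costs no redundant computation; the feasibility check is then that every tensor's rank minus the number of its sliced indices fits in the rank-13 LDM budget.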

Permutation Map Reduction by Recursion
Formula. The in-situ computed map and the pre-calculated map are two primary methods for tensor permutation. For a size-N tensor of rank r (about log2 N here), the in-situ computed map needs O(r) time per element and O(1) space. In contrast, the pre-calculated map costs O(N) time for the first computation and O(1) per access afterwards, with O(N) space. In our fused algorithm, the permutations before every contraction step become one of the hot spots. Even using the short type, the pre-calculated maps are too big to store, while for the in-situ strategy, the tensors' ranks are about 10, which leads to more than 10 times the cost.
A feasible solution combines the advantages of both strategies. For a specific contraction P(A) P(B) = C, where P denotes permutation, a typical index order of a rank-9 A after permutation is 0, 1, 2, 4, 5, 7, 8, 3, 6: the indices to be absorbed are organized at the end, i.e., 3 and 6. For B, on the contrary, such indices are placed at the beginning, e.g., 3, 8, 0, 1, 2, 4, 5, 6, 7. For A, obviously, the first 3 dimensions do not participate in the permutation, so a 1/8-size map is enough. For B, the last 4 dimensions stay unchanged during the permutation, so the adjacency of the elements shaped by dimensions 4, 5, 6, 7 is held, and the size of the map is reduced to 1/16. With an offset, the full map can be calculated in O(1) as map[i*d + j] = map[i]*d + j for j < d, the same as a pre-calculated map. The space complexity is reduced to O(N/2^m), where m is the number of continuous indices at the end or beginning.

Figure 9. An example of secondary slicing. There are 5 edges whose lifetimes go across the whole sub-path. After slicing these edges, the maximum memory demand is reduced from 2^8 to 2^3, and 2^5 subtasks are generated for thread-level parallelism. Only the computation inside the dotted box is performed in each core, and memory access happens at the beginning and the end of the computation.
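The map reduction can be verified numerically. A minimal NumPy sketch for the case where the trailing axes are untouched by the permutation (the shape and permutation are illustrative):

```python
import numpy as np

def full_map(shape, perm):
    """Pre-calculated permutation map: new linear index -> old linear index."""
    idx = np.arange(int(np.prod(shape))).reshape(shape)
    return np.transpose(idx, perm).ravel()

shape = (2, 2, 2, 4)       # last axis (size d=4) keeps its position
perm = (1, 0, 2, 3)
d = shape[-1]
small = full_map(shape[:3], perm[:3])                # 1/d-size map
# Offset rule: map[i*d + j] = small[i]*d + j for j < d, computed in O(1).
rebuilt = (small[:, None] * d + np.arange(d)).ravel()
assert np.array_equal(rebuilt, full_map(shape, perm))
```

Only the map over the leading block needs to be stored; the untouched trailing block is recovered by the constant-time offset rule, which is the O(N/2^m) saving described above.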

Eliminating Discrete Memory Access by Communication

Though the fused strategy heavily reduces DMA, the efficiency of DMA itself becomes an obstacle. Since we have sliced some edges from the original tensors, the sub-tensors we need are often discretely distributed in the main memory. If we want to get a sub-tensor without the last index (assuming it has been sliced), there is a memory-address interval between every two elements we need. As a result, the granularity of DMA becomes extremely small, which harms the bandwidth. In practice, the sliced edges are scattered at different positions. Under this circumstance, the bandwidth of DMA achieves less than 0.1% of the peak, resulting in negative optimization.
To improve the bandwidth, the cooperation of the 64 CPEs becomes essential. Under our parallel framework, the last 6 sliced indices are distributed among the 64 CPEs. Data exchange is much faster between CPEs via remote memory access (RMA), whose peak performance reaches 800 GB/s for a whole CG. We therefore treat the sub-tensors of the 64 CPEs as a whole and let every CPE access the memory contiguously. This strategy guarantees a basic granularity of 512 B for DMA, which provides more than 50% of the peak performance. After the memory access, RMA is applied to rearrange the data between CPEs, and an additional permutation is applied to improve the data layout for the subsequent computation.

Result
Related efforts (e.g., Cotengra [12], Alibaba [16], and [6]) have all simulated the Sycamore RQC [1]. For an appropriate comparison, unless otherwise specified, the contraction trees used in this work also come from the Sycamore tensor network.
6.1 Slicing Overhead and Scaling Result

Figure 10. Slicing size and overhead compared with cotengra. The red points show the number of extra sliced edges found by cotengra compared with ours; the green points show the ratio of overhead. When the difference is 0 and the ratio is 100%, the two methods perform equally.
According to the slicing strategy in Section 4, given a contraction path, we can find a slicing set with lower expected overhead. Fig. 10 shows the comparison between our work and cotengra. We find 400 contraction paths with cotengra, then apply our method and cotengra respectively to search for slicing sets. Compared with cotengra, the slicing sets we find are generally smaller and lead to lower overhead. On a per-path basis, our strategy performs better in more than 98% of cases. As for the several exceptions, an observed phenomenon is that the computationally intensive regions of those paths are dispersed; also, since SA is a stochastic method, it cannot guarantee an optimal solution.
The best overhead result we found is as low as 1.05, and we comprehensively choose a path with low complexity for the thread-level optimization.
Due to slicing, processes are highly independent of each other, and only perform an allReduce once at the end of the program. The strong-scaling and weak-scaling results are shown in Fig. 11. Fig. 12 shows that our work significantly improves the computing efficiency. Using 1024 nodes, a perfect sample or 1M correlated samples can be generated in 10098.5 s. Considering the scaling results, we project that the whole time cost can be reduced to 96.1 s using 107520 nodes (41,932,800 cores). The sustainable single-precision performance is projected as 308.6 Pflops.
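The 96.1 s figure follows from assuming near-linear scaling from the measured 1024-node run, which a few lines of Python can check:

```python
# Projection under an assumed linear scaling: the measured time on 1024
# nodes is extrapolated to 107520 nodes (41,932,800 cores).
time_1024_nodes = 10098.5           # seconds, measured
nodes_large = 107520
projected = time_1024_nodes * 1024 / nodes_large
assert abs(projected - 96.1) < 0.2  # consistent with the reported 96.1 s
```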
According to Fig. 12, the time of memory access is largely reduced by secondary slicing, while the time of permutation and GEMM remains similar. This result verifies our prediction that secondary slicing can reduce memory access at the price of some Python-based pre-conditioning, which needs only 1 core and negligible time. Another conclusion is that, after secondary slicing, the computation kernel is transformed from a memory-intensive one into a computation-intensive one in some cases. According to the Roofline model [18] and the arithmetic intensity of the Sunway architecture, the number of floating-point operations should exceed 42.3 times the number of bytes of memory access to achieve peak performance. The average number of fused steps in our strategy is about 10: the floating-point work of a fused sub-path grows with its length while its main-memory traffic stays roughly constant, so in some cases the arithmetic intensity of TNC can break through 42.3. Fig. 13 shows the Roofline model for a particular case at the thread level. Due to permutation, which consists of LDM accesses, there is still a gap between our performance and the peak performance. However, compared with the original arithmetic intensities of 2.6 (mixed precision) and 1.22 (single precision), the optimization limit is largely improved.
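A rough model (all sizes hypothetical, square GEMMs, permutation cost ignored) shows why fusing about 10 steps can push the arithmetic intensity past the 42.3 flops/byte ridge point:

```python
# Simplified intensity model. For a single square single-precision GEMM of
# dimension d: flops = 2*d**3, main-memory traffic = 3 matrices * 4 bytes.
def gemm_intensity(d):
    return (2 * d**3) / (3 * 4 * d**2)

# Fusing a length-k sub-path on the LDM removes the main-memory traffic of
# the k-1 intermediates: the flops of all k steps are paid, but only the
# first input and the final output travel through DMA.
def fused_intensity(d, k):
    flops = k * 2 * d**3
    traffic = 2 * 4 * d**2          # one load + one store
    return flops / traffic

d = 64
assert gemm_intensity(d) < 42.3         # a single step stays memory-bound
assert fused_intensity(d, 10) > 42.3    # a fused chain can cross the ridge
```

This is only a back-of-the-envelope sketch; the real kernels also pay LDM permutation costs, which is why the measured points in Fig. 13 sit below the roofline.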

Implications
In this paper, we presented novel lifetime-based methods to reduce the slicing overhead and improve the computing efficiency through parallel optimization at both the process level and the thread level. Building on the definition of lifetime, a series of lemmas and theorems deepens our understanding of a tensor network and its contraction trees. From them we derive an interpretable method to deal with slicing overhead and an in-place slicing strategy to find the smallest slicing set, so that the slicing overhead is reduced. Our in-place strategy can work independently, and can also optimize the slicing sets found by dynamic methods. As a result, it becomes a general post-process when searching for slicing sets.
Furthermore, based on lifetime, we adopted stacking to avoid the heavy overhead of redundant calculation, and generalized it for arbitrary multi-level storage systems. Slicing optimization and stacking make up the two main strategies for TNC. We then designed a discriminant based on arithmetic intensity to choose between the strategies at different hierarchies of memory. For the Sunway architecture, we applied it at the thread level, at the necessary price of additional but affordable memory access by DMA and faster communication among the CGs by RMA. The most inspiring result is that, in some cases, the computation kernels are changed from memory-intensive into computation-intensive, which reshapes our understanding of the problem.
As a widely applied approach, tensor networks can be found in many research fields such as statistical physics [21], data science [22], and sociology [23]. A series of problems in physics can be reduced to TNC. For large TNC, the extreme time and space complexity becomes the major difficulty. By translating the huge memory demand into a data stream that can be processed as a pipeline, slicing plays a significant role in these TNC processes. Beyond TNC, if we need to deal with a series of large matrix multiplications that have data dependences, slicing can still work for task distribution.
This work proves that lifetime is essentially promising for the tensor networks of quantum circuits. Moreover, with its simple definition, lifetime can be easily generalized to tensor networks with different features and helps analyze both the time and space complexities of the contraction. It is foreseeable that lifetime can play important roles in more fields.

Figure 2. An intuitive example of slicing. The second half expresses the same contraction as a graph, in parallel.

2.1.2 Related Work. The demand for slicing originates from the considerable memory cost of high-rank tensors. In some quantum advantage circuits, there are tensors with more than 60 dimensions [12]. They can occupy up to 1000 PB, which exceeds most storage systems. Slicing helps reduce the memory demand to the level of TBs or even GBs to make simulation feasible. Cotengra [12] integrates several anytime methods and claims over 10000× speedup compared to the estimation made by Google [1]. It introduced a series of effective heuristic algorithms to search for contraction paths, such as community detection [13] and graph partitioning [14]. After the pre-process implemented in [15], rank-1 and rank-2 tensors are absorbed, and the tensor network is largely simplified. A greedy-based slicing strategy is built into Cotengra: it repeatedly chooses the dimension that leads to the smallest overhead to slice, until the memory demand is satisfied. Like most greedy methods, this slicing strategy can be trapped in a local minimum. In our work, Cotengra is applied to find a proper contraction path. Alibaba proposed a simulator based on an observed structure called the 'stem' [16]. A stem is a computationally intensive region in the contraction tree, including most of the high-rank tensors; the inexpensive parts are called branches. Within a single stem, a bigger tensor sequentially absorbs smaller ones, and about 99% of the computation cost happens during these contractions. For slicing, with a similar greedy strategy, they perform local tuning of the contraction tree between two steps of slice picking. This dynamic design highly reduces the inherent slicing overhead of a contraction tree. However, if the condition for local tuning is not satisfied, this strategy may not be able to find an optimal slicing set. This work achieves a complexity of 10^18.8, a slicing overhead of 4, and a 14.7% FLOPS efficiency. Due to the high efficiency,
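The basic slicing identity, fixing a shared index and summing the resulting independent subtasks, can be sketched in a few lines of NumPy:

```python
import numpy as np

# Minimal illustration of slicing: fixing one shared index e of dimension 2
# turns one contraction into 2 independent subtasks whose results are summed.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))     # the shared index e is the size-2 axis
B = rng.standard_normal((2, 5))

full = A @ B                                                 # unsliced contraction
sliced = sum(np.outer(A[:, e], B[e, :]) for e in range(2))   # sum over subtasks
assert np.allclose(full, sliced)
```

Each subtask touches only a slice of each operand, which is what reduces the peak memory demand; the price, discussed below, is that work not involving the sliced index is repeated per subtask.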

Figure 3. SW26010pro Processor. The main memories of the 6 CGs are united into a 16 GB × 6 space in this work, in order to hold large tensors.

...slicing edges, and the original complexity is kept. However, as the right part shows, the first contraction and the last 2 contractions in a subtask are unchanged after the edge is sliced, so they will be calculated twice. These redundant calculations make up the overhead.

Figure 4. An example showing the origin of slicing overhead. The left part is a tensor network represented as a graph, and the arrow gives the contraction order. The right part shows the complexity before and after slicing. After slicing, the total complexity is the summation over all subtasks.
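A toy flop count (numbers hypothetical) makes the origin of the overhead concrete: contractions untouched by the sliced edge are repeated in every subtask, so the total exceeds the original complexity.

```python
# Toy flop accounting for slicing overhead. Per-contraction costs along a
# path; only contractions whose operands carry the sliced edge shrink (here
# by 2x), while the rest are repeated unchanged in both subtasks.
costs = [100, 800, 800, 100, 100]        # hypothetical flops per contraction
touches_sliced_edge = [False, True, True, False, False]

original = sum(costs)
per_subtask = sum(c // 2 if hit else c
                  for c, hit in zip(costs, touches_sliced_edge))
total_after = 2 * per_subtask            # 2 subtasks for a dimension-2 edge
overhead = total_after / original        # > 1: untouched steps run twice
assert overhead > 1.0
```

In this toy path the overhead is about 1.16; choosing edges whose lifetimes cover the expensive contractions keeps the repeated portion, and hence the overhead, small.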

Figure 5. The superposition rule of overhead. A contraction order is shown from left to right before (the upper half) and after (the lower half) slicing. Each branch denotes a tensor, and the contractions happen at the intersection points. Lines with different colors in the tensors denote the edges, and the collection of all lines with the same color is naturally the lifetime of that edge.

Theorem 1. For a tensor network G(V, E), a contraction tree (T_V, T_E) and an n-edge slicing set S_1, if an (n-1)-edge slicing set S_2 is found, and the intersection of S_1 and S_2 is not empty, then there must exist an (n-1)-edge slicing set S_3 whose overhead is less than that of S_1.

Figure 6. Time complexity and its multiple after slicing on the stem (Sycamore (m = 20) is used as an example).

Figure 7. Overhead distribution for different storage levels by Cotengra (Sycamore (m = 20) as an example, with an original memory cost of dozens of PBs; for the Sunway architecture, there is a 96 GB main memory and a 256 KB LDM for each CPE).

Minimizing the overhead while fitting the memory bound describes an optimization problem with constraints. For a contraction tree T = (T_V, T_E), the target function can be described by the total time complexity after slicing.
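A minimal sketch of a stochastic search for this constrained problem, in the spirit of the SA mentioned in the results (the functions `memory` and `overhead` below are made-up stand-ins, not the paper's actual cost model):

```python
import math
import random

# SA-style search over slicing sets: minimize overhead subject to a memory
# bound. Each sliced edge (dimension 2) halves memory; the per-edge overhead
# factors here are purely illustrative.
random.seed(0)
EDGES = list(range(20))
MEM0, MEM_BOUND = 2.0 ** 30, 2.0 ** 24

def memory(sl):            # memory after slicing |sl| dimension-2 edges
    return MEM0 / 2 ** len(sl)

def overhead(sl):          # hypothetical multiplicative overhead model
    return math.prod(1.0 + 0.01 * (e % 5) for e in sl)

def anneal(steps=2000, t0=1.0):
    cur = set(EDGES[:10])                      # a feasible starting set
    best = set(cur)
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-9
        cand = set(cur)
        cand.symmetric_difference_update({random.choice(EDGES)})  # flip an edge
        if memory(cand) > MEM_BOUND:           # reject infeasible moves
            continue
        delta = overhead(cand) - overhead(cur)
        if delta < 0 or random.random() < math.exp(-delta / t):
            cur = cand
        if overhead(cur) < overhead(best):
            best = cur
    return best

best = anneal()
assert memory(best) <= MEM_BOUND
assert overhead(best) <= overhead(set(EDGES[:10]))
```

As the results section notes, such a stochastic search cannot guarantee optimality, which motivates the in-place post-processing strategy.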

Figure 8. A fused design for thread-level optimization. The top part shows the step-by-step strategy in previous works, and the bottom is the fused design. Our strategy organizes a sub-path on the LDM, and saves k - 1 DMA-gets and DMA-puts for a length-k sub-path. The superscript p denotes permutation, and the subscript of a tensor denotes the sliced dimensions.
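The DMA saving described in the caption can be counted directly (a trivial sketch, assuming one get and one put per step in the step-by-step design):

```python
# Counting DMA transactions for a length-k sub-path. The step-by-step
# design gets the input and puts the result for every contraction; the
# fused design does one get at the start and one put at the end, keeping
# intermediates in the 256 KB LDM.
def dma_ops_step_by_step(k):
    return 2 * k            # one get + one put per contraction

def dma_ops_fused(k):
    return 2                # a single get and a single put

k = 10
saved = dma_ops_step_by_step(k) - dma_ops_fused(k)
assert saved == 2 * (k - 1)  # k-1 gets and k-1 puts are eliminated
```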

Figure 9. An example of secondary slicing. There are 5 edges whose lifetimes go across the whole path. After slicing these edges, the maximum memory demand is reduced from 2^8 to 2^3, and 2^5 subtasks are generated for thread-level parallelism. Only the computation inside the dotted box is performed in each core, and memory access happens at the beginning and the end of the computation. ...the adjacency of the 4 elements shaped by 4, 5, 6, 7 will be held, and the size of the map is reduced to 1/16. With an offset, the corresponding map can be calculated by

PPoPP '23, February 25-March 1, 2023, Montreal, QC, Canada. Chen, et al.

For other architectures, we can implement this strategy by simply partitioning the communication groups.

Figure 12. Optimization by secondary slicing at the thread level. The optimization is done on a single node with 390 cores, and the performance is tested for tasks of different sizes on a contraction path. The step-by-step strategy is used for comparison.


Figure 13. Roofline model of our work. For different cases, the arithmetic intensities vary from 10 to 40. In some cases, the problems become computation-bound.
...contraction tree, there is only one extension direction, so its length decides the containment relationship. We therefore start the search from the two ends of the stem, and each time choose one tensor as T_1 and the remaining part as T_2. Therefore, an edge with a longer lifetime is better. If we perform this process iteratively, a smaller slicing set is expected. We propose Algorithm 1 for the details.
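A simplified sketch of this lifetime-driven selection (the data model and helper names below are hypothetical; the actual Algorithm 1 operates on a contraction tree rather than a flat occurrence list):

```python
# Lifetime of an edge: the span of contraction steps (along the stem) in
# which it appears. Edges with longer lifetimes cover more of the
# computationally intensive region, so slicing them causes less recomputation.
def lifetime(edge_steps):
    return max(edge_steps) - min(edge_steps) + 1

# Hypothetical map: edge name -> contraction steps where it occurs.
edge_occurrences = {
    "a": [0, 9],          # spans the whole stem: the best candidate
    "b": [3, 4],
    "c": [0, 5],
    "d": [2, 9],
}

def pick_slicing_set(edges, n):
    # Iteratively take the edge with the longest lifetime.
    ranked = sorted(edges, key=lambda e: lifetime(edges[e]), reverse=True)
    return ranked[:n]

assert pick_slicing_set(edge_occurrences, 1) == ["a"]
```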