Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters

This paper presents a unified communication optimization framework for sparse triangular solve (SpTRSV) algorithms on CPU and GPU clusters. The framework builds upon a 3D communication-avoiding (CA) layout of $P_{x}\times P_{y}\times P_{z}$ processes that divides a sparse matrix into $P_{z}$ submatrices, each handled by a $P_{x}\times P_{y}$ 2D grid with block-cyclic distribution. We propose three communication optimization strategies. First, a new 3D SpTRSV algorithm is developed that trades inter-grid communication and synchronization for replicated computation. This design requires only one inter-grid synchronization, and the inter-grid communication is efficiently implemented with sparse allreduce operations. Second, broadcast and reduction communication trees are used to reduce the message latency of the intra-grid 2D communication on CPU clusters. Finally, we leverage GPU-initiated one-sided communication to implement the communication trees on GPU clusters. With these nested inter- and intra-grid communication optimization strategies, the proposed 3D SpTRSV algorithm attains up to 3.45x speedups over the baseline 3D SpTRSV algorithm using up to 2048 Cori Haswell CPU cores. In addition, the proposed GPU 3D SpTRSV algorithm achieves up to 6.5x speedups over the proposed CPU 3D SpTRSV algorithm with $P_{z}$ up to 64. Remarkably, the proposed GPU 3D SpTRSV scales to 256 GPUs on the Perlmutter system, whereas the existing 2D SpTRSV algorithm scales only up to 4 GPUs.


INTRODUCTION
Sparse triangular solves (SpTRSV) are important computational kernels in many direct sparse linear solvers and preconditioners for a wide range of scientific and engineering applications. Taking the LU factorization of a sparse matrix $A = LU$ as an example, the solution vector $x$ of $Ax = b$ given a right-hand side (RHS) vector $b$ can be computed with one lower-triangular and one upper-triangular SpTRSV operation. Although SpTRSV typically has many fewer arithmetic operations than LU factorization, it can become a computational bottleneck for linear systems with many RHSs or for preconditioned iterative solvers requiring repeated applications of SpTRSV. The low arithmetic intensity and sequential nature of SpTRSV pose significant challenges for its efficient implementation on modern shared-memory and distributed-memory computing architectures. Shared-memory implementations [2,14,28,30,31,42,45,46] such as multi-core CPU and GPU implementations rely on level-set, color-set or blocking methods to exploit available parallelism from the directed acyclic graph (DAG) representation of the sparse LU factors.
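The two-stage substitution just described can be illustrated with a small dense example; the helper functions below are illustrative sketches, not the paper's implementation.

```python
import numpy as np

def lower_solve(L, b):
    """Forward substitution: solve L y = b for lower-triangular L."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def upper_solve(U, y):
    """Backward substitution: solve U x = y for upper-triangular U."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

# A = L U, so A x = b is solved by L y = b followed by U x = y.
rng = np.random.default_rng(0)
n = 6
L = np.tril(rng.random((n, n))) + n * np.eye(n)   # well-conditioned lower factor
U = np.triu(rng.random((n, n))) + n * np.eye(n)   # well-conditioned upper factor
A = L @ U
b = rng.random(n)
x = upper_solve(U, lower_solve(L, b))
assert np.allclose(A @ x, b)
```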
For realistic and large-scale multi-physics and multi-scale simulations, shared-memory SpTRSV implementations quickly become incapable of handling large linear systems, and one needs to turn to distributed-memory SpTRSV and LU factorization algorithms [3, 7-10, 16, 18-25, 25-27, 35, 36]. However, distributed-memory parallel SpTRSV algorithms are even more challenging, as communication quickly becomes dominant as the number of processors increases. Existing works include supernodal representation with a 2D process layout [12,13,22,29], non-blocked representation with a 1D process layout [41], multifrontal representation with sparse RHSs [34], and selective inversion-based algorithms for dense systems [32,43]. Among them, communication optimization techniques such as customized communication trees [29], one-sided MPI communication [13] and GPU-initiated communication [12] have been exploited, and performance prediction studies such as critical path analysis [12,13], roofline modeling [44] and machine learning-based performance tuning [1,15] have been considered. Despite these advances in distributed-memory SpTRSV algorithms, their strong scalability remains very limited for many classes of triangular matrices. More specifically, parallel 2D CPU SpTRSV can exhibit flattened strong scaling for large numbers of MPI ranks [13,29], and parallel 2D GPU SpTRSV can only scale up to the number of GPUs in one node (typically fewer than 10) [12].
In addition to the above-mentioned communication optimization strategies, communication-avoiding (CA) algorithms have drawn much interest in numerical linear algebra research, including sparse LU factorization [37,38], sparse GEMM [6], QR factorization [5], and Krylov methods [33]. Many of these algorithms rely on a 3D process layout of $P_x \times P_y \times P_z$ processes arranged as layers of 2D process grids that replicate the memory and computation loads across the grids. These algorithms oftentimes yield asymptotic reductions in communication at the cost of manageable memory overheads. Recently, a 3D CA SpTRSV algorithm [39] has also been developed by following the layout of the 3D CA sparse LU factorization algorithm [37]. The 3D SpTRSV algorithm divides the triangular matrix into multiple levels with a binary elimination tree, where each node corresponds to several submatrices that reside on one 2D grid. The algorithm proceeds with a bottom-up tree traversal where SpTRSV of each node requires only communication inside one 2D grid (intra-grid communication), and the solution subvectors need to be reduced across the 2D grids (inter-grid communication) before moving to the next level. This algorithm can effectively reduce the communication volume by a factor of $\sqrt{P_z}$ for matrices arising from many 2D and 3D PDEs [39].
Despite its preliminary success, neither the intra-grid communication pattern nor the inter-grid communication pattern of the 3D SpTRSV (henceforth referred to as the baseline algorithm) [39] is optimal: the algorithm requires $O(\log P_z)$ synchronizations between the 2D grids, which significantly increase runtime when the intra-grid computation and communication are load-imbalanced. Moreover, these synchronizations make it seemingly impossible to efficiently integrate other communication reduction techniques such as customized communication trees [29] and GPU-initiated one-sided communication [12]. As a result, the baseline algorithm can yield even worse performance than the 2D SpTRSV algorithms [29]. In this paper, we propose a unified 3D SpTRSV algorithm that can efficiently leverage multiple communication optimization strategies on both CPU and GPU clusters. The contributions of this paper include: (1) Develop a new 3D SpTRSV algorithm that requires only one inter-grid synchronization, in between the L and U solves. (2) Develop an efficient sparse allreduce communication scheme that dramatically reduces the inter-grid communication cost. (3) Integrate the communication tree-based optimization [29] to reduce message latency for the intra-grid communication on CPU clusters, which yields significantly faster SpTRSV time compared to the baseline algorithm. (4) Integrate NVSHMEM-based one-sided communication for intra-grid communication and GPU acceleration for the computation workloads on GPU clusters, which significantly improves the strong scalability of multi-GPU SpTRSV algorithms.
The rest of this paper is organized as follows: Section 2 briefly overviews SpTRSV with a supernodal representation and the baseline 3D SpTRSV algorithm. Section 3 describes the proposed synchronization-reduced 3D SpTRSV algorithm (Subsection 3.1) and sparse inter-grid communication (Subsection 3.2), as well as the integration of communication tree-based intra-grid communication optimization for both CPUs (Subsection 3.3) and GPUs (Subsection 3.4). Section 4 presents several numerical experiments to demonstrate the improved scalability of the proposed 3D SpTRSV algorithm on Cori Haswell CPU nodes, Perlmutter GPU nodes, and Crusher GPU nodes.

BACKGROUND

Overview of SpTRSV With Supernodal Representation
Consider an LU-factorized sparse $n \times n$ matrix $A = LU$. The solution vector (or matrix) $X$ subject to $AX = B$ with a dense right-hand side (RHS) vector (or matrix) $B$ can be computed by a lower-triangular SpTRSV (L-solve) $LY = B$ followed by an upper-triangular SpTRSV (U-solve) $UX = Y$. Furthermore, we assume a supernodal representation of $L$ and $U$ with $N$ supernodes from supernodal sparse direct solvers [7]. A supernode consists of a set of contiguous columns and rows whose nonzero patterns are similar across the rows and columns, respectively. Let $B(k)$, $Y(k)$ and $X(k)$ denote the subvectors corresponding to supernode $k$, and let $L(i,k)$ and $U(k,j)$ denote the nonzero submatrices corresponding to pairs of supernodes. Each $L(i,k)$, $i > k$, consists of a set of full rows, each $U(k,j)$, $j > k$, consists of a set of columns, and $L(k,k)$ and $U(k,k)$ are dense. $U(k,j)$ typically follows the "skyline" format, which allows each nonzero column to have a different length, but in this work we assume all nonzero columns in each $U(k,j)$ have the same length. Compared with column/row-based representations, these supernodal representations yield higher floating point (FP) operation performance. The subvectors $Y(k)$ and $X(k)$ can be computed as
$$Y(k) = L(k,k)^{-1}\Big(B(k) - \sum_{j<k} L(k,j)\,Y(j)\Big), \qquad X(k) = U(k,k)^{-1}\Big(Y(k) - \sum_{j>k} U(k,j)\,X(j)\Big). \quad (1)$$
Throughout this paper, we assume that the inverted diagonal blocks $L(k,k)^{-1}$ and $U(k,k)^{-1}$ are precomputed, and we focus on SpTRSV on distributed-memory machines. One can expect that communication will dominate the overall SpTRSV time as the number of (distributed-memory) processes increases, due to the low arithmetic intensity and the sequential nature of the DAG. Communication reduction techniques therefore become crucial for scaling SpTRSV to large numbers of CPU and GPU processors [12,13,29,39]. Next, we briefly review the CA 3D SpTRSV algorithm [39] as the baseline algorithm for the proposed unified communication optimization strategies.
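A minimal Python sketch of the supernodal L-solve of Eq. (1), treating each supernode as a contiguous index range; the function name and the `sup` partition are illustrative assumptions, not the paper's code.

```python
import numpy as np

def supernodal_lower_solve(L, b, sup):
    """Blocked forward substitution over supernode index ranges `sup`
    (a list of (start, end) column ranges), mirroring Eq. (1)."""
    y = np.zeros_like(b, dtype=float)
    for (ks, ke) in sup:
        # GEMV/GEMM with off-diagonal blocks L(k,j), j < k
        rhs = b[ks:ke] - L[ks:ke, :ks] @ y[:ks]
        # Dense diagonal-block solve with L(k,k)
        y[ks:ke] = np.linalg.solve(L[ks:ke, ks:ke], rhs)
    return y

rng = np.random.default_rng(1)
n = 8
L = np.tril(rng.random((n, n))) + n * np.eye(n)
b = rng.random(n)
sup = [(0, 3), (3, 5), (5, 8)]          # three supernodes of sizes 3, 2, 3
y = supernodal_lower_solve(L, b, sup)
assert np.allclose(L @ y, b)
```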

Baseline CA 3D SpTRSV Algorithm
The baseline algorithm [39] trades off some communication with memory replication, similar to several other CA algorithms [5,6,37].
As opposed to traditional distributed-memory SpTRSV algorithms [22,24] that use a 2D block-cyclic process layout with $P = P_x \times P_y$ processes, the CA 3D SpTRSV algorithm [39] uses a 3D process layout of $P = P_x \times P_y \times P_z$ processes consisting of $P_z$ 2D process grids. The sparse matrices $L$ and $U$ are distributed onto the $P_z$ grids as follows.
First, we assume that during the numerical factorization of $A$, an ordering of the matrix has been applied to reduce the number of fill-ins in $L$ and $U$, such as minimum degree ordering or nested-dissection (ND) ordering. The ordering generates a multilevel dependency tree, known as the elimination tree. Each node of the tree represents a group of supernodes, and the SpTRSVs of the nodes at the same level are independent of each other. Therefore, the elimination tree can be used to exploit parallelism in SpTRSV. In this paper, we use the ND ordering computed by METIS [17], assuming that the number of 2D grids $P_z$ is a power of two and the top $\log(P_z)$ levels of the elimination tree form a binary subtree. Following the elimination tree, the 3D SpTRSV algorithm finds the leaf level where the number of nodes is equal to $P_z$. Each leaf node, together with all its ancestors, is assigned to one 2D grid. In other words, the ancestor nodes are replicated across multiple 2D grids. Fig. 1(a) shows the distribution of the LU factors with $P_z = 4$ 2D grids according to the first three levels of the elimination tree. The root node 0 is replicated across all 2D grids, node 1 is replicated across Grid-0 and Grid-1, node 2 is replicated across Grid-2 and Grid-3, and each of the nodes 3, 4, 5, 6 belongs to one 2D grid. Here the submatrix of a node consists of the diagonal block corresponding to all supernodes of the node, as well as the off-diagonal blocks below and to the right of the diagonal block. The submatrix of each grid is distributed with a 2D block-cyclic layout using $P_x \times P_y$ processes. Fig. 1(b)(c) shows the 2D layout with $P_x = 2$ and $P_y = 3$ for the submatrix in Grid-3 of Fig. 1(a). Here we have assumed that the matrix $A$ has a symmetric nonzero pattern for simplicity.
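The leaf-and-ancestor assignment described above can be sketched as follows, assuming the implicit binary-heap numbering of the top tree levels (root = 0, children of node t are 2t+1 and 2t+2), consistent with the node numbers of Fig. 1(a); the helper name is hypothetical.

```python
def grids_owning(node, Pz):
    """Return the list of 2D grids that replicate `node`.
    Leaves (nodes Pz-1 .. 2*Pz-2) map one-to-one onto grids; every
    ancestor is replicated across the grids of all leaves below it."""
    leaves = list(range(Pz - 1, 2 * Pz - 1))
    owners = []
    for g, leaf in enumerate(leaves):
        t = leaf
        while True:
            if t == node:
                owners.append(g)
                break
            if t == 0:
                break
            t = (t - 1) // 2            # parent in the binary elimination tree
    return owners

Pz = 4
assert grids_owning(0, Pz) == [0, 1, 2, 3]  # root replicated on all grids
assert grids_owning(1, Pz) == [0, 1]        # node 1 on Grid-0 and Grid-1
assert grids_owning(2, Pz) == [2, 3]        # node 2 on Grid-2 and Grid-3
assert grids_owning(5, Pz) == [2]           # each leaf on exactly one grid
```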
Following this 3D layout, the CA SpTRSV algorithm [39] proceeds as follows. First, each leaf node performs 2D SpTRSV independently using its diagonal block. Once the solution subvectors corresponding to the node have been obtained, they are used to perform GEMV/GEMM with the off-diagonal blocks to generate partial summation results corresponding to the RHS of Eq (1). This step only involves intra-grid communication. Next, an inter-grid communication between a pair of grids is performed to reduce the partial summation results to the grid whose grid number is the smallest among all grids replicating the parent of the leaf node. All other grids among those grids remain idle thereafter. Next, the active grid performs 2D SpTRSV for the parent node and GEMV/GEMM operations with its off-diagonal blocks, which again requires inter-grid communication before moving to the next level. Using the example of Fig. 1(a), Grid-1 and Grid-3 are active only at level 2 (i.e., the leaf level), Grid-2 is active at levels 2 and 1, and only Grid-0 is active at all levels. Fig. 1(b) shows the computation and communication workloads at Grid-3.
It has been estimated and validated that, when compared with 2D SpTRSV using the same number of processes $P$, the 3D SpTRSV algorithm can effectively reduce the communication volume by a factor of $O(\sqrt{P_z})$ for many 2D and 3D PDEs [39]. That said, this algorithm has two main drawbacks due to the alternating intra-grid and inter-grid communication required at each level.
• Synchronization across the 2D grids is required at each level of the elimination tree, for a total of $\log P_z$ times, which can significantly increase the overall runtime due to possible load imbalance among diagonal and off-diagonal blocks of different grids.
• The separation of the diagonal block solve and the off-diagonal block GEMV/GEMM for each node of the tree makes it cumbersome to apply further communication optimizations within each grid, such as latency reduction with binary communication trees [29] and one-sided CPU/GPU communication [11,12].
These difficulties have to be addressed to further improve the parallel scalability of SpTRSV with large numbers of CPU and GPU nodes.

UNIFIED COMMUNICATION OPTIMIZATION STRATEGIES FOR 3D SPTRSV ALGORITHM
In this paper, we propose a unified communication optimization framework based on a novel 3D SpTRSV algorithm. The proposed 3D algorithm requires only one inter-grid synchronization at the cost of replicated computation, as opposed to the baseline 3D algorithm requiring $O(\log P_z)$ inter-grid synchronizations. As a result, the new 3D SpTRSV algorithm can integrate multiple communication optimization strategies such as sparse inter-grid communication, latency reduction [29] and one-sided GPU communication [12].

New 3D SpTRSV With Synchronization Reduction
The proposed 3D SpTRSV algorithm can be described using the example in Fig. 1(a). Unlike the baseline algorithm, where each grid is active for only a few bottom levels, the proposed algorithm ensures that all grids are active at all levels, allowing them to perform replicated computation (see Fig. 1(c) for Grid-3). Specifically, the baseline algorithm performs operations for each shaded block (see Fig. 1(b) for Grid-3) separately and proceeds level by level with inter-grid communication in between. In contrast, the proposed algorithm treats the blocks of one grid (including the replicated blocks) as one 2D distributed matrix, denoted $L^g$ and $U^g$ for Grid-$g$. These 2D solves require only one inter-grid communication/synchronization in between, at the cost of replicated computation. To ensure correct solutions, judiciously selected subvectors of the RHS $B^g$ for each 2D L-solve need to be set to 0, and a sparse allreduce operation is needed for the solution vectors $Y^g$ before the 2D U-solve. This procedure is illustrated in Fig. 2 using the same matrix as Fig. 1(a), except that the blocks associated with one node of the elimination tree are marked with the same color. The subvectors of the RHS corresponding to one node in Grid-$g$ are set to 0 if $g$ is not the smallest grid number sharing that node (see the RHS for each grid in Fig. 2). The 2D L-solve does not involve any inter-grid communication, which leads to partial solution vectors $Y^g(t)$ for non-leaf nodes $t$. Therefore, an allreduce operation is performed across the grids to obtain the complete solution vectors, which then become the RHSs for the 2D U-solve. Again, the 2D U-solve does not involve any inter-grid communication. This algorithm is summarized as Algorithm 1. Line 4 forms the proper RHS $B^g$ for $L^g$, Lines 11 and 21 perform the 2D L-solve and U-solve, whose communication costs can be further optimized based on CPU (see Section 3.3) or GPU (see Section 3.4) implementations, and Line 20 performs the inter-grid communication (see Section 3.2).
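Why summing the partial solutions recovers the correct result follows from the linearity of the triangular solve. A minimal two-grid sketch (block names and sizes are hypothetical, not the paper's code): Grid-0 and Grid-1 each own one leaf plus a replicated root block, the root RHS is zeroed on Grid-1, both grids run independent L-solves, and the root parts are then summed as in the sparse allreduce.

```python
import numpy as np

# Toy P_z = 2 example. Block ordering: [leaf-0 | leaf-1 | replicated root].
rng = np.random.default_rng(2)
m = 3                                    # block size
D = lambda: np.tril(rng.random((m, m))) + m * np.eye(m)  # diagonal block
L11, L22, Lrr = D(), D(), D()
Lr1, Lr2 = rng.random((m, m)), rng.random((m, m))        # root-row couplings
Z = np.zeros((m, m))
L = np.block([[L11, Z, Z], [Z, L22, Z], [Lr1, Lr2, Lrr]])
b = rng.random(3 * m)

# Grid-0 owns [leaf-0, root] with the true root RHS; Grid-1 owns
# [leaf-1, root] with the root RHS zeroed (Grid-1 is not the smallest
# grid number sharing the root node).
L0 = np.block([[L11, Z], [Lr1, Lrr]])
L1 = np.block([[L22, Z], [Lr2, Lrr]])
b0 = np.concatenate([b[:m], b[2 * m:]])
b1 = np.concatenate([b[m:2 * m], np.zeros(m)])
y0 = np.linalg.solve(L0, b0)             # independent 2D L-solve on Grid-0
y1 = np.linalg.solve(L1, b1)             # independent 2D L-solve on Grid-1

y = np.empty(3 * m)
y[:m], y[m:2 * m] = y0[:m], y1[:m]       # leaf parts are already complete
y[2 * m:] = y0[m:] + y1[m:]              # allreduce: sum partial root parts
assert np.allclose(L @ y, b)             # matches the global L-solve
```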
Remark. Although the proposed 3D SpTRSV algorithm introduces extra FP operations and communication due to replicated computation, the FP operations are performed in parallel among all the 2D grids, whereas many 2D grids stay idle at higher tree levels in the baseline 3D SpTRSV algorithm. Therefore, the FP operations do not add extra overall runtime to the proposed algorithm. That said, the extra communication introduced by the proposed 3D SpTRSV algorithm can have an impact on the total communication time in each 2D grid. However, it can be highly optimized using the communication schemes in Subsection 3.2.

Inter-Grid Communication Optimization
The inter-grid communication and synchronization are needed between the 2D L-solve and U-solve of Algorithm 1 to reduce the partial solution vectors $Y^g(t)$ for all grids $g$ sharing a non-leaf node $t$. Straightforward implementations using MPI_Allreduce for each node $t$ can become costly both in terms of latency and synchronization. Instead, we propose a sparse allreduce algorithm which requires only $O(\log P_z)$ pairwise sends and receives. The algorithm is illustrated in Fig. 3 using the example of Fig. 2, and is composed of a sparse reduce phase and a sparse broadcast phase. The reduce phase (Fig. 3(a)) proceeds from leaf to root: Step ① performs pairwise sends/receives between two grid pairs for $[Y_0, Y_1]$ and $[Y_0, Y_2]$; Step ② performs a pairwise send/receive between Grid-0 and Grid-2 for $Y_0$. Similarly, the broadcast phase (Fig. 3(b)) proceeds from root to leaf: Step ③ performs a pairwise send/receive between Grid-0 and Grid-2 for $Y_0$; Step ④ performs pairwise sends/receives between two grid pairs for $[Y_0, Y_1]$ and $[Y_0, Y_2]$. Note that each $Y_t$ consists of many supernodes, which are distributed according to the 2D block-cyclic layout of $L^g$. That said, each process communicates only $O(\log P_z)$ times when the communication buffer packs data from all the required supernodes. This sparse allreduce algorithm is summarized as Algorithm 2.
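The reduce-then-broadcast pattern can be sketched as a serial simulation over the $P_z$ grids. Note the real algorithm reduces only the packed supernodes shared by each grid pair, whereas this toy version reduces a single scalar per grid; the function name is illustrative.

```python
import math

def sparse_allreduce(vals):
    """Simulate the pairwise reduce phase (leaf to root) followed by the
    pairwise broadcast phase (root to leaf) across P_z grids, so each
    grid sends/receives only O(log P_z) messages."""
    Pz = len(vals)
    steps = int(math.log2(Pz))
    # Reduce phase: at step s, grid g with g % 2**(s+1) == 2**s
    # sends its partial value to grid g - 2**s.
    for s in range(steps):
        for g in range(Pz):
            if g % 2 ** (s + 1) == 2 ** s:
                vals[g - 2 ** s] += vals[g]
    # Broadcast phase: reverse the pattern so every grid gets the sum.
    for s in reversed(range(steps)):
        for g in range(Pz):
            if g % 2 ** (s + 1) == 2 ** s:
                vals[g] = vals[g - 2 ** s]
    return vals

assert sparse_allreduce([1, 2, 3, 4]) == [10, 10, 10, 10]
assert sparse_allreduce([5, 7]) == [12, 12]
```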

Intra-Grid Communication Optimization on CPU
For each 2D L-solve and U-solve, the intra-grid communication can be further improved by reducing message latency using customized communication trees [29]. In Subsections 3.3 and 3.4, we only describe the communication optimization strategies for the L-solve, as the U-solve algorithm is very similar. Consider the $L^g$ matrix in Fig. 1(c) of Grid-3 with a $2 \times 3$ process layout. For each supernode column $k$, the process computing $Y(k)$ needs to send it to the processes in charge of $L(i,k)$ to perform the partial sums (see Eq (1)), for all $i > k$ residing on a different process, which is a broadcast operation for each column $k$. Similarly, for each supernode row $i$, the process computing partial sums $L(i,k)\,Y(k)$ for all $L(i,k)$ it owns needs to send them to the process storing $L(i,i)^{-1}$ (see Eq (1)). This induces a reduction operation for each row $i$. These broadcast and reduction operations can be improved using customized binary communication trees (one per column and row) to significantly reduce the total message counts [29]. For ease of illustration, Algorithm 3 summarizes the communication tree-enhanced 2D L-solve with a $P_x \times 1$ process layout, where only the broadcast operation is needed. Our numerical results in Section 4.1 will demonstrate the functionality and efficiency of using $P_x \times P_y$ process layouts with both broadcast and reduction operations. In Algorithm 3, we use an indicator array to keep track of the number of $Y(k)$ values each row $i$ needs to receive. The algorithm is MPI message driven, with a blocking MPI_Recv (Line 7 of Algorithm 3) for any incoming message until the total number of expected messages has been received by each process.

[Algorithm 1: Proposed 3D SpTRSV on a 3D grid of size $P_x \times P_y \times P_z$ — listing not recovered from the source.]

[Algorithm 2: SparseAllReduce of the partial solution vectors after the 3D L-solve across $P_z$ grids, using pairwise inter-grid sends/receives and reductions — listing not recovered from the source.]
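The binary broadcast tree idea can be sketched as follows: the owner of $Y(k)$ acts as the tree root, and each participating rank forwards to at most two children, shortening the critical path from $p-1$ sequential sends to about $\log_2 p$ forwarding levels. The helper names below are illustrative, not the paper's implementation.

```python
import math

def broadcast_tree(ranks):
    """Map each rank in `ranks` (ranks[0] is the root, i.e. the owner
    of Y(k)) to the list of children it forwards the message to,
    using an implicit binary-heap layout."""
    children = {r: [] for r in ranks}
    for i, r in enumerate(ranks):
        for c in (2 * i + 1, 2 * i + 2):
            if c < len(ranks):
                children[r].append(ranks[c])
    return children

def forwarding_levels(n):
    """Critical-path length (levels of message forwarding) for n ranks."""
    return math.ceil(math.log2(n + 1))

tree = broadcast_tree([4, 0, 2, 6, 1])
assert tree[4] == [0, 2]            # root forwards to two children
assert tree[0] == [6, 1]
assert tree[2] == []                # leaves forward nothing
assert forwarding_levels(5) == 3    # vs 4 sequential sends in a flat broadcast
```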
Remark. The proposed 3D SpTRSV algorithm with synchronization reduction (see Fig. 1(c)) makes it very convenient to integrate the binary communication tree-based optimization, as each grid treats $L^g$ as a single-level 2D block-cyclic distributed matrix. For example, each row and column in Fig. 1(c) requires only one broadcast or reduction tree. In contrast, the baseline 3D SpTRSV algorithm needs to compute the broadcast trees and reduction trees for each row and column of every diagonal and off-diagonal block. For example, each row and column in Fig. 1(b) may require multiple broadcast and reduction trees (i.e., for the orange, blue and pink colored blocks), which unfavorably increases the message counts and greatly complicates the implementation.

Intra-Grid Communication Optimization on GPU
We also implement the proposed 3D SpTRSV algorithm on GPU clusters using synchronization reduction with replicated computation and latency reduction with communication trees. In all our implementations, we assume that one GPU is assigned to one MPI rank for simplicity. As in Subsection 3.3, we only describe the L-solve implementation for simplicity. In our implementation, it is assumed that $L^g$, $B^g$ and $Y^g$ reside on the GPU and the computation (i.e., GEMV/GEMM involving $L(i,k)$ and $L(k,k)^{-1}$) is performed on the GPU. While the SparseAllReduce operation in Line 20 of Algorithm 1 is implemented with MPI, the communication operations in the 2D L-solve and U-solve cannot use MPI efficiently. This is due to the fact that MPI communications are CPU-initiated (even for GPU-aware MPI) and the CPU control flow will significantly slow down performance due to the sequential nature of SpTRSV. In other words, both the CPU and GPU would need to keep track of the DAG of SpTRSV with frequent CPU-GPU data communication. Instead of MPI, we rely on GPU-initiated one-sided communication libraries such as NVSHMEM to implement the binary communication trees. The GPU-initiated communication feature allows the code to execute both computation and communication in GPU kernels without CPU interference, and hence traverse the DAG much more efficiently. Before describing the NVSHMEM-based 2D L-solve, we first describe the single-GPU accelerated L-solve algorithm. The reason is two-fold: 1) When $P_x = P_y = 1$ and $P_z > 1$, the GPU 2D L-solve of $L^g$ requires no intra-grid communication, hence the L-solve kernel can be greatly simplified. 2) Currently, the GPU-initiated communication-based 2D solves only work on NVIDIA GPUs with NVSHMEM. The AMD GPU counterpart, ROC_SHMEM, currently does not support MPI subcommunicators, which are used throughout the 3D SpTRSV algorithm. Adding support for MPI subcommunicators in ROC_SHMEM would enable significantly improved scalability of SpTRSV for large numbers of GPU nodes. In this paper, our numerical results for AMD GPUs will only consider $P_x = P_y = 1$. The single-GPU 2D L-solve algorithm is described in Algorithm 4. The kernel assigns one thread block (with ID $k$) per supernode column $k$, whose thread 0 spin-waits on the dependency indicator of column $k$ until $Y(k)$ can be computed. The GEMV/GEMM operations for $L(i,k)\,Y(k)$, $i > k$, are parallelized over the threads in each thread block. When the number of RHSs nrhs is 1, the GEMV is parallelized over the row dimension; when nrhs > 1, each GEMM $L(i,k)\,Y(k)$ is implemented as dense blocked operations involving only the nonzero rows of $L(i,k)$ using shared memory, similar to MAGMA's GEMM implementation [40].
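The dependency-indicator scheme of the single-GPU kernel can be emulated on the CPU with per-row counters and a ready queue standing in for the spin-waiting thread blocks; this is a sketch of the scheduling idea, not the paper's CUDA kernel.

```python
import numpy as np

def counter_driven_lower_solve(L, b):
    """Each column k 'fires' (its diagonal solve runs) only when its
    dependency counter reaches zero, i.e. all updates L(i,k) * y(k)
    from earlier columns have been applied, emulating the per-block
    spin wait of the GPU kernel."""
    n = len(b)
    y = np.zeros(n)
    cnt = [np.count_nonzero(L[i, :i]) for i in range(n)]  # deps per row
    ready = [i for i in range(n) if cnt[i] == 0]
    while ready:
        k = ready.pop()
        y[k] = b[k] / L[k, k]            # diagonal solve for column k
        for i in range(k + 1, n):        # "GEMV" updates with L(i,k)
            if L[i, k] != 0:
                b[i] -= L[i, k] * y[k]
                cnt[i] -= 1
                if cnt[i] == 0:
                    ready.append(i)      # dependency met: unblock row i
    return y

rng = np.random.default_rng(3)
n = 7
L = np.tril(rng.random((n, n))) + n * np.eye(n)
strict = np.tril(np.ones((n, n), dtype=bool), -1)
L[strict & (rng.random((n, n)) < 0.5)] = 0.0   # sparsify below the diagonal
b = rng.random(n)
y = counter_driven_lower_solve(L, b.copy())
assert np.allclose(L @ y, b)
```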
Based on the single-GPU L-solve implementation, we briefly summarize the NVSHMEM-based 2D L-solve algorithm for NVIDIA GPUs in Algorithm 5. As in Subsection 3.3, we only describe the case of $P_x \times 1$ 2D grids for simplicity; the complete 2D L-solve algorithm can be found in [12]. NVSHMEM has a major limitation: the number of thread blocks that can be concurrently scheduled on a GPU equals the number of SMs. This design avoids potential deadlocks when using point-to-point synchronization in a CUDA kernel. However, the limitation would significantly restrict SpTRSV concurrency. To address it, two kernels (SOLVE and WAIT) running on two CUDA streams are utilized. The SOLVE kernel assigns one thread block (with ID $k$) per supernode column $k$ that the local GPU owns. If the GPU is the diagonal process handling $k$ and the dependency for column $k$ has been met, the GEMV/GEMM operation involving $L(k,k)^{-1}$ is parallelized over the threads of the thread block (Line 7). Otherwise, if the GPU is an off-diagonal process (Line 11), thread 0 spin-waits on a flag indicating whether $Y(k)$ (in a receive buffer) has been received (Line 12). Once received, $Y(k)$ is forwarded along the binary broadcast tree of column $k$ using NVSHMEM_SEND (Line 13). Then the GEMV/GEMM operations involving $L(i,k)$, $i > k$, are parallelized over the threads and the dependency indicators are updated, just like the single-GPU algorithm of Algorithm 4. Recall that the number of thread blocks of the SOLVE kernel equals the number of supernode columns one GPU handles. The WAIT kernel has only one thread block, where all threads wait for incoming messages using nvshmem_int_wait_until_any (Line 22).

NUMERICAL RESULTS
In this section, we present several numerical experiments to demonstrate the superior performance of the proposed 3D SpTRSV algorithm compared with the baseline 3D SpTRSV algorithm [39] and the existing 2D SpTRSV algorithms [12,29] using three leadership supercomputing systems: Cori Haswell, Perlmutter and Crusher.
The CPU-only experiments are performed on the Cori Haswell system at NERSC. Cori Haswell is a Cray XC40 system and consists of 2388 dual-socket nodes with Intel Xeon E5-2698v3 processors running 16 cores per socket. The nodes are equipped with 128 GB of DDR4 memory and are connected through the Cray Aries interconnect. The GPU experiments (including the reference CPU experiments) are performed on the Crusher system at OLCF and the Perlmutter system at NERSC. Crusher is a testbed system for the Frontier exascale machine; each node consists of a 64-core AMD EPYC 7A53 CPU and 4 AMD MI250X GPUs (8 Graphics Compute Dies), each with 64 GB of HBM2 memory. The nodes are connected with the HPE Slingshot interconnect with a 25 GB/s bandwidth. Perlmutter (the GPU partition) is an HPE Cray EX system and consists of 1536 GPU nodes, each with a 64-core AMD EPYC 7763 CPU, 4 NVIDIA A100 GPUs, and 40 GB of HBM memory per GPU. The nodes are connected with the HPE Slingshot 11 interconnect with a 25 GB/s bandwidth.
All the benchmark matrices are listed in Table 1; they arise from various applications such as optimization problems, structural computation, quantum chemistry calculation, fusion plasma simulation, finite-difference discretization of Poisson equations, and finite element discretization of Maxwell equations. All these matrices except s1_mat_0_253872 and s2D9pt2048 are publicly available via the SuiteSparse Matrix Collection [4]. The LU factors are generated by running the 3D numerical factorization algorithm in SuperLU_DIST [25] with the METIS ordering for fill-in reduction [17]. The sizes and nonzero counts of the LU factors are also listed in Table 1.

Performance of 3D SpTRSV on CPU clusters
First, we demonstrate the improved scalability and efficiency of the proposed 3D SpTRSV algorithm using the Cori Haswell CPU nodes. We test the runtime of the baseline and proposed 3D SpTRSV algorithms by varying the total MPI count $P = P_x \times P_y \times P_z$ from 128 to 2048 and changing $P_z$ from 1 to 32. When fixing $P$ and $P_z$, the 2D grid sizes are set such that $P_x \approx P_y$. Fig. 4 shows the runtime of the two algorithms for four matrices: s2D9pt2048, nlpkkt80, ldoor and dielFilterV3real. It is worth noting that the proposed algorithm with $P_z = 1$ reduces to the 2D SpTRSV algorithm [29] with latency optimization, corresponding to the red solid curves. For both the baseline and the proposed algorithms, increasing $P_z$ up to 16 leads to improved runtime. However, the proposed algorithm consistently outperforms the baseline 3D SpTRSV [39] and the 2D communication-optimized SpTRSV [29]. For the four matrices s2D9pt2048, nlpkkt80, ldoor and dielFilterV3real, the proposed algorithm achieves up to 3.45x, 1.87x, 1.13x and 1.98x speedups, respectively, compared to the baseline 3D algorithm. Moreover, it achieves up to 2.2x, 1.1x, 2.1x and 1.43x speedups, respectively, compared to the 2D communication-optimized SpTRSV. It is worth noting that, without further communication optimization, the baseline 3D algorithm can be worse than the communication-optimized 2D algorithm; for example, see the solid red curves and dashed green curves for matrices ldoor and nlpkkt80.
To better understand the performance gains, we show the time breakdown (averaged over all MPI ranks) into inter-grid communication, intra-grid communication and FP operations in Fig. 5 and Fig. 6 for matrices s2D9pt2048 and nlpkkt80, respectively. We compare the results of the baseline 3D SpTRSV algorithm [39] and the proposed 3D SpTRSV algorithm. In Fig. 5, one can see that the optimizations in Subsections 3.1 and 3.2 significantly reduce the inter-grid communication time, particularly when $P_z$ is large; the optimization in Subsection 3.3 significantly reduces the intra-grid communication time, particularly when $P_x \times P_y$ is large; and the algorithm in Subsection 3.1 introduces extra FP operations, which however do not contribute much to the overall runtime, as the replicated FP operations are performed in parallel on different grids. Unlike the 2D PDE discretized matrix s2D9pt2048 in Fig. 5, the 3D PDE discretized matrix nlpkkt80 (see Fig. 6) introduces asymptotically more replicated computation and intra-grid communication when $P_z$ is large, as can be seen from the bottom middle figure. The increased intra-grid communication for large $P_z$ leads to worse 3D SpTRSV performance.
For the s2D9pt2048 matrix, we also measured the load balance when varying $P_z$. In Fig. 7, we plot the time spent in the L- and U-solves across all MPI ranks given $P = 128$ and $P = 1024$. The load imbalance is indicated by the error bars, where both the baseline and proposed 3D SpTRSV exhibit reasonable load balance.
For the nlpkkt80 matrix, we measured the load balance when varying $P_z$. In Fig. 8, we plot the time spent in the L- and U-solves across all MPI ranks given $P = 128$ and $P = 1024$. When $P_z$ is large, the baseline code shows large imbalance, while the proposed code shows good balance. Although the proposed code shows increased CPU time averaged over the ranks due to duplicated computation, it still achieves decreased overall CPU time, which is the maximum over the ranks.

Figure 4: SpTRSV time using the Cori Haswell machine as the total MPI count $P_x \times P_y \times P_z$ varies. For each $P_z$ value (one curve), the 2D grid $(P_x, P_y)$ is set as square as possible. "Baseline" denotes the baseline 3D SpTRSV [39] and "New" denotes the proposed 3D SpTRSV.

Performance of 3D SpTRSV on GPU clusters
This subsection provides several numerical examples to demonstrate the enhanced scalability of the proposed 3D SpTRSV algorithm on both AMD and NVIDIA GPUs, compared with the proposed 3D SpTRSV algorithm on CPUs and the existing multi-GPU 2D SpTRSV algorithm [12]. As mentioned in Subsection 3.4, we show scalability results only for NVIDIA GPUs for the more general $P_x \times P_y \times P_z$ layouts. In all experiments below, we assume one GPU is assigned to one MPI rank. For each configuration with GPU solves, we also include the results with CPU solves (using the proposed 3D SpTRSV algorithm instead of the baseline 3D SpTRSV algorithm) obtained from the same system using only CPU cores.
4.2.1 $1 \times 1 \times P_z$ layouts on Crusher and Perlmutter. We first test the performance of the proposed 3D SpTRSV algorithms on CPUs and GPUs using the Crusher system with AMD MI250X GPUs. We tested the runtime of the CPU and GPU SpTRSV algorithms on the matrices s1_mat_0_253872, s2D9pt2048 and ldoor by varying $P_z$ from 1 to 64 (i.e., 8 compute nodes), with both 1 RHS and 50 RHSs. Fig. 9 shows the total runtime, L-solve time, U-solve time and inter-grid communication time for each configuration. For most matrices, the inter-grid communication time is negligible due to the efficient sparse allreduce operation of Subsection 3.2. The CPU 3D SpTRSV algorithm exhibits good scalability up to $P_z = 64$, while the GPU 3D SpTRSV algorithm exhibits good scalability up to around $P_z = 16$. The deteriorated scalability for larger $P_z$ is largely due to the increased FP operations, as there is no intra-grid communication.
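The sparse allreduce of Subsection 3.2 (see also Fig. 3) combines solution contributions from the $P_z$ grids while touching only nonzero entries. A minimal functional sketch, with sparse vectors modeled as index-to-value dictionaries (the real implementation communicates packed index/value buffers over MPI):

```python
def sparse_allreduce(contributions):
    """Sparse reduce (sum entries over the union of nonzero indices)
    followed by a sparse broadcast (every grid receives the result)."""
    reduced = {}
    for vec in contributions:                 # (a) sparse reduce step
        for idx, val in vec.items():
            reduced[idx] = reduced.get(idx, 0.0) + val
    return [dict(reduced) for _ in contributions]  # (b) sparse broadcast step
```

Because only the union of nonzero indices is exchanged, the cost stays proportional to the sparsity of the contributions rather than to the full vector length.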
The CPU-to-GPU speedups for these three matrices are up to 1.6x, 1.8x and 1.7x for 1 RHS, and 2.9x, 2.2x and 2.2x for 50 RHSs. Moreover, the L-solve shows higher CPU-to-GPU speedups than the U-solve. This is largely due to the reversed computation order and less coalesced memory access patterns of the U-solve compared to the L-solve. Next, we perform similar tests of the proposed 3D SpTRSV algorithms on CPUs and GPUs using the Perlmutter system with NVIDIA A100 GPUs. We tested the runtime of the CPU and GPU SpTRSV algorithms on the matrices s1_mat_0_253872, s2D9pt2048, nlpkkt80 and dielFilterV3real by varying $P_z$ from 1 to 64 (i.e., 16 compute nodes), with both 1 RHS and 50 RHSs. Fig. 10 shows the total runtime, L-solve time, U-solve time and inter-grid communication time for each configuration. The observations are similar to those on Crusher: both the CPU and GPU 3D SpTRSV algorithms exhibit scalability up to $P_z = 64$. For example, when $P_z$ changes from 1 to 16 for matrix dielFilterV3real with 1 RHS, the CPU and GPU 3D SpTRSV algorithms exhibit speedups of 9.5x and 7x, respectively. Moreover, the CPU-to-GPU speedups for these four matrices are up to 6.5x, 4.6x, 4.8x and 5x for 1 RHS, and 5.2x, 3.7x, 4.1x and 4x for 50 RHSs. These speedup numbers are much higher than those on Crusher.
4.2.2 $P_x \times 1 \times P_z$ layouts on Perlmutter. Finally, we demonstrate the scalability of the proposed 3D SpTRSV algorithm using the Perlmutter system with NVIDIA A100 GPUs with $P_z \geq 1$ and $P_x \times P_y > 1$, which requires the NVSHMEM-based 2D L- and U-solves described in Subsection 3.4. As reported in [12], for NVSHMEM-based 2D solves, the best performance is obtained with $P_y = 1$ due to the slower reduction performance compared to the broadcast performance. Therefore, in this subsection we set $P_y = 1$ and only vary $P_x$ and $P_z$.
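The broadcast/reduction trees used for intra-grid communication trade $O(P)$ serialized sends at the root for $O(\log_2 P)$ rounds. A sketch of a root-0 binomial broadcast schedule (illustrative; the solver's actual tree shapes may differ):

```python
def binomial_broadcast_schedule(P):
    """Per-round (sender, receiver) pairs for a binomial-tree broadcast
    from rank 0 to ranks 0..P-1: in each round, every rank that already
    holds the data forwards it, so the round count is ceil(log2(P))."""
    schedule, have, dist = [], {0}, 1
    while dist < P:
        rnd = [(s, s + dist) for s in sorted(have) if s + dist < P]
        schedule.append(rnd)
        have.update(r for _, r in rnd)
        dist *= 2
    return schedule
```

For P = 8 this takes 3 rounds instead of 7 sequential sends from a flat root, which is the message-latency saving the communication trees exploit.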
We tested four matrices s1_mat_0_253872, nlpkkt80, Ga19As19H42 and dielFilterV3real with 1 RHS, varying $P_z$ from 1 to 64 and $P_x$ from 1 to 4, as shown in Fig. 11. First, it is worth noting that when $P_z = 1$, the 3D GPU SpTRSV algorithm reduces to the NVSHMEM-enhanced 2D GPU SpTRSV algorithm [12]. As shown by the solid red curves in most sub-figures of Fig. 11, the 2D SpTRSV stops scaling at $P = P_x = 8$, which requires NVSHMEM to send data across multiple nodes (recall that each Perlmutter GPU node has 4 GPUs). This is expected, as the peak intra-node and inter-node bandwidths per direction per GPU on Perlmutter are 300 GB/s and 12.5 GB/s, respectively. Such a large bandwidth difference suggests that 2D GPU SpTRSV may not benefit from using more than one GPU node. Therefore, all other data points in Fig. 11 only vary $P_x$ from 1 to 4 to ensure all NVSHMEM communications are confined within one GPU node. Next, we observe that as $P_x$ and $P_z$ increase, both the CPU and GPU 3D SpTRSV algorithms exhibit good scalability. That said, given a fixed number of $P = P_x \times 1 \times P_z$ GPUs, larger $P_z$ can yield better performance than larger $P_x$, meaning the 3D GPU SpTRSV scales better with increasing $P_z$. Finally, we remark that, as opposed to the 2D GPU SpTRSV [12] that only scales up to 4 GPUs, the proposed 3D GPU SpTRSV can scale to 256 GPUs. To the best of our knowledge, this represents the most scalable GPU SpTRSV implementation to date.
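The impact of the intra-node versus inter-node bandwidth gap can be seen with a simple alpha-beta (latency + bandwidth) cost model, using the peak bandwidths quoted above (the latency term defaults to zero; any nonzero value supplied would be a hypothetical placeholder):

```python
def transfer_time_us(nbytes, bandwidth_gb_s, latency_us=0.0):
    """Alpha-beta model: time = latency + size / bandwidth,
    returned in microseconds (1 GB/s = 1e3 bytes/us)."""
    return latency_us + nbytes / (bandwidth_gb_s * 1e3)

# Moving 1 MB between GPUs at the peak per-GPU bandwidths from the text:
intra = transfer_time_us(1_000_000, 300.0)   # intra-node interconnect
inter = transfer_time_us(1_000_000, 12.5)    # inter-node network
```

The per-byte cost across nodes is 300/12.5 = 24x the intra-node cost, which is why confining NVSHMEM traffic within one GPU node is so consequential.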

CONCLUSIONS
This paper presents a novel communication optimization framework for enhancing the scalability of SpTRSV algorithms on CPU and GPU clusters. The framework builds upon a 3D communication-avoiding parallel layout which decomposes the sparse matrix into submatrices mapped to multiple 2D process grids. Each grid handles a judiciously selected submatrix. The proposed framework advances the existing 3D SpTRSV algorithm with four communication improvements: a process layout with reduced inter-grid synchronization, an efficient inter-grid sparse allreduce communication scheme, integration of a latency reduction scheme for intra-grid communication on CPU clusters, and integration of a GPU-initiated one-sided communication scheme for GPU clusters. The resulting 3D SpTRSV algorithm exhibits up to 3.45x speedups compared to the baseline 3D SpTRSV algorithm using up to 2048 Cori Haswell CPU cores, and scales up to 256 Perlmutter GPUs, whereas existing 2D multi-GPU SpTRSV algorithms can only scale to 4 GPUs.
Although the unified communication optimization framework is only applied to the SpTRSV algorithm in this paper, the general take-away message is that the proposed domain-decomposition-type hierarchical communication optimization framework applies to many 3D communication-avoiding algorithms and their GPU implementations. For each 2D subdomain or subproblem, GPUs can accelerate fine-grained and localized computation, but lack MPI scalability due to excessive communication. GPU-initiated one-sided communication and CPU-based communication optimization can improve their scalability to some extent. On the other hand, adding a third dimension to the parallel layout can increase coarse-level parallelism at the expense of a limited amount of replicated memory and computation. Moreover, it can be very beneficial to avoid intertwining intra-grid and inter-grid communication so that the subdomains can be handled as independently as possible.

Machines:
The CPU-only experiments are performed on the Cori Haswell system at NERSC (Fig. 4-8). The GPU experiments (including the reference CPU experiments) are performed on the Crusher system at OLCF (Fig. 9) and the Perlmutter system at NERSC (Fig. 10-11).

Figure 1 :
Figure 1: Parallel data layout of 3D SpTRSV with $P_x = 2$, $P_y = 3$ and $P_z = 4$, assuming an LU-factorized symmetric sparse matrix. (a) Mapping of the matrix onto the 4 2D grids. (b) 2D process layout for computation tasks in Grid-3 of the baseline algorithm [39]. (c) 2D process layout for computation tasks in Grid-3 of the proposed algorithm.

Figure 2 :
Figure 2: Workflow of the proposed 3D SpTRSV algorithm. 2D L-solve and U-solve only involve intra-grid communication, and SparseAllReduce only involves inter-grid communication.

Figure 3 :
Figure 3: The SparseAllReduce phase of Fig. 2 with 4 2D grids, consisting of (a) a sparse reduce step and (b) a sparse broadcast step.

Figure 5 :
Figure 5: Time breakdown in seconds (averaged over all MPI ranks) of the s2D9pt2048 matrix using the baseline and proposed 3D SpTRSV algorithms on the Cori Haswell machine. Z-Comm denotes inter-grid communication time, XY-Comm denotes intra-grid communication time, and FP-Operation denotes the floating-point operation time.

Figure 6 :
Figure 6: The corresponding time breakdown for the nlpkkt80 matrix.

Figure 7 :
Figure 7: Load balance for the s2D9pt2048 matrix of the baseline and proposed 3D SpTRSV algorithms using (a) $P = 128$ and (b) $P = 1024$. Here the colored bars show the mean value across all MPI ranks and the error bars show the maximum and minimum values across all MPI ranks. The left and right subfigures show the data for the L and U solve phases, respectively. Note that the Z-Comm time is not included.

Figure 8 :
Figure 8: Load balance for the nlpkkt80 matrix of the baseline and proposed 3D SpTRSV algorithms using (a) $P = 128$ and (b) $P = 1024$. Here the colored bars show the mean value across all MPI ranks and the error bars show the maximum and minimum values across all MPI ranks. The left and right subfigures show the data for the L and U solve phases, respectively. Note that the Z-Comm time is not included.

Figure 9 :
Figure 9: Time breakdown (averaged over all MPI ranks) of the proposed GPU 3D SpTRSV algorithm with 1 and 50 RHSs on the Crusher machine as $P_x = P_y = 1$ and $P_z$ varies. "L-Solve" and "U-Solve" denote the time spent in each 2D grid; $P_x = P_y = 1$ leads to no intra-grid communication.

Figure 10 :
Figure 10: Time breakdown (averaged over all MPI ranks) of the proposed GPU 3D SpTRSV algorithm with 1 and 50 RHSs on the Perlmutter machine as $P_x = P_y = 1$ and $P_z$ varies. "L-Solve" and "U-Solve" denote the time spent in each 2D grid; $P_x = P_y = 1$ leads to no intra-grid communication.

Figure 11 :
Figure 11: Runtime of the proposed 3D SpTRSV algorithm using the Perlmutter machine as the total MPI counts $P_x \times P_y \times P_z$ vary. For each $P_z$ value (one curve), the 2D grid ($P_x$, $P_y$) is set with $P_y = 1$. "CPU" denotes the CPU 3D SpTRSV and "GPU" denotes the GPU 3D SpTRSV with one GPU per MPI rank.

Table 1 :
Test matrices.Density := {nonzeros in  } /  2 4, ROC-SHMEM for AMD GPUs does not yet support the use of MPI subcommunicators and hence the proposed 3D SpTRSV cannot use more than 1 GPUs per 2D grid.Therefore, we shows scalability results for both AMD and NVIDIA GPUs for 1 × 1 ×   layouts, but