A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs

Sparse-Matrix Dense-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) are important sparse kernels in various computation domains. The uneven distribution of nonzeros in the sparse matrix and the tight data dependence between the sparse and dense matrices make it challenging to run sparse matrix multiplication efficiently on GPUs. By analyzing the aforementioned problems, we propose a row decomposition (RoDe)-based approach to optimize the two kernels on GPUs, using the standard Compressed Sparse Row (CSR) format. Specifically, RoDe divides the sparse matrix rows into regular parts and residual parts, to fully optimize their computations separately. We also devise the corresponding load balancing and fine-grained pipelining techniques. Profiling results show that RoDe achieves more efficient memory access and significantly reduces warp stall cycles. Compared to the state-of-the-art (SOTA) alternatives, RoDe achieves a speedup of up to 7.86× with a geometric mean of 1.45× for SpMM, and a speedup of up to 8.99× with a geometric mean of 1.49× for SDDMM, on the SuiteSparse dataset. RoDe also outperforms its counterparts on a deep learning dataset. Furthermore, its preprocessing overhead is significantly smaller, averaging only 16% of the SOTA's.


Introduction
Sparse-Matrix Dense-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) are key kernels in many areas such as scientific computing, machine learning, and data analysis. Specifically, SpMM is widely used in scientific computing, for example in the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method [2] and in iterative solvers with multiple right-hand sides. It is also one of the basic operators in Graph Neural Networks [16] and in the sparsity exploitation of deep learning [15,21]. Similarly, SDDMM is the core operation [4] of factorization algorithms [18,19] and topic modeling algorithms [3,26] in machine learning.
Recently, with the widespread application of deep learning, SpMM and SDDMM on GPUs have attracted more and more attention [5,9,20,28]. It should be noted that implementing SpMM through either multiple invocations of highly optimized Sparse Matrix-Vector multiplication (SpMV) or directly applying SpMV optimization strategies does not yield good performance, due to the different data reuse patterns [10]. So new strategies are needed.
SOTA studies have proposed several solutions. ASpT [11] proposed an adaptive tiling technique to improve data locality for the sparse matrix, with the primary idea being the construction of dense tiles through column reordering. Sputnik [9] proposed the ROMA technique for regularizing sparse rows to enable vector memory instructions, along with the row swizzle technique for load balancing on GPUs.
However, the distribution pattern of nonzeros in sparse matrices limits the effectiveness of certain techniques proposed in these studies. For instance, when applying the ASpT method, many matrices from the SuiteSparse collection [6] show less than 10% speedup or even a slowdown, primarily due to the scarcity of dense columns in these matrices [13]. Similarly, the uneven distribution of long and short rows in the matrix limits the effectiveness of the techniques employed by Sputnik [9].
Moreover, sparse matrix multiplication only involves data accesses and computations related to the irregularly distributed nonzeros. This causes tight data dependence between the sparse and dense matrices, which makes it challenging to hide the high latency of memory accesses to the dense matrix.
These observations motivate us to analyze the nonzeros distribution, as well as the specific computational pipeline on GPUs (see Section 3 for details). For the former, we found that nonzeros are unevenly distributed, with a majority of very short rows and a small minority of long rows that contain a large absolute number of nonzeros. This inspired us to optimize for the coexistence of these two scenarios, thereby avoiding the incomplete optimization observed in the above-mentioned works. For the latter, we discovered that even under balanced conditions, there is significant stall time in the warp processing pipeline on GPUs, indicating room for further improvement.
Intending to exploit the nonzeros distribution of the sparse matrix and improve the specific computing pipelines, we propose a row decomposition-based approach (named RoDe) to accelerate the SpMM and SDDMM kernels on GPUs. First, we decompose each row of the sparse matrix into two parts: a block part (containing a multiple of 32 nonzeros) and the residue. This strategy regularizes all rows and separates the long rows from the short ones for subsequent separate processing. Second, the block split technique is proposed to divide the nonzeros within each row into consecutive segments, which is conducive to load balancing when processing long rows. Finally, we propose the sub-block pipeline technique to construct a finer-grained pipeline that better overlaps computation and memory access to reduce latency.
To demonstrate the efficacy of RoDe, we compared it with three SOTA alternatives (i.e., Nvidia's cuSPARSE [22], Sputnik [9], and ASpT [11]) on two datasets. For the sparse matrices from the SuiteSparse dataset, RoDe achieves a speedup of up to 8.02× with a geometric mean of 1.45× for SpMM and a speedup of up to 16.7× with a geometric mean of 1.49× for SDDMM, compared to the best baseline. On the deep learning dataset, RoDe achieves an average speedup of 1.22× on SpMM and 1.20× on SDDMM. Moreover, our approach incurs, on average, only 16% of the SOTA's preprocessing overhead.

Background
2.1 Graphics Processing Units

Physically, GPUs are made up of an array of streaming multiprocessors (SMs). The threads in a GPU kernel are organized into a grid of thread blocks, and every 32 consecutive threads within a thread block form a warp. When a kernel is launched on the GPU for execution, the thread blocks are assigned to available SMs in a way that maximizes parallelism and minimizes resource conflicts. Within each SM, the GPU scheduler further divides the assigned thread blocks into warps. Once a warp is selected for execution, the SM's instruction scheduler fetches and issues instructions to the available execution units within the SM. GPUs can hide instruction latency by scheduling warps within the SM [14].
GPUs have a large but high-latency global memory that is accessible to all SMs, an L2 cache that is shared by all SMs, and L1 caches that are local to each SM. All threads within a thread block can communicate through fast shared memory and have access to thread-local registers. When multiple threads within a warp access consecutive global memory locations, GPUs can coalesce the accesses into a single transaction for efficiency. Each thread can also load multiple data items (e.g., 4 floating-point numbers at a time) using a single vector memory instruction.

Our approach uses the standard Compressed Sparse Row (CSR) representation, which is one of the most widely used data structures for sparse matrices [25][29]. As shown in Figure 1, the CSR structure is composed of three arrays: row_ptr, col_idx, and values. The value of row_ptr[i] is the starting index of row i in the arrays col_idx and values; values[j] holds the actual numerical value of a nonzero, and col_idx[j] holds its corresponding column index.

SpMM (Sparse Matrix-Matrix Multiplication) multiplies an input M × K sparse matrix A and an input K × N dense matrix B to form an M × N dense output matrix C, i.e., C = AB. Alg. 1 shows the sequential SpMM process using the CSR representation.

Algorithm 1: Sequential SpMM
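As a reference for the loop structure discussed below, a minimal sequential sketch in C, assuming the CSR arrays of Figure 1, a row-major B and C, and a zero-initialized C:

// Sequential CSR SpMM: C = A * B, where A is M x K (sparse, CSR),
// B is K x N (dense, row-major), and C is M x N (dense, zero-initialized).
void spmm_csr(int M, int N,
              const int* row_ptr, const int* col_idx, const float* values,
              const float* B, float* C) {
    for (int i = 0; i < M; ++i) {                           // i-loop: rows of A
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) { // j-loop: nonzeros of row i
            float a = values[j];
            int   c = col_idx[j];                           // column of the nonzero
            for (int k = 0; k < N; ++k)                     // k-loop: columns of B
                C[i * N + k] += a * B[c * N + k];           // accumulate into C[i][:]
        }
    }
}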
SDDMM (Sampled Dense-Dense Matrix Multiplication) computes the product of two dense matrices (B1 and B2), and the result is then scaled by a sparse matrix A (Hadamard product), i.e., C = A ⊙ (B1B2). Alg. 2 shows the sequential SDDMM process using CSR.
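A matching sequential sketch of Alg. 2, assuming B1 is stored M × K row-major and B2t holds B2 transposed (N × K row-major) so that the columns of B2 are contiguous; the output reuses the sparsity pattern of A:

// Sequential CSR SDDMM: C = A ⊙ (B1 * B2). C_values holds the output
// nonzeros in the same CSR layout as `values`.
void sddmm_csr(int M, int K,
               const int* row_ptr, const int* col_idx, const float* values,
               const float* B1, const float* B2t, float* C_values) {
    for (int i = 0; i < M; ++i) {                           // rows of A
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) { // nonzeros of row i
            int c = col_idx[j];
            float dot = 0.0f;
            for (int k = 0; k < K; ++k)                     // inner product of
                dot += B1[i * K + k] * B2t[c * K + k];      // B1 row i, B2 col c
            C_values[j] = values[j] * dot;                  // scale by the sample
        }
    }
}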
The outer i-loop in both SpMM and SDDMM traverses all rows of the sparse matrix A. In SpMM, the j-loop accesses each nonzero from the CSR representation of A and multiplies it with all elements (inner k-loop) of the appropriate row of B (the row corresponding to the column index of the nonzero element of A), accumulating into the elements of C[i][:]. In SDDMM, the j-loop identifies each nonzero from the CSR representation of A and scales it with the inner product of the corresponding row of B1 and column of B2 to form the output element.

A straightforward GPU implementation parallelizes the outermost i-loop, where each warp or thread block is assigned to process one row. Within each warp or thread block, the inner k-loop is parallelized, with each thread processing a column of the dense matrix. The threads processing the same row iterate over the nonzero elements in that row and accumulate the results, which corresponds to the j-loop.

Related Work
The Tensor Algebra Compiler (TACO) [17] is an open-source framework for optimizing sparse tensor computations. It applies a range of optimization techniques (including loop fusion, loop reordering, parallelization, and memory access optimizations tailored to sparse tensors) to transform high-level tensor expressions into efficient executable code. It can be used to generate high-performance SpMM and SDDMM kernels for GPUs.

cuSPARSE [22] is a library provided by NVIDIA that offers a collection of sparse matrix operations and algorithms optimized for NVIDIA GPUs. It supports several common sparse matrix formats (including CSR) and, for SpMM, offers two different modes depending on the access pattern (row- or column-major order) of the dense matrices.
ASpT [11] introduced an adaptive sparse tiling method to enhance data locality in SpMM and SDDMM. It first partitions the sparse matrix into row panels and reorders the columns within each row panel to create dense tiles consisting of dense columns (i.e., columns with numerous nonzeros). The tiling method is then applied to the dense tiles, while the regular method is employed for the remaining non-dense tiles.
Sputnik [9] proposed GPU kernels for SpMM and SDDMM in deep learning, where the sparsity is relatively low and the nonzeros are more uniformly distributed [8]. It introduced the reverse-offset memory alignment (ROMA) technique to pad the rows of the sparse matrix for alignment, enabling the use of vector memory instructions. Additionally, the row swizzle technique was introduced for load balancing.
There are also works that improve the performance of sparse matrix multiplication by using other representations or defining new representations for sparse matrices [1][27]. For example, FastSpMM [23] and MAGMA [2] used variants of ELLPACK [7] to improve performance. Such custom representations introduce more preprocessing time, as current mainstream applications and libraries use CSR. In recent years, there has also been work using the Tensor Core Units (TCUs) [12] of Nvidia GPUs to accelerate sparse matrix multiplication, such as vectorSparse [5] and Magicube [20]; these target structured sparsity and low-precision computation in deep learning.
In this paper, we focus on techniques to improve the performance of SpMM and SDDMM on GPUs, utilizing the CSR format for compatibility.

Nonzeros Distribution
Figure 2 shows the distribution of nonzeros in each row of the matrices from the SuiteSparse collection. The left subfigure shows the average row length (the number of nonzero elements per row) of each matrix. We can see that ❶ the average row length of most matrices is less than 32 (the red line), which means that it is often wasteful to use a GPU warp for one row. It is even worse in the case of using vector memory instructions per thread (128, the green line). The distribution of the length of each specific row is plotted in the middle subfigure: ❷ most of the rows are very short (84% of rows are shorter than 32), but there are also a number of rows with a length of several thousand or even tens of thousands. The right subfigure shows the distribution of rows within each matrix: small standard deviations indicate that most of the rows of a single matrix are concentrated around the average length, while relatively large ranges (the range being the difference between the maximum and minimum row lengths) indicate that a small number of rows differ significantly from the average. This tells us that ❸ many matrices consist mainly of short rows and contain a small number of long rows. For example, the matrix "rajat30" contains only 6 rows with over 10,000 nonzeros per row, whereas there are more than 60,000 rows each containing fewer than 1,000 nonzeros.
Property ❶ makes it difficult to fully utilize the computation or memory-access resources of the GPU and reduces the space available for some optimizations. For example, in ASpT [11], dense tiles are built by reordering the columns; however, the existence of many very short rows severely limits the number of dense tiles that can be built, which is why many of the matrices from the SuiteSparse collection show less than 10% speedup or even a slowdown [13]. Similarly, Sputnik [9] uses vector memory instructions, but their effectiveness is often limited by rows that are too short.
Property ❷ indicates that it would be beneficial to optimize long rows; that is, vector memory instructions and more threads per row are useful for such rows. Property ❸ points out that Properties ❶ and ❷ coexist in many matrices, implying that we should optimize for both simultaneously within a single computation. Moreover, the large ranges in Property ❸, i.e., the large gap between short and long rows in both number and length, indicate that "aggressive" load balancing is necessary.
Accordingly, we draw the following inspirations: it is essential to optimize both long and short rows at the same time to avoid incomplete optimization, and load balancing is crucial.

Computational Pipeline
Initially, sparse matrices and dense matrices are partitioned into blocks, each containing only a portion of the matrix's data. Figure 3 shows the timeline of the SpMM pipeline, which is divided into three main phases: loading the sparse block, loading the dense block, and computing. As shown in Algs. 1 and 2, the column indices col_idx[] within the sparse block are used to load the dense block, and the final computation requires the complete loading of the dense block. Consequently, these three phases exhibit tight data dependencies. To enable the reuse of sparse blocks among threads, it is customary for different threads to load distinct sections of the block and subsequently share the data through shared memory or shuffle instructions. This necessitates thread synchronization, as indicated by the "sync"s in the figure. This synchronization limits the optimization space of the compiler: it can only interleave memory accesses to dense blocks with computation to hide latency, because the preceding sparse memory accesses have to be synchronized. In addition, the limited data reuse in dense blocks and their larger data volume compared to sparse blocks make the situation worse. Although the GPU further hides instruction latency by scheduling warps, the pipeline often stalls during the loading of dense blocks.

To show this issue intuitively, we profiled the Sputnik SpMM kernel on matrices with a uniform row distribution (e.g., the matrix "wv2010"), which is the most friendly to its optimization strategy. Moreover, a uniform row distribution ensures that the profiling result is less affected by load imbalance. Table 1 shows the average number of cycles spent in different types of stalls during kernel execution. We can see that each warp spends an average of 21.86 cycles waiting for a scoreboard dependency on an L1TEX (global) operation, and these stalls occur on the compute instructions waiting for data from dense blocks. This implies the potential to introduce a software pipelining technique to issue more instructions or perform more computations within the latency of the current memory access.
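The dependence chain can be seen in the following skeleton, a simplified sketch of the common row-split scheme rather than any particular kernel (for brevity, blockDim.x equals TILE, every row length is a multiple of TILE, N is a multiple of blockDim.x, and bounds checks are omitted):

// A simplified row-split SpMM kernel illustrating the three phases and the
// synchronization that separates them.
__global__ void spmm_row_split(const int* __restrict__ row_ptr,
                               const int* __restrict__ col_idx,
                               const float* __restrict__ values,
                               const float* __restrict__ B,
                               float* __restrict__ C, int N) {
    constexpr int TILE = 32;
    __shared__ int   s_idx[TILE];   // column indices of the current sparse tile
    __shared__ float s_val[TILE];   // values of the current sparse tile
    int row = blockIdx.x;
    int col = blockIdx.y * blockDim.x + threadIdx.x;  // output column
    float acc = 0.0f;
    for (int base = row_ptr[row]; base < row_ptr[row + 1]; base += TILE) {
        // Phase 1: cooperatively load a sparse tile into shared memory.
        s_idx[threadIdx.x] = col_idx[base + threadIdx.x];
        s_val[threadIdx.x] = values[base + threadIdx.x];
        __syncthreads();            // required before any thread reads s_idx
        // Phase 2: load the dense tile. The addresses depend on s_idx, so these
        // long-latency global loads cannot be issued before the barrier above.
        for (int j = 0; j < TILE; ++j)
            // Phase 3: compute; warps frequently stall here waiting for the
            // dense element (the L1TEX scoreboard dependency of Table 1).
            acc += s_val[j] * B[s_idx[j] * N + col];
        __syncthreads();            // tile buffers are reused next iteration
    }
    C[row * N + col] = acc;
}

The barrier after the cooperative sparse load leaves the compiler no room to hoist the dependent dense loads, which is exactly where the L1TEX stalls of Table 1 accumulate.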

Overview
In this section, we provide an overview of our approach, Row Decomposition (RoDe), designed to accelerate sparse matrix multiplication using the CSR representation on GPUs. The specific GPU kernel implementations of SpMM and SDDMM will be given in the next two sections.
The complete workflow of RoDe is illustrated in Figure 4. Initially, we decompose the rows of the sparse matrix into the block part (a) and the residual part (b), each of which is processed separately. Next, the block split method is applied to the block part for load balancing. Finally, we construct the sub-block pipeline to perform more operations between consecutive synchronizations, hiding memory access latency better.

Figure 4. The workflow of RoDe: Row Decomposition, Block Split, and Sub-block Pipeline.

Row Decomposition
Based on Property ❸ in Section 3, it is intuitive to consider separating long and short rows for individual processing.
Instead of dividing all rows into two groups based on length, we break each row into a block part (containing a multiple of 32 nonzeros) and a residual part. These two kinds of parts are then separately combined for independent processing. Figure 5 illustrates an example of this first step. We use two auxiliary arrays for each part to implement the row decomposition: the array Row_Indices stores the row index of each part in the sparse matrix, and Start_Offsets stores the index of the starting element of each row in each part. In the practical implementation, the latter are not strictly necessary, as they can be inferred at runtime (they are introduced here for presentation convenience). Equations (1) and (2) show the specific calculation process.

For row i, with row_ptr the CSR row-pointer array:

$$Start\_Offsets_{block}[i] = row\_ptr[i] \qquad (1)$$

$$Start\_Offsets_{residual}[i] = row\_ptr[i] + \left\lfloor \frac{row\_ptr[i+1] - row\_ptr[i]}{32} \right\rfloor \times 32 \qquad (2)$$
The aforementioned auxiliary arrays facilitate the separate processing of the two parts, wherein each part is handled by a distinct kernel with its own thread configuration.For two parts of the same row, the results are accumulated to the final output using the atomicAdd instruction.
By decomposing the rows into block parts and residual parts, we can optimize for Properties ❶ and ❷ separately, avoiding complex branching code and simplifying the implementation. Specifically, we use a kernel with larger thread blocks to handle the block parts and a kernel with smaller thread blocks for the residual parts. In addition, vector memory instructions are employed for the former, and residue unrolling is performed for the latter.
As a whole, row decomposition enables the two parts of each row to be processed separately, leading to improved utilization of GPU resources and little redundant code between the two classes of kernels. In contrast, although ASpT also uses multiple kernels, they handle dense and sparse tiles separately and thus cannot avoid inadequate processing of residues as well as redundant code.

Block Split
Properties ❷ and ❸ indicate that the lengths of different rows within a single matrix can differ significantly, which implies that load balancing is necessary. Row swizzle in Sputnik decouples load balancing from the parallelization scheme but is not efficient for sparse matrices whose row lengths span large ranges. The sparse iteration space transformation in TACO [24] collapses loops and splits them evenly for load balance, but the overhead of the binary search for row-index location is substantial.
Therefore, we introduce the block split method based on the aforementioned row decomposition. The specific process is illustrated in Figure 6: each row in the block part is further divided into segments of a fixed length S. The array Row_Indices contains the row index of each segment, while the array Start_Offsets stores the index of the first element within each segment. Next, we treat each segment as an independent row for processing, and the results of each segment are accumulated into the final output using the atomicAdd instruction. Although the atomic instruction introduces additional overhead, it is acceptable when an appropriate value of S is selected to achieve a favorable compromise with load balancing (we typically set S to 512 in practice). For example, with S = 512, the block part of a row with 1,312 block-part nonzeros is divided into segments of 512, 512, and 288 nonzeros. The residual part, on the other hand, does not require this method, as it is already small enough.
After block splitting, the elements within each segment are consecutive, which facilitates coalesced memory access by the threads on the GPU. This approach also decouples load balancing from the specific parallel strategy, thanks to the implementation based on the two arrays. What's more, generating the two arrays only requires scanning the row_ptr array of the CSR representation, which brings very little overhead.

Sub-block Pipeline
As discussed in Section 3.2, the common load-compute pipeline can suffer from performance limitations due to the long latency of global memory accesses. To address this issue, we introduce the sub-block pipeline method to better overlap shared memory accesses, global memory accesses, and computation. The principle behind our method is to issue as many instructions or perform as many computations as possible within the latency of the ongoing memory access.
Figure 7 illustrates our sub-block pipeline method. To ensure data reuse, we still load sparse blocks into shared memory, while dense blocks are loaded directly into registers. Since a nonzero's column index (col_idx) is needed to fetch the corresponding row of the dense matrix, the synchronization before loading the dense block is unavoidable. Thus, while the compiler does try to interleave memory instructions with compute instructions to hide instruction latency, the synchronization instructions restrict the available interleaving space.
Fortunately, we note that there is a significant disparity in data volume between sparse and dense blocks, with a ratio of 1 : N in SpMM (N is the width of the dense matrix). This observation motivates us to further subdivide the dense blocks for improved pipelining. For the example in Figure 7, we divide a dense block (dn[0-1]) into two sub-blocks (dn[0] and dn[1]). The specific pipeline process is as follows: we initially load sub-block dn[0], and then compute on dn[0] while simultaneously loading sub-block dn[1]. Furthermore, during the computation on the last sub-block dn[1] of this stage, we load the sparse block sp[2-3] of the subsequent pipeline stage. In summary, our method constructs a more refined pipeline, in which we integrate the loading of sparse blocks into the interleaving of computation and memory accesses, effectively expanding the overlap space for instructions. Although our method uses more registers and shared memory, resulting in a slight reduction in occupancy, the advantages gained from the new pipeline outweigh this decrease. The impact of our method is demonstrated in Table 2: the stall time is significantly reduced.
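The following schematic shows one stage of this pipeline in the spirit of Figure 7. It is a sketch, not the actual listing: kTile and kHalf are illustrative sizes, and the surrounding code that sets up the double-buffered shared tiles s_idx/s_val, the per-thread output column col, the accumulator acc, the lane index lane, and the tile cursor next is omitted.

// Schematic main loop with the sub-block pipeline (block-part SpMM kernel).
constexpr int kTile = 8, kHalf = kTile / 2;
float dn[2][kHalf];                          // register sub-blocks dn[0], dn[1]
int buf = 0;
for (int j = 0; j < kHalf; ++j)              // prologue: first dense sub-block
    dn[0][j] = B[s_idx[buf][j] * N + col];
for (int t = 0; t < num_tiles; ++t) {
    if (t + 1 < num_tiles) {                 // prefetch the NEXT sparse tile into
        s_idx[buf ^ 1][lane] = col_idx[next + lane];  // the other shared buffer;
        s_val[buf ^ 1][lane] = values[next + lane];   // no barrier is needed yet
    }
    for (int j = 0; j < kHalf; ++j)          // issue loads for sub-block dn[1]...
        dn[1][j] = B[s_idx[buf][kHalf + j] * N + col];
    for (int j = 0; j < kHalf; ++j)          // ...while computing on dn[0]
        acc += s_val[buf][j] * dn[0][j];
    for (int j = 0; j < kHalf; ++j)          // compute on dn[1]; the prefetched
        acc += s_val[buf][kHalf + j] * dn[1][j];      // loads are still in flight
    __syncthreads();                         // one barrier per stage (not two)
    buf ^= 1;
    if (t + 1 < num_tiles) {
        next += kTile;                       // advance within the segment
        for (int j = 0; j < kHalf; ++j)      // prologue for the next stage
            dn[0][j] = B[s_idx[buf][j] * N + col];
    }
}

Compared with the skeleton in Section 3.2, each stage now needs only one barrier, and both the dense sub-block loads and the next sparse-tile loads are issued under the latency of ongoing computation.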

Preprocessing

Algorithm 4: Parallel Preprocessing. Input: row_ptr of the sparse matrix. Output: Row_Indices, Start_Offsets. Each thread first gets its thread index and then fills in the metadata entries of its rows.
The purpose of preprocessing is to extract the metadata required by row decomposition and block split, and the process is very simple. It involves scanning the row_ptr array of the CSR representation only once and populating the corresponding values in the Row_Indices and Start_Offsets arrays; the computational complexity is O(M), where M is the number of rows of the sparse matrix. This process can be easily parallelized on GPUs, as illustrated in Alg. 4.
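A sketch of such a kernel (with hypothetical names; the real Alg. 4 may organize the work differently). It assumes one thread per row, a segment length S, and that seg_ofs[] already holds the exclusive prefix sum of the per-row segment counts, computed beforehand with a standard scan; compaction of rows with an empty residue is elided with a -1 sentinel:

// Hypothetical parallel preprocessing: fill the Row_Indices / Start_Offsets
// arrays of both parts from row_ptr in a single pass.
__global__ void rode_preprocess(int M, int S,
                                const int* __restrict__ row_ptr,
                                const int* __restrict__ seg_ofs,
                                int* blk_row_idx, int* blk_start_ofs,
                                int* res_row_idx, int* res_start_ofs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (i >= M) return;
    int begin = row_ptr[i];
    int len = row_ptr[i + 1] - begin;
    int blk_len = (len / 32) * 32;                  // 32-aligned block part
    // Block part: one metadata entry per segment of at most S nonzeros.
    int out = seg_ofs[i];
    for (int s = 0; s < blk_len; s += S) {
        blk_row_idx[out] = i;
        blk_start_ofs[out] = begin + s;
        ++out;
    }
    // Residual part: fewer than 32 nonzeros, at most one entry per row.
    res_row_idx[i] = (len > blk_len) ? i : -1;      // -1 marks an empty residue
    res_start_ofs[i] = begin + blk_len;
}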

SpMM Kernel
In this section, we detail the design of our SpMM kernel with the above RoDe-based approach.
Similar to ASpT and Sputnik, our kernel employs a hierarchical thread mapping that follows a row-splitting scheme [30]. (Some articles use the term "decomposition" to refer to "thread mapping"; to avoid confusion with row decomposition, the term "thread mapping" is used here.)
The detailed process is depicted in Figure 8, where the dark parts indicate the computations handled by one thread block at a time. Different thread blocks handle distinct rows of the sparse matrix and distinct columns of the dense matrix. The main computational loop iterates along the N-axis while the other axes remain fixed. We assign each row to a subwarp consisting of 8 threads, and one thread block handles 4 rows (i.e., kBlockItemsY = 4), as depicted in the figure. To promote data reuse, we load the indices and values of the sparse matrix into shared memory. Concurrently, vectors from the dense matrix are directly loaded into thread-local registers, and the accumulated results are also stored in registers.

In the block part, we divide each row into segments using the block split method described in Section 4.2, treating each segment as an individual row (which means more concurrent threads handle the same rows). Since each row in the block part is highly regular, we employ the ROMA technique from Sputnik to enable vector memory instructions on the sparse matrix, enhancing the efficiency of memory access. Listing 1 shows the detailed implementation of the kernel for the block part. Lines 1-3 calculate the tile indices to implement the thread mapping in Figure 8. The row decomposition and block split methods described in Section 4 are implemented in lines 4-7. Before the main loop, each subwarp loads the initial data block from its sparse rows. Then, the sub-block pipeline is performed in each iteration, as shown in lines 10-27. Finally, line 29 accumulates the results into global memory using the atomicAdd instruction.
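For illustration, the tile-index arithmetic of lines 1-7 amounts to something like the following sketch (apart from kBlockItemsY, the names are our assumptions rather than the kernel's actual identifiers):

// Map this thread onto a segment of the block part and an output column range.
int m_idx  = blockIdx.x;                        // which group of segments
int n_off  = blockIdx.y * kBlockItemsX;         // first output column of the block
int subwrp = threadIdx.x / 8;                   // 8-thread subwarp within the block
int seg    = m_idx * kBlockItemsY + subwrp;     // segment handled by this subwarp
int row    = blk_row_idx[seg];                  // row decomposition (Section 4.1)
int start  = blk_start_ofs[seg];                // block split (Section 4.2)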
Listing 2 presents the implementation of our kernel for the residual part. Since the number of nonzeros per row here is less than 32, the loop in this kernel can be unrolled. To reduce branching instructions, we split the main loop into inner and outer levels following the residue unroll technique in Sputnik, as shown in lines 7-17. Lines 18-22 aim to minimize the usage of atomic instructions. Moreover, in both kernels, we make use of vector memory instructions for the dense matrix whenever possible. This allows one of our subwarps to fetch 32 elements at a time, significantly enhancing memory access efficiency.

SDDMM Kernel

The thread mapping for the SDDMM kernel is similar to that of SpMM, with the difference that the main loop iterates over the K-axis of the two dense matrices, while the position in the sparse matrix remains fixed. The process is shown in Figure 9. The row decomposition and block split methods in SDDMM are also akin to those in SpMM. In contrast to SpMM, in SDDMM each thread block loads only one block from the sparse matrix and subsequently performs computations with the corresponding blocks of the dense matrices. In each iteration of the loop, each thread loads into registers its share of the current K-tile of a single row of the dense matrix B1, and its share of the K-tiles of the columns of B2 selected by the sparse block's column indices, the latter in an interleaved manner to coalesce memory access. So, finally, we need to swap and reduce the results between the threads using the shuffle instruction.
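For instance, with each of the 8 threads of a subwarp holding a partial inner product, the reduction can be done with a butterfly of shuffles; a minimal sketch (the swap that distributes different nonzeros' results across threads is omitted):

// Reduce partial dot products across an 8-thread subwarp; after the loop,
// every thread of the subwarp holds the full sum.
__device__ float subwarp_reduce8(float partial) {
    for (int offset = 4; offset > 0; offset >>= 1)
        partial += __shfl_xor_sync(0xffffffffu, partial, offset);
    return partial;
}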
Listing 3 shows the main loops of the SDDMM kernels. The process is basically the same for both parts, except for the specific operations performed in lines 1 and 17. The use of vector memory instructions is also the same as in our SpMM kernels.

Evaluation
In this section, we provide the experimental evaluation of our RoDe-based approach for SpMM and SDDMM kernels.
7.1 Experimental Setup

7.1.1 Environment. We conducted our experiments on an Nvidia A100-PCIE-40GB GPU with CUDA 11.2. The CUDA code was compiled using NVCC 11.1 with the -O3 flag. The kernel execution time and preprocessing time were analyzed and compared separately. All tests were run consecutively ten times to obtain average values.
7.1.2 Benchmarks. In the evaluations of Sections 7.1 to 7.5, we selected more than 900 matrices with at least 10K rows, 10K columns, and 100K nonzeros from the SuiteSparse collection [6], the same selection as in ASpT. The matrices in SuiteSparse come from diverse application domains, including scientific computing and graph processing/mining, and encompass a wide range of sparsity patterns.
In Section 7.6, the DLMC (Deep Learning Matrix Collection), a dataset tailored for deep learning tasks [8], is used. It comprises over 3000 sparse matrices commonly used in deep learning applications. These matrices are relatively small (with dimensions of a few hundred or thousand) and have a more uniform distribution of nonzeros [9], which is very different from SuiteSparse.

7.1.3 Baseline. For SpMM, we compare our RoDe-based approach with Nvidia cuSPARSE [22], ASpT [11], and Sputnik [9]. ASpT is claimed to be the SOTA for the SuiteSparse collection, while Sputnik performs better on the DLMC. For SDDMM, we compare our approach with ASpT and Sputnik as well. The kernels were evaluated in single (FP32) and double (FP64) precision, with the width of the dense matrices set to 32 and 128 (K = 32, 128), the same configuration as in ASpT. Only FP32 was used for the SDDMM comparison, since ASpT has no FP64 implementation for SDDMM. We do not evaluate MAGMA [2] and TACO [17], as both have been outperformed by ASpT.

SpMM
The SpMM performance comparison under different configurations is shown in Figure 10. As depicted in the figure, for each of the four configurations, although the other three implementations have their respective advantages, our approach consistently outperforms them across the majority of matrices. Table 3 presents a summary of the four configurations. The "speedup" is defined as the ratio of GFLOPs relative to the best baseline; a value less than 1 indicates a slowdown. The results show that our approach achieves acceleration on the majority of matrices (85.1%, 96%, 73.9%, and 65.7%, respectively), with a slight slowdown observed on a small portion. Table 4 shows the overall speedups.
In summary, RoDe achieves a speedup of up to 32.17×, with a geometric mean of 1.91× over the cuSPARSE baseline; a speedup of up to 198.51×, with a geometric mean of 1.83× over Sputnik; a speedup of up to 8.02×, with a geometric mean of 1.45× over ASpT.
To explain the above speedups, we profile the kernels on the matrix "mip1", which contains both long and short rows, representing the sparsity patterns of the majority of matrices. We use "Execution Time", "Memory Throughput", "Warp Cycles Per Executed Instruction", and "Executed Instructions" as the criteria for analysis. The "Memory Throughput" reflects the utilization of GPU memory resources and is a significant performance factor. The "Warp Cycles Per Executed Instruction" reflects the latency between two consecutively issued instructions.

The profiled results are listed in Table 5. As ASpT's performance is closest to ours, we focus on it for analysis and comparison. ASpT is composed of three subkernels, while we use two subkernels; we highlight the dominant one of each in bold. Compared to the others, our RoDe-based approach achieves the highest memory throughput. This is attributed to the fact that the threads in both the block kernel and the residual kernel have been assigned an appropriate workload to maximize memory access efficiency. Moreover, both the largest and the smallest warp cycles are observed in RoDe. This is because the irregular part, i.e., the least GPU-friendly part, is assigned to the residual kernel, so as not to affect the regular computation of the block kernel; in addition, the block kernel has been optimized with our sub-block pipeline strategy. Finally, due to the use of vector memory instructions, our implementation executes fewer instructions.

SDDMM

Figure 11 shows the SDDMM performance comparison. Our approach also outperforms ASpT and Sputnik on the majority of matrices, as shown in Table 6. The reason for Sputnik's poor performance is that its SDDMM kernel is primarily parallelized for the deep learning scenario of sparse matrices with few rows, whereas the rows of matrices in scientific computing are much longer. So, in the subsequent analysis, we omit Sputnik. We also profile the SDDMM kernel on the matrix "mip1"; Table 7 shows the results. We observe trends similar to those in SpMM, indicating that our approach achieves similar effects here. The increase in warp cycles in the block kernel (the corresponding value for SpMM is 10.69) is due to less sparse-block loading and more dense-block loading, resulting in a reduced pipelining optimization space. This also leads to a larger difference in executed instructions between our kernels and ASpT's.

7.4 Preprocessing Analysis

cuSPARSE, Sputnik, ASpT, and our RoDe-based approach all take CSR as the input format. Except for cuSPARSE, the remaining three require preprocessing, primarily involving scanning the sparse matrix to construct additional metadata, which introduces extra space and time overheads. Sputnik requires sorting the rows of the sparse matrix based on the number of nonzeros and storing the result in an index array. As a result, it requires only M additional space (M denotes the number of rows of the input sparse matrix), and the computational complexity of its preprocessing is O(M log M). ASpT first divides the rows into row panels and then reorders the columns within each panel; the actual implementation introduces additional space on the order of the number of nonzeros plus the number of columns, and a computational overhead dominated by sorting the columns of each panel, i.e., O(N log N) per panel, where N denotes the number of columns of the sparse matrix and nnz the number of nonzeros. The preprocessing of RoDe is similar to computing a prefix sum, requiring only one scan of the rows. It introduces an additional space overhead of M + 2·nnz/S and takes O(M) time, where S is the segment length of each row in the block part.

Figure 12 shows the actual execution time (on the GPU) of the preprocessing for the different methods. As shown in the figure, our preprocessing overhead is significantly lower, averaging only 16% of that of the other two methods, and remains consistent across different precisions and dense matrix scales. While our preprocessing requires slightly more additional space than Sputnik's, it is still much smaller than ASpT's. This makes it suitable for various applications, including neural network training [9].

Ablation Test
Table 8. Ablation test for our SpMM and SDDMM kernels. Performance is measured as a percentage of the performance of the complete kernels.
Stage                  SpMM     SDDMM
Row decomposition      62.1%    −
Block split            92.5%    96.3%
Sub-block pipeline     100.0%   100.0%

We ran ablation experiments to evaluate the effect of each stage of our approach; the results with K = 128 and FP64 precision are shown in Table 8. Other configurations achieved similar results.
The first stage, row decomposition, primarily serves as a technique that enables the subsequent optimizations and therefore does not offer very high performance on its own. Moreover, as the SDDMM kernel cannot achieve good parallelism with the first stage alone, the corresponding entry in the table is not available. The second stage, block split, primarily addresses the load imbalance issue and delivers the majority of the performance improvement. The final stage aims to further enhance performance. Compared with SpMM, the improvement for SDDMM in this stage is small. The reason is that each thread in SDDMM loads the sparse block only once, which leaves less pipeline optimization space for interleaving the loads of sparse and dense blocks.

Tests on DLMC
For DLMC with FP32 and K = 128, the results show that RoDe achieves an average speedup of 1.22× on SpMM and 1.20× on SDDMM compared to Sputnik, which is optimized for DLMC and outperforms ASpT and cuSPARSE there. If preprocessing is taken into account, RoDe's advantage becomes even larger.

Conclusion
This paper analyzes the distribution of nonzeros across the rows of sparse matrices and identifies the challenges in the sparse matrix multiplication pipeline on GPUs. Accordingly, we leverage row decomposition to decouple the regular parts from the residual parts, and we introduce new load balancing and fine-grained pipelining techniques for further optimization. As a result, on datasets from both numerical computing and deep learning, our SpMM and SDDMM kernels remarkably outperform the SOTA works, i.e., cuSPARSE, ASpT, and Sputnik. Furthermore, we use CSR as the input format, and the preprocessing overhead is much smaller.

Figure 1. CSR representation of a sparse matrix.

Figure 2. Distribution of nonzeros across rows. The data were collected from the matrices of the SuiteSparse collection.

Figure 5. An example of Row Decomposition: (a) is the block part and (b) is the residual part. The dashed border indicates data that is not needed in the actual implementation.

Figure 6. An example of Block Split.

Figure 8. Thread mapping for the SpMM kernel. The left part A represents the sparse matrix and the top part B represents the dense matrix.

Listing 1. The implementation of our block-part kernel.

Figure 9. Thread mapping for the SDDMM kernel. The blocks in dense matrices B1 and B2 undergo inner-product operations, and the results are then scaled by the corresponding block of sparse matrix A.

Figure 10. SpMM performance results. The matrices are sorted in ascending order by the number of nonzeros, and each point in the figure represents the average GFLOPs over a contiguous set of 5 matrices.

Figure 12. Preprocessing time. Note that the Y-axis is on a logarithmic scale.

Table 1. Warp State Statistics. Data was collected from Nvidia Nsight Compute.

Table 2. Warp State Statistics when using our new pipeline strategy. Data was obtained from profiling our block kernel; the matrix used is the same as in Table 1.

Table 4. Overall speedups of SpMM. "G-mean" indicates the geometric mean and "Max" indicates the maximum speedup.