Spatula: A Hardware Accelerator for Sparse Matrix Factorization

Solving sparse systems of linear equations is a crucial component in many science and engineering problems, like simulating physical systems. Sparse matrix factorization dominates a large class of these solvers. Efficient factorization algorithms have two key properties that make them challenging for existing architectures: they consist of small tasks that are structured and compute-intensive, and sparsity induces long chains of data dependences among these tasks. Data dependences make GPUs struggle, while CPUs and prior sparse linear algebra accelerators also suffer from low compute throughput. We present Spatula, an architecture for accelerating sparse matrix factorization algorithms. Spatula hardware combines systolic processing elements that execute structured tasks at high throughput with a flexible scheduler that handles challenging data dependences. Spatula enables a novel scheduling algorithm that avoids stalls and load imbalance while reducing data movement, achieving high compute utilization. As a result, Spatula outperforms a GPU running the state-of-the-art sparse Cholesky and LU factorization implementations by gmean 47× across a wide range of matrices, and by up to thousands of times on some challenging matrices.


INTRODUCTION
Solving sparse systems of linear equations, i.e., finding x such that Ax = b when A is a sparse matrix and b is a vector, is a fundamental problem in scientific computing [8,22,35,36]. Solving sparse linear equations is the dominant computation across many applications, including simulating circuits [7,18] and physical systems [32], computational fluid dynamics [59], and optimization [5,31]. As a result, supercomputers spend a substantial fraction of time on solvers [4,42].
Matrix factorization (Section 2) is the dominant component of direct solvers. Matrix factorization decomposes A into two matrices with a particular structure that makes solving Ax = b easy (e.g., A = LU, where L is lower-triangular and U is upper-triangular). When A is dense, matrix factorization algorithms are regular and compute-intensive, and existing accelerators like GPUs achieve high efficiency [27]. But A is often highly sparse, which causes poor performance on GPUs and CPUs. For example, on a large circuit matrix (FullChip) [16] that has 0.0003% nonzeros, the state-of-the-art GPU factorization algorithm STRUMPACK [22] achieves only 0.3 GFLOP/s on an NVIDIA V100 GPU, just 0.004% of its peak floating-point throughput. STRUMPACK suffers this dismal utilization despite using sophisticated algorithms and schedules that seek to use GPUs as best as possible [1].
Sparsity in linear solvers is unavoidable, because it arises from problem structure. For example, consider circuit simulation: a circuit may have millions of nodes, but each node is connected to only a handful of other nodes. Therefore, many problem domains will always yield highly sparse matrices. This is different from applications like deep learning, where sparsity is induced as an optimization (e.g., by pruning) and can be shaped or controlled [64].
To tackle this challenge, we present a hardware accelerator for sparse matrix factorization algorithms, including Cholesky and LU. These algorithms have two key properties that thwart GPUs, CPUs, and more specialized sparse accelerators [29,45,57,68]. First, sparse matrix factorization algorithms contain long chains of dependences among tasks, which are hard to schedule. Since GPUs lack fine-grained control over scheduling, they suffer from inefficient schedules that cause load imbalance and force excessive data movement. Prior sparse linear algebra accelerators are also insufficient: most focus on specific kernels, such as sparse matrix-sparse matrix multiplication (SpMSpM) [50,68,69], and even more flexible ones like ExTensor [29] and Tensaurus [57] target loop nests that cannot express these complex data dependences.
Second, sparse factorization is nonetheless dominated by structured compute operations on smaller matrices that can be effectively accelerated at very high throughput using systolic arrays. No prior architecture can handle this combination of sparse and structured features: the vector datapaths and tensor cores of GPUs are a good match for structured compute, but dependences cause terrible utilization. By contrast, prior sparse linear algebra accelerators focus on unstructured, memory-intensive problems [45,67,68] and lack the compute throughput or design to handle structured operations.
To make these challenges concrete, Section 2 presents the necessary background on sparse matrix factorization algorithms (Cholesky and LU), and Section 3 details why current architectures handle them poorly, including a performance characterization of these algorithms on CPUs and GPUs.
Based on these insights, we present Spatula, a novel architecture designed to accelerate sparse matrix factorization. Spatula hardware (Section 4) combines the features needed by these algorithms: its processing elements (PEs) are systolic arrays that implement the primitive tasks in factorization workloads efficiently and at high throughput; and a programmable scheduler orchestrates execution, dispatching tasks to PEs and cheaply tracking frequent data dependences among tasks. Spatula hardware features a cache-based memory hierarchy that captures the irregular reuse in this workload, and decoupled-execution mechanisms to fetch data ahead of its use and avoid memory-induced stalls.
To leverage Spatula hardware, we present a novel scheduling algorithm (Section 5) that efficiently schedules sparse matrix factorizations with a wide variety of sparsity patterns. Different sparsity patterns result in fundamentally different computation graphs, presenting different tradeoffs between parallelism and memory footprint. Our approach leverages fine-grained task scheduling, multi-level tiling, and memory-aware scheduling to achieve high utilization while minimizing data movement.
We evaluate Spatula on both sparse Cholesky factorization and sparse LU factorization using a combination of simulation and RTL synthesis (Section 6, Section 7). Spatula outperforms state-of-the-art matrix factorization algorithms on a V100 GPU by gmean 61× on Cholesky and 36× on LU, across a diverse set of matrices from many application domains including circuit simulation, structural analysis, fluid dynamics, and convex optimization. Speedups over a 32-core server CPU are gmean 213× on Cholesky and 33× on LU. The evaluated configuration of Spatula has an area of 108 mm² and consumes 146 W on average when synthesized in 12-14 nm technology, significantly less than the GPU and CPU baselines.
In summary, we make the following contributions: (1) We identify the key features of sparse matrix factorization and its inefficiencies on current architectures (Section 3). (2) We propose the first hardware architecture that achieves high performance on sparse matrix factorization (Section 4). (3) We design a novel scheduling algorithm enabled by this hardware that achieves high utilization (Section 5). (4) We perform a detailed evaluation of our proposed architecture, showing order-of-magnitude improvements in performance and energy efficiency (Section 6, Section 7).

BACKGROUND
Solving systems of linear equations Ax = b is a key primitive in many scientific computing applications [5,7,18,31,32,59]. Solvers can be direct, if they find x directly given A and b, or iterative, if they start from an approximate x and iterate until finding an exact solution. Efficient direct solvers rely on factoring matrix A, i.e., decomposing it into a set of matrices with a certain structure that makes Ax = b easy to solve. In this paper, we focus on Cholesky and LU factorization, the most efficient and widely used techniques for square matrices [24] (e.g., MATLAB adaptively chooses among these algorithms when solving systems of linear equations [15]).
Cholesky factorization is simpler and more efficient than LU, but it requires the A matrix to have a particular structure: it needs to be symmetric (i.e., A = Aᵀ) and positive-definite (i.e., for all non-zero column vectors x, xᵀAx must be positive). Cholesky finds a lower-triangular matrix L such that A = LLᵀ. This enables solving Ax = b via two triangular solves: Ly = b → Lᵀx = y. Triangular matrices are simple to solve via substitution: when A is a dense n × n matrix, triangular solves are O(n²), whereas factorization is O(n³) and dominates performance. Factorization also dominates performance when the matrix is sparse.
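To make the substitution step concrete, here is a minimal sketch (ours, not from the paper) of a dense forward substitution solving Ly = b; the backward solve for Lᵀx = y is symmetric, and the doubly nested loop is what makes each solve O(n²):

import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for y, where L is dense lower-triangular.

    Illustrative sketch only: each y[i] depends on previously computed
    entries, so the loop nest does O(n^2) total work.
    """
    n = L.shape[0]
    y = np.zeros(n)
    for i in range(n):
        # Subtract contributions of already-solved unknowns.
        s = b[i] - L[i, :i] @ y[:i]
        y[i] = s / L[i, i]
    return y

# Solving A x = b with A = L L^T then takes two such solves:
#   L y = b  (forward), followed by  L^T x = y  (backward).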
LU factorization instead finds a lower-triangular matrix L and an upper-triangular matrix U such that A = LU. Like Cholesky, this enables solving Ax = b through triangular solves, but it does not require A to be symmetric positive-definite.
Real-world A matrices are often extremely sparse. These matrices typically arise from discretizing equations, where each entry represents an interaction between two variables. In circuit simulations, each variable represents a circuit node [17], while for partial differential equations on meshes, each variable represents a node on the mesh [33,58]. Each node has a small number of neighbors, irrespective of the overall system size. Consequently, matrices arising from larger simulations become increasingly sparse.
Since Cholesky factorization is simpler, in this section we use Cholesky to introduce the algorithmic techniques and optimizations in sparse matrix factorization. Section 2.4 discusses the differences between Cholesky and LU.

Basic Cholesky factorization
Listing 1 shows the code for a basic in-place implementation of Cholesky. L is a lower-triangular matrix, initialized with the lower triangle of A (because A is symmetric, only its lower triangle is needed). Each outer-loop iteration in Listing 1 modifies L and produces a single column of the output in place; after execution, L contains the final factor.
Specifically, the k-th outer-loop iteration in Listing 1 consists of two distinct activities. First, lines 2-5 overwrite the k-th column of L, producing its final value. Second, lines 9-11 update columns j > k by computing the outer product of the k-th column of L with itself, and subtracting this outer product from the rest of the matrix. Figure 1 shows these steps in detail for a sparse matrix.
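Since Listing 1 itself is not reproduced here, the following sketch approximates a basic right-looking, in-place Cholesky; the structure (finalize column k, then apply its outer-product update to the trailing matrix) matches the description above, though the line numbers do not correspond to the paper's listing:

import numpy as np

def cholesky_in_place(L):
    """Right-looking, in-place Cholesky sketch approximating Listing 1.

    L initially holds the lower triangle of A; on return it holds the
    Cholesky factor.
    """
    n = L.shape[0]
    for k in range(n):
        # (1) Finalize column k: take the pivot's square root and scale.
        L[k, k] = np.sqrt(L[k, k])
        L[k+1:, k] /= L[k, k]
        # (2) Outer-product update: subtract column k's contribution
        #     from the remaining lower-triangle columns j > k.
        for j in range(k + 1, n):
            L[j:, j] -= L[j:, k] * L[j, k]
    return L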
This basic loop nest shows two key features of Cholesky. First, data dependences are frequent: each iteration updates the remainder of matrix L, so each output element incorporates contributions from many inputs. Second, outer products dominate performance: if the k-th column of L has m nonzeros, updating the column takes O(m) operations, whereas outer-product updates take O(m²).

Challenges of sparse factorization
Sparse linear algebra algorithms must use compressed formats that leverage sparsity by representing only nonzero values. While they are space-efficient, compressed formats limit the operations that can be efficiently performed [9,60]. A key challenge in sparse matrix factorization is using a compressed format that allows all necessary operations. Typical sparse matrix formats like compressed sparse row/column (CSR/CSC) work poorly in Listing 1. Consider using CSC, which stores each column as a sorted list of nonzero coordinates and their values. CSC makes sequential traversals of columns efficient, so the updates to the k-th column (lines 5-6) are efficient. But the outer-product updates (lines 9-11) would be extremely difficult, because they require updating individual values scattered throughout the matrix structure. These updates would require expensive searches in CSC, and often introduce new nonzeros, which would require rewriting the CSC structure.
The new nonzeros introduced by outer-product updates are known as fill-in. For example, in Figure 1c, (6, 5) and (8, 5) become nonzero. It is common for the final matrix L to have substantially more nonzeros than A, e.g., 10-150× is typical in our experiments. But since A is highly sparse (e.g., with a fraction 10⁻⁵ of nonzeros), so is L, and storing L uncompressed would be very inefficient.
Prior work makes the key observation that outer-product updates have substantial structure. Figure 3 shows that the outer product of a sparse vector v with itself produces nonzeros at all points in nz(v) × nz(v), where nz(v) denotes v's nonzero coordinates. This structure can be captured with a format that we call Compressed Cartesian Square (CSQ), shown in Figure 3: an n × n CSQ consists of n² values and only n coordinates, which denote the row (and column) coordinates of the nonzeros. (Prior work uses this format but does not give it a name; we name it for clarity. The name comes from the fact that the matrix's nonzero coordinates are the Cartesian square, i.e., the Cartesian product with itself, of the nonzero coordinates of v.) When we compute the outer product of v with itself, the resulting CSQ will be symmetric, so we only need to store its lower triangle.
Efficient sparse Cholesky implementations represent L using multiple CSQ matrices, which are updated over time. Multiple matrices are needed because different outer-product updates have different sparsity patterns. Since nonzeros added by outer-product updates dominate the initial nonzeros of L, this representation consists almost entirely of nonzero values, leveraging sparsity. Moreover, the CSQ format enables efficient computation of outer products: since nonzeros are stored contiguously, this is equivalent to computing the outer product of a small dense vector, and can be performed efficiently with a vector processor or a systolic array.
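As an illustration, here is a minimal sketch of the CSQ idea (class and function names are ours, not the paper's): a CSQ stores a dense block of values plus one list of global coordinates, and an outer-product update whose coordinates are a subset of the CSQ's lands on a contiguous sub-block, so it maps directly onto a dense rank-1 update:

import numpy as np

class CSQ:
    """Sketch of the Compressed Cartesian Square format (names assumed).

    Stores a dense n x n block of values plus the n global row/column
    coordinates of its nonzeros; for symmetric factors only the lower
    triangle is meaningful.
    """
    def __init__(self, coords):
        self.coords = np.asarray(coords)   # sorted global row/col indices
        n = len(coords)
        self.values = np.zeros((n, n))     # dense block of values

def apply_outer_product_update(csq, v_coords, v_values):
    """Subtract the outer product of a sparse column from a CSQ block.

    Assumes the column's nonzero coordinates are a subset of the CSQ's
    coordinates, so the update touches a dense sub-block (systolic-array
    friendly).
    """
    # Map global coordinates to local positions inside the CSQ block.
    pos = np.searchsorted(csq.coords, v_coords)
    csq.values[np.ix_(pos, pos)] -= np.outer(v_values, v_values)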

Multifrontal sparse factorization
The widely used multifrontal algorithm [19] organizes computation as a tree of operations on matrices in CSQ format. This compressed format enables using dense linear algebra primitives, and is used in many factorization packages such as MUMPS [2], STRUMPACK [22], and UMFPACK [13].
Symbolic factorization: The multifrontal algorithm relies on a preprocessing step called symbolic factorization that analyzes only the matrix's nonzero pattern and creates helper data structures. In most applications, the nonzero pattern is static or changes very infrequently. For example, when simulating a circuit, devices do not gain new neighbors, and when simulating a car collision, most of the mesh describing the car retains the same structure. As a result, this step can be amortized across many numeric solves, making its performance costs a secondary concern [14]. To illustrate the bigger picture and where sparse matrix factorization fits within it, Figure 2 shows the general structure of many applications that use sparse matrix factorization.
Listing 2: Multifrontal Cholesky pseudocode.
Numeric factorization: After preprocessing, the sparse factorization is described as an elimination tree [56] of supernodes, each represented by a CSQ matrix. Figure 4 shows an elimination tree for the matrix from Figure 1a. For a supernode s, the first k_s columns of its CSQ matrix will contain a subset of the matrix's columns, determined by symbolic factorization. For example, in Figure 4, the nonzeros from columns 0 and 5 are stored in the first two columns of supernode s0. The supernode's remaining columns are used to store the results of the outer products of the first k_s columns. Concretely, in Figure 4, the outer product of column 0 with itself will produce nonzero values at (5, 6, 8) × (5, 6, 8). By construction, these values can all be stored in s0 and represented efficiently using the CSQ format. For each supernode in Figure 4's elimination tree, the first k_s columns are shaded and the remaining columns are left blank.
During the actual numeric factorization, the algorithm traverses the tree from leaves to root, executing the pseudocode in Listing 2. At each supernode s, the first step is gathering updates from child CSQ matrices. Updates need to be accumulated by coordinate. For example, in Figure 4, when s = s6, the values at coordinate (6,6) in s0 and s2 would be added to the value at (6,6) in s6, and the value at (8,6) in s0 would be added into (8,6) in s6. Importantly, the same coordinate will almost always map to different positions (i.e., actual memory locations) in the parent and child CSQ matrices.
After all updates have been gathered, the algorithm runs k_s outer-loop iterations of Cholesky factorization (Listing 1) on s's CSQ matrix. This step produces the final output columns. For example, after running two Cholesky outer-loop iterations on s0, columns 0 and 5 of L will be fully factored and will not be updated again.
The fact that the last columns of any supernode s (those beyond the first k_s) need to be gathered into its parent's CSQ imposes a data dependence: the algorithm cannot begin factoring parent(s) before s has been fully factored. As long as data dependences are respected, supernodes can be factored in parallel. Listing 2 performs a postorder traversal of the elimination tree, which ensures that all children are factored before their parents. This ordering is correct, but in Section 5, we will refine this ordering to improve performance.
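The traversal can be summarized with the following sketch approximating Listing 2 (field and helper names such as gather_updates and partial_cholesky are assumptions, not Spatula's API):

def multifrontal_cholesky(root):
    """Sketch approximating Listing 2 (names and fields assumed).

    Each supernode holds a CSQ matrix (csq), the number of columns it
    finalizes (k), and its children in the elimination tree.
    """
    for s in postorder(root):                  # children before parents
        for child in s.children:
            gather_updates(s.csq, child.csq)   # accumulate by coordinate
        # Run k outer-loop iterations of Listing 1 on s's CSQ block;
        # this finalizes s's k output columns of L.
        partial_cholesky(s.csq, s.k)

def postorder(node):
    """Yield supernodes in post-order (leaves first)."""
    for child in node.children:
        yield from postorder(child)
    yield node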
This algorithm is efficient because factoring each supernode, which has cubic complexity, is efficient on CSQ matrices. Gather-updates have low arithmetic intensity, but there is only a quadratic number of updates, so factoring dominates. (Prior work sometimes calls the gather-updates operation "extend-add" [2,39].)

LU factorization
This section has focused on Cholesky factorization, but almost all of the explanations apply to LU factorization too. LU factorization is similar to Cholesky factorization, but is not limited to symmetric positive-definite (SPD) matrices. Whereas Cholesky factorization finds L such that A = LLᵀ, LU factorization finds a lower-triangular matrix L as well as an upper-triangular matrix U such that A = LU. This allows LU factorization to work on non-symmetric matrices, which also means the upper triangle needs to be stored and computed on. LU factorization requires ≈2× more FLOPs than Cholesky factorization. To preserve numeric stability, we use static pivoting as a preprocessing step [37]. Using static pivoting, we are able to use a similar loop nest to Listing 1, but change the loop bounds to materialize distinct values in the upper triangle. This implementation of sparse LU shares the same types of data dependences as sparse Cholesky.

SPARSE FACTORIZATION IS INEFFICIENT ON PRIOR ARCHITECTURES
We now describe why existing architectures are ill-suited to sparse factorization algorithms, motivating the need for a new accelerator. Section 3.1 and Section 3.2 quantitatively analyze the performance of state-of-the-art GPU and CPU implementations, showing that despite extensive hardware-aware optimizations, utilization is often dismal. Section 3.3 discusses other proposed accelerators, explaining why they lack key ingredients to handle sparse factorization.

GPU implementations
Given their wide availability and high peak compute throughput, GPUs have become the hardware of choice for accelerating linear algebra applications. There are many sparse matrix factorization packages that leverage GPU acceleration; we analyze CHOLMOD (Cholesky) [54] and STRUMPACK (LU) [22], as we found that they achieved the best performance across a wide range of sparse matrices. We run them on an NVIDIA V100 GPU, which has a peak double-precision floating-point throughput of 7000 GFLOP/s. Figure 5 reports performance in GFLOP/s, showing how well the GPU is utilized, on four representative matrices (our evaluation uses a larger set; see Section 7.1 for methodology details).
Figure 5 shows a wide range of efficiencies across matrices: while atmosmodd achieves 26% of the GPU's peak throughput, human_gene1 achieves only 2.8%, and FullChip achieves a dismal 0.004% of peak throughput, only 0.3 GFLOP/s! This wide range of performance happens because GPUs are inefficient when factoring small CSQ matrices. As we have seen in Section 2.3, sparse factorization mainly consists of outer-product updates on supernodes that are much smaller than the full matrix, and are stored compactly in CSQ format.
Different matrices have different nonzero patterns, which induce different symbolic factorizations with a wide range of supernode sizes. Figure 6 shows the cumulative distribution function (CDF) of FLOPs across supernode sizes for the two matrices at the extremes of efficiency: atmosmodd (top) and FullChip (bottom). In each graph, the x-axis is supernode size as the number of rows and columns (e.g., 4000 denotes a 4000×4000 supernode), and the line reports the fraction of total FLOPs (i.e., work) that happens on supernodes of size ≤ x. In atmosmodd, 8% of the work happens on supernodes of size ≤ 4000, i.e., 92% of work happens on supernodes of size > 4000. By contrast, in FullChip, the largest supernode has size 3047.
To a first order, we can approximate the work in each supernode as the factorization of a dense matrix. Figure 7 shows the performance of the V100 GPU on dense factorization as a function of matrix size, in GFLOP/s. Performance flattens around size 20000 and drops linearly below 10000, so small matrices suffer very low throughput. This mostly explains the dismal performance of GPUs on matrices like FullChip: small supernodes hamper utilization.
While the above analysis is a good first-order approximation, it is not the full story, because it assumes that each supernode is factored in series. This would result in even worse SM utilization. Instead, recent work uses batching [1,22,54]: grouping small supernodes at the same depth in the elimination tree into a single kernel, as shown in Figure 8. The GPU implementations we compare against use batching to improve utilization by exploiting parallelism across supernodes and amortizing kernel launch overheads. However, batching is a crude way to handle data dependences that misses many opportunities for parallelism. Batching also causes load imbalance because it groups supernodes of different sizes, as Figure 8 shows, causing poor SM utilization. Finally, level-by-level traversal of the elimination tree eliminates opportunities for data reuse, incurring additional DRAM traffic. It would be more efficient for the parent supernode to consume child updates immediately after they are produced, as we will see in Section 5.2. As a result of these architectural limitations, GPUs achieve poor utilization, as Figure 5 shows.
In summary, while GPUs have plentiful computational throughput, they are ultimately limited by irregular shapes and data dependences across supernodes, which destroy utilization. Spatula solves this with a more flexible organization that (1) achieves much higher performance on small matrices, and (2) gracefully handles complex data dependences to unlock much more parallelism within and across supernodes.

CPU implementations
Given the limited utilization of GPUs, it is important to consider CPU implementations as well. Although the CPU implementation significantly outperforms the GPU on FullChip and slightly outperforms it on human_gene1, it still suffers from low compute utilization. CPUs overcome some of the limitations of their lower peak throughput with more flexible scheduling hardware, but they still suffer dismal utilization on many difficult matrices.
Hybrid CPU-GPU: Prior work has also considered hybrid CPU-GPU approaches [21,40]. However, the increasing gap in FLOP/s between CPUs and GPUs, combined with costly host-accelerator communication, has resulted in these approaches lagging behind GPU-only techniques [54].

Other accelerators
ASIC accelerators for dense factorization: REVEL [65] and TaskStream [10] are hardware accelerators that support Cholesky factorization of small dense matrices. However, they do not support the sparse case. The difficulty in sparse matrix factorization is not primarily speeding up single-supernode kernels, but rather efficiently handling differently sized supernodes without major load imbalance. Additionally, data movement is not a significant concern for a single small dense Cholesky, but becomes crucial when dealing with a large tree of supernodes, where balancing parallelism and data movement is key to achieving good performance. FPGA accelerators for sparse factorization: Prior work has proposed using FPGAs to accelerate sparse factorization. But the limited arithmetic throughput of FPGAs makes them ill-suited to factorization. Nechma et al. [46] and Kapre et al. [30] describe designs with peak throughput of <10 GFLOP/s. Due to their low throughput, they underperform state-of-the-art CPU implementations. Furthermore, these designs use the Gilbert-Peierls algorithm [23], which does not efficiently scale to higher compute throughputs.

SPATULA ARCHITECTURE
Figure 9 shows an overview of Spatula's hardware architecture. Spatula combines several unique features that enable high-performance sparse factorization. First, Spatula features processing elements (PEs) that achieve high performance on small matrices. Each PE features a 16×16 systolic array that executes factorization tasks at high throughput. Supernodes are processed in 16×16 tiles. This design enables high performance when factoring small supernodes, which, as we have seen in Section 3, are common and performance-critical. Second, Spatula features fine-grained hardware support for scheduling, needed to handle frequent data dependences. Each PE is double-buffered to hide latency, and a global scheduler dispatches tasks across PEs. Third, Spatula has a memory hierarchy tailored to the needs of factorization workloads: a banked on-chip cache captures the irregular reuse of tiles, and high-bandwidth memory provides adequate throughput.

Tiles are Spatula's primitive datatype
The multifrontal algorithm (Section 2.3) runs sparse factorization as a tree of computations on CSQ matrices. The CSQ format enables processing sparse data using dense linear algebra kernels. These dense kernels can be tiled, which enables handling CSQ matrices of varying sizes with a fixed hardware tile size. Spatula's primitive datatype is T × T dense tiles of double-precision floating-point numbers. A CSQ matrix can be divided into tiles of a fixed size using position-based tiling [60], as shown in Figure 10. Tile size is an important parameter. Larger tiles allow Spatula to achieve a particular throughput with fewer, higher-throughput PEs, allowing for a simpler on-chip network and amortizing scheduling overheads. However, they increase the granularity of computations, leading to possible under-utilization.
Sweeping tile sizes, we find that T = 16 achieves the highest performance across a wide range of matrices. Each tile's values are stored contiguously in memory, along with the coordinates of each row and column.
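The following sketch (ours, with assumed field names) illustrates position-based tiling of a CSQ block into 16×16 tiles, each carrying the global coordinates of its rows and columns:

import numpy as np

TILE = 16  # primitive tile size used by Spatula

def tile_csq(values, coords):
    """Position-based tiling sketch: split an n x n CSQ block into
    TILE x TILE dense tiles (lower triangle only), padding the last
    tile when n is not a multiple of TILE.
    """
    n = len(coords)
    tiles = {}
    for i0 in range(0, n, TILE):
        for j0 in range(0, min(i0 + TILE, n), TILE):
            block = np.zeros((TILE, TILE))
            h, w = min(TILE, n - i0), min(TILE, n - j0)
            block[:h, :w] = values[i0:i0+h, j0:j0+w]
            tiles[(i0 // TILE, j0 // TILE)] = {
                "values": block,
                "row_coords": coords[i0:i0+h],   # global row coordinates
                "col_coords": coords[j0:j0+w],   # global column coordinates
            }
    return tiles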

Task-based programming model
Spatula uses a task-based programming model. Each task takes a set of tiles as inputs and accumulates the results into a single output tile. Spatula supports several types of tasks, all shown in Table 1. Each task runs on one PE, and each PE can run tasks of any type.
The multifrontal algorithm (Section 2.3) is decomposed into a collection of tasks. We now present this decomposition; Section 4.3, which describes the implementation of PEs, gives more details on the internal structure of each task. Figure 11 illustrates how the code on lines 8-12, which factors a supernode, is mapped into Spatula tasks. First, the top-left tile is factored using a dchol (dense Cholesky) task. This task's output tile, T000, is then consumed by a set of tsolve (triangular solve) tasks that process the leftmost column of tiles. These tasks' outputs, T010, T020, and T030, are then consumed by a set of dgemm tasks that process the second column of tiles. Once the dgemm tasks in the second column complete, tasks T111, T121, and T131 are able to execute, and the algorithm continues like this. Note that dgemm tasks in the third column take inputs from both the first and second columns. For example, one of T022's inputs is produced by T121. These deep chains are due to the dependences discussed in Section 2: each dgemm task computes and accumulates all the outer-product updates for its tile. Finally, the last task type, gather_updates, is used to gather updates across supernodes (lines 4-5 of Listing 2).
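The per-supernode task structure described above can be sketched as a simple loop nest (task naming here is illustrative, not Spatula's exact task descriptors):

def supernode_tasks(num_tiles):
    """Sketch of the task decomposition for one tiled CSQ supernode.

    Iteration k factors diagonal tile (k, k), solves the tiles below it,
    and then updates every remaining lower-triangle tile with a dgemm.
    """
    tasks = []
    for k in range(num_tiles):
        tasks.append(("dchol", (k, k)))                    # e.g., T000
        for i in range(k + 1, num_tiles):
            tasks.append(("tsolve", (i, k)))               # e.g., T010..T030
        for j in range(k + 1, num_tiles):
            for i in range(j, num_tiles):
                # dgemm: tile (i, j) -= tile (i, k) * tile (j, k)^T
                tasks.append(("dgemm", (i, j), (i, k), (j, k)))
    return tasks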

Processing elements
As shown in Figure 9, Spatula has 32 processing elements (PEs). Each PE features a systolic array that can execute all types of tasks. Systolic arrays are ideally suited to dgemm, the most common task type. The key design choice is whether to have homogeneous PEs that can handle all tasks, or heterogeneous PEs specialized to each task. We opt for a homogeneous approach for two reasons. First, the mix of different task types varies across matrices. For example, Serena has 99.4% of its FLOPs in dgemm tasks, whereas G3_Circuit has just 85% of its FLOPs in dgemm tasks. Second, other types of tasks can also benefit from systolic arrays, and these tasks are often on the critical path (e.g., dchol in Figure 11), so running them quickly helps overall performance.
Prior work has already proposed different systolic arrays for each task [6,34]; we adapt and combine these ideas to produce a single systolic array that can handle all of them. This design adds < 5% area overhead vs. an array that can only run dgemm. Below, we explain how we build up the array task by task.

Basic systolic array for dgemm: dgemm multiplies k pairs of input tiles, interpreted as matrices A and B of sizes T × kT and kT × T, and accumulates the result into a T × T tile D. Figure 12 (top) shows task T033 from Figure 11 as an example. We start with a standard systolic array, shown in Figure 12, which consists of a grid of T × T FMAC units with pipelined horizontal and vertical connections. The destination tile D is first loaded into the systolic array. Then, a column of A and a row of B are fed to the array each cycle, and the array computes and accumulates partial products following an output-stationary (inner-product) dataflow. Finally, the output tile D is read out row by row. The array is double-buffered (detailed later) to hide startup latencies, so its throughput is one tile multiplication every T cycles. Handling dense factorization (dchol/dlu) tasks: Tasks corresponding to top-left tiles, such as T000 in Figure 11, are small dense Cholesky factorizations (dchol), and require additional hardware support to execute on a systolic array.
We augment the systolic array to implement dchol using Brent et al.'s algorithm [6]. We extend the ALU at one of the corners of the array to support a division and square-root operation, as Figure 12 shows. We also add diagonal links between ALUs, and augment the PE's finite state machine (FSM) to support a different dataflow: the input tile is streamed into the array row by row (as in dgemm), but values cycle through the array so that all elements in the diagonal pass through the inverse-square-root ALU. Outputs are streamed into the PE's accumulator, then written back to the cache.
Systolic dense Cholesky factorization tasks are latency-bound, each having a critical path of T inverse-square-root operations. These tasks under-utilize the array's FMAC units, but they are uncommon, as each supernode has a linear number of these tasks vs. a cubic number of dgemm tasks. dlu tasks leverage the same hardware modifications. Handling triangular solve (tsolve) tasks: tsolve tasks need a much smaller set of extensions to implement Kung et al.'s algorithm [34]. Values of the read-only input tile are streamed downwards through the array, while values in each row of the read-write destination tile D are cycled through a row of ALUs. These tasks do not require any additional ALU hardware or links between ALUs over the above extensions, just additional control FSM logic to enable this dataflow. Handling gather-update tasks: gather_update tasks use the PE just for addition. These tasks are responsible for adding values with matching coordinates from a tile of a child's CSQ matrix into a tile of the parent's CSQ matrix. Figure 13a shows a detailed example, where updates (the unshaded region of each child CSQ matrix) need to be accumulated into the parent. However, at a tile level, the coordinates of each tile do not match. As shown in Figure 13a, the parent tile at position (2,1) requires updates from child tiles at positions (3,2), (3,3), (2,2), and (1,1). Coordinates on both axes in each tile are guaranteed to be strictly increasing, allowing addition to be implemented by shifting input rows into the correct position.
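A gather_update can be sketched as follows (tile fields as in the earlier tiling sketch; this is our illustration, not Spatula's datapath):

import numpy as np

def gather_update(parent, child):
    """Sketch of a gather_update: add the child tile's values into the
    parent tile at the positions with matching global coordinates.

    Assumes the child's coordinates are a subset of the parent's; since
    coordinates along each axis are strictly increasing, the mapping
    reduces to shifting each child row/column into its parent position.
    """
    rows = np.searchsorted(parent["row_coords"], child["row_coords"])
    cols = np.searchsorted(parent["col_coords"], child["col_coords"])
    parent["values"][np.ix_(rows, cols)] += child["values"]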
Task and data orchestration: To achieve full PE utilization, PEs employ two different latency-hiding mechanisms. First, as shown in Figure 9, each PE has multiple task slots (four in our implementation) to decouple data accesses from execution. Each slot can hold a different task, and the scheduler dispatches tasks to PE slots. When a task arrives at a slot, the PE starts loading the task's input tiles while the PE is executing a task in another slot. This lets the PE hide the latency of memory accesses (from cache hits or misses). Second, the systolic array is double-buffered: each array element has two sets of input and accumulator registers. While one task is executing, the PE can load the next task's operands into the array's accumulator registers. This lets the PE hide any startup latency when moving from one task to the next.
Tasks become runnable when all of their input operand loads have completed and data is available. If none of the PE's slots contains a runnable task, the PE stalls. As the PE executes a task, it draws inputs from the input tile FIFO associated with the current task. Upon finishing a task, the PE writes the updated destination tile back to Spatula's cache.

Hardware scheduler
Efficiently executing sparse matrix factorizations requires dynamic scheduling hardware. There are two sources of parallelism: fine-grained intra-supernode parallelism, where independent tasks from a single supernode can run in parallel, and coarser-grained inter-supernode parallelism, where tasks from independent supernodes can run in parallel. Due to the wide range of supernode sizes, achieving good utilization requires exploiting both intra- and inter-supernode parallelism. Figure 14 shows this by comparing Spatula's performance on several matrices under three scheduling policies. Inter dispatches each supernode to a different PE, exploiting only inter-supernode parallelism. By exploiting coarse-grain parallelism, Inter is very simple to implement and needs minimal hardware support. Unfortunately, it achieves terrible utilization. Recall from Figure 6 that most matrices have large supernodes with ample intra-supernode parallelism. Binding a single supernode to each PE results in running these large supernodes serially and bottlenecking performance. For example, when factoring the root supernode, there are no other available supernodes, so only one PE would be utilized.
By contrast, Intra runs one supernode at a time across all PEs, exploiting only intra-supernode parallelism. Intra requires a hardware scheduler, as with 32 PEs, it must dispatch a task every few cycles, and it must negotiate complex inter-task dependences. Intra works well when large supernodes dominate, e.g., in Emilia_923, but works poorly when small supernodes are frequent. For instance, Intra suffers from 4.0% compute utilization in G3_circuit, as its small supernodes cannot use all the PEs. This is the same reason why GPU implementations use batching (Section 3).
Intra+inter is Spatula's scheduling policy, which exploits both intra- and inter-supernode parallelism. This policy first exploits fine-grain parallelism within a single supernode to keep data footprint low, but overlaps multiple small supernodes when more parallelism is necessary. Figure 14 shows that Intra+inter achieves high utilization.
We now discuss the hardware support that enables Spatula's scheduling policy; Section 5 presents the policy itself.
Figure 15 shows Spatula's scheduler, which follows a two-level design: a supernode scheduler feeds ready supernodes to a task scheduler, which produces the fine-grained tasks of each supernode and dispatches them to PEs. Supernode scheduler: The supernode scheduler determines the coarse-grain schedule by controlling the processing order of supernodes. This unit requires limited throughput, producing at most one supernode per 100 cycles (and often much less). Thus, for flexibility, we implement this unit using a RISC-V control core. The core feeds the task scheduler through a queue of ready supernodes, and consumes supernode completion notifications from the task scheduler. Task scheduler: The task scheduler determines the fine-grain schedule of computation. This scheduler is implemented using dedicated hardware, because it requires high throughput (producing one task every 3 to 20 cycles) and it must handle tight data dependences with low latency. The task scheduler is built around a set of generator units (16 in our implementation). A generator is a simple, configurable FSM that produces all the tasks to process one supernode. Each generator is first configured with the information of the supernode, including its location in memory and its dimensions. Then, the generator emits a sequence of tasks that process the supernode. Once the generator finishes producing tasks for its current supernode, it can be reused for a different supernode. A task dispatcher consumes this sequence of tasks and dispatches them to PEs, filling their task slots greedily.
Each generator releases a task to the dispatcher only when the task is ready, i.e., when all its inputs have been computed. To this end, each generator keeps a completion scoreboard that tracks which inputs are available. Due to the structure of the computation, this information is easy to track, requiring S entries of ⌈log₂ S⌉ bits for an S × S-tile supernode (each entry encodes the last available column tile for its row of tiles). Because we use multi-level tiling (Section 5), S is limited to 80, and this scoreboard takes about 500 bits of state, with simple wakeup logic similar to that of a scalar scoreboarded core [61].
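A software sketch of this scoreboard (encoding assumed from the description above) makes the storage cost concrete:

class CompletionScoreboard:
    """Sketch of a generator's readiness tracking (encoding assumed).

    For an S x S-tile supernode, it keeps one entry per row of tiles:
    the index of the last column tile already computed in that row.
    With S <= 80, each entry needs at most 7 bits, on the order of the
    ~500 bits of state cited in the text.
    """
    def __init__(self, num_tile_rows):
        self.last_done_col = [-1] * num_tile_rows

    def mark_done(self, row, col):
        # Columns complete in order within a row, so one counter per
        # row suffices.
        self.last_done_col[row] = max(self.last_done_col[row], col)

    def is_ready(self, needed):
        """needed: list of (row, col) input tiles a task depends on."""
        return all(self.last_done_col[r] >= c for r, c in needed)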

Memory hierarchy
In our design, we opt for a cache instead of a scratchpad memory. As discussed in Section 4.4, the scheduler dynamically interleaves tasks from different supernodes depending on the readiness of their inputs. The resulting access pattern cannot be known at compile time, requiring the use of a cache.
We use an LRU cache with large cache lines. As dense 16 × 16 tiles are our primitive datatype, we can utilize large, tile-sized cache lines (2 KB in our implementation). We have relatively few PEs and relatively large data transfers, so a full crossbar is practical. Every cycle, each of the 32 PEs consumes 32 doublewords of data, resulting in a total of 8 TB/s of bandwidth needed to feed the PEs when they are all active. This network configuration consumes relatively little area and energy, as we will see in Section 7. Our scheduler issues memory accesses ahead of time, when a task group is first scheduled onto a PE, achieving a high degree of decoupling and limiting memory stalls.
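As a quick sanity check on that bandwidth figure (assuming the 1 GHz clock from Section 6):

# Peak on-chip bandwidth needed to feed all PEs (1 GHz clock assumed).
pes, doublewords_per_cycle, bytes_per_dw, hz = 32, 32, 8, 1e9
peak_bw = pes * doublewords_per_cycle * bytes_per_dw * hz
print(peak_bw / 1e12, "TB/s")   # ~8.2 TB/s, matching the ~8 TB/s figure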

SCHEDULING
Section 4.4 showed that exploiting intra- and inter-supernode parallelism is necessary, and presented the scheduling hardware needed to do so. We now present our scheduling algorithm, showing how we negotiate parallelism and footprint to achieve high utilization. To reduce data footprint, our general strategy is to exploit finer-grain parallelism first. Section 5.1 describes how Spatula schedules within a supernode, and Section 5.2 describes how Spatula dynamically overlaps supernodes when intra-supernode parallelism is insufficient.

Intra-supernode scheduling
Effectively scheduling a single supernode is important. Though Figure 14 shows that running a single supernode at a time is insufficient, exploiting available intra-supernode parallelism lets Spatula achieve the same degree of parallelism with fewer concurrent supernodes, reducing cache footprint. Breadth-first task order: Intra-supernode schedules are implemented in hardware by the generator FSMs described in Section 4.4. For simplicity, generators produce tasks in a fixed order, and dispatch tasks in this order to PEs (i.e., out-of-order completion is possible, but not out-of-order dispatch).
Due to frequent dependences, different task orders produce vastly different performance. However, we observe that due to the structure of the computation, a breadth-first task order produces a nearly optimal schedule. This is simply a loop nest that corresponds to a breadth-first traversal of the task dependence graph shown in Figure 11.
Simpler orders, like following a fixed-dimension order, fail to exploit the available parallelism and are multiple times slower on small supernodes with frequent dependences. Conversely, we explored an aggressive dataflow scheduler that issues tasks out of order, and found negligible overall performance gains: less than 10% in all cases. Thus, we opt for the simple breadth-first order. Multi-level tiling: Some matrices have supernodes whose size vastly exceeds on-chip storage. For instance, atmosmodd's largest supernode (Figure 6) is 12709×12709, over 1 GB! Handling these large supernodes efficiently requires additional levels of tiling. In Section 4, we saw that factorization is amenable to tiling, and Spatula uses small T × T tiles (16×16 in our implementation) as its main primitive. This structure is fractal, and admits further tiling. We introduce level-2 supertiles of S × S small tiles each. S is configurable, and simply adds loop-nest levels to the generator FSM, which produces tasks to compute outputs supertile by supertile.
We size each supertile to fit on-chip. For example, with S = 70, each supertile is 10 MB, which fits within Spatula's 16 MB cache. Most reuse happens within a supertile, allowing us to adopt the breadth-first task order without incurring additional data movement. Generators also process supertiles in breadth-first order, but this is to simplify logic: we have evaluated other supertile orders and performance is insensitive to ordering.
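A quick check of the supertile sizing (assuming a full square block of double-precision values):

# Supertile footprint: an S x S grid of 16 x 16 double-precision tiles,
# S = 70 (values from the text; full square block assumed).
S, TILE, BYTES = 70, 16, 8
footprint = (S * TILE) ** 2 * BYTES          # 1120 x 1120 doubles
print(footprint / 1e6, "MB")                 # ~10 MB, as cited above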

Inter-supernode scheduling
As discussed in Section 4.4, achieving high compute utilization requires running multiple supernodes concurrently. However, exploiting inter-supernode parallelism too aggressively risks blowing up the algorithm's cache footprint.
Spatula's general-purpose scheduling core enables flexibility when designing supernode-ordering algorithms. To balance parallelism with memory footprint, Spatula opts for an algorithm that is based on a post-order traversal of the tree. A post-order traversal minimizes footprint by visiting a parent supernode immediately after all its children. However, Spatula's algorithm allows dynamic reordering to unlock inter-supernode parallelism. Specifically, code running on the scheduling core maintains a min-heap of ready supernodes keyed by their position in the post-order traversal. Whenever a new supernode is needed, the scheduler core yields the supernode at the root of the min-heap.
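A sketch of this policy (reusing the postorder helper from the Listing 2 sketch in Section 2.3; interfaces and the parent field are assumptions):

import heapq

class SupernodeScheduler:
    """Sketch of the scheduling-core policy (interfaces assumed).

    Ready supernodes sit in a min-heap keyed by their post-order
    position, so parents are issued as soon as possible after their
    children, keeping the working set of live CSQ matrices small.
    """
    def __init__(self, root):
        order = list(postorder(root))
        self.pos = {s: i for i, s in enumerate(order)}
        self.pending = {s: len(s.children) for s in order}
        self.ready = [(self.pos[s], s) for s in order if not s.children]
        heapq.heapify(self.ready)

    def next_supernode(self):
        """Called whenever the task scheduler needs a new supernode."""
        return heapq.heappop(self.ready)[1] if self.ready else None

    def on_complete(self, s):
        """Called on a supernode-completion notification."""
        p = s.parent
        if p is not None:
            self.pending[p] -= 1
            if self.pending[p] == 0:        # all children factored
                heapq.heappush(self.ready, (self.pos[p], p))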

To minimize footprint, Spatula's task dispatcher follows a biased order: it tries to fill PEs with tasks from the generators with older supernodes, and uses more recent ones only when older supernodes have no ready tasks.
This policy automatically balances parallelism and footprint: when processing a large supernode, internal parallelism is plentiful, and the whole system focuses on that supernode, processed in cache-fitting supertiles. Other generators may have younger supernodes, but their tasks are not issued because the large supernode fills all PEs. Conversely, when processing many small supernodes with limited internal parallelism, the dispatcher overlaps the execution of these supernodes to keep utilization high. But this is fine footprint-wise, as each supernode is much smaller than the on-chip cache.

IMPLEMENTATION
We implement Spatula's components in RTL and synthesize them using Synopsys Design Compiler on commercial 12/14 nm technology processes, targeting a 1 GHz frequency. We assume 512 GB/s of bandwidth per PHY, similar to the NVIDIA A100 GPU, which has 2.4 TB/s with 6 HBM2E PHYs [48]. We rely on prior work to estimate the PHY's area [11] and power [20].
Table 2 shows Spatula's configuration and its area breakdown by component. Spatula's design is balanced between computation and communication, with functional units taking up 43.5 mm² out of the total 108 mm² of area.

EVALUATION

Experimental methodology
Evaluated systems: We compare Spatula with two baselines: an NVIDIA V100 GPU, and an AMD Ryzen Threadripper PRO 3975WX CPU. Note that these systems have higher area and TDP than Spatula at similar or more advanced nodes (specifically, the V100 is 815 mm² in TSMC 12 nm) [47]. Simulation: We evaluate Spatula using a cycle-level simulator. Our simulator is based on a simulator for accelerators that has been used in prior work, including Gamma [68] and ISOSceles [67], but with customized timing models for PEs, scheduler, and memory system.
The simulator is cycle-driven [41]: every hardware component is modeled as an object; every cycle, each object is ticked and its activity is simulated appropriately.
Spatula's simulator uses synthesis-derived detailed timing models for PEs, scheduler, caches, on-chip network, and main memory. Caches are banked, pipelined, and implement lookups with serial tag and data accesses, and we model bank access latency and contention. Banks are shared across PEs. The NoC connecting PEs and cache banks is modeled using bit-sliced crossbars as described by Passas et al. [51]. We model HBM2E's structure using Micron's specifications [43]. Each cache bank issues accesses to a single HBM2E channel. Because each cache line is 2 KB, the size of the row buffer, memory accesses achieve high utilization.
As an optimization, PEs are simulated at a task granularity, and we do not model the cycle-by-cycle execution of each task (e.g., the timing of each individual ALU). Since tasks are executed systolically, once started, each task incurs a fixed latency that depends solely on tile size parameters encoded in the task descriptor. This enables us to fully simulate large matrices instead of depending on sampling-based approaches. We model all dynamic timings of PEs, e.g., each task does not start until all its operands have been fetched.
Finally, we check functional correctness against the baselines, and we compute power by combining activity factors from simulations with synthesis results. Selected configuration: Spatula's default configuration uses the parameters in Table 2. We determined this configuration by sweeping the number of PEs, cache banks, HBM PHYs, and primitive tile size. We select a Pareto-optimal configuration with reasonable area and power. We use RTL synthesis to find the area and power of components other than main memory; we use prior work to estimate HBM2E power [20,52] and PHY area [12,52]. Section 7.3 evaluates larger and smaller Spatula designs, showing good scalability. Factorization algorithms: We compare against state-of-the-art implementations of sparse Cholesky and sparse LU.
For sparse LU, we compare against STRUMPACK [22], a widely used sparse LU package optimized for both CPUs and GPUs. For sparse Cholesky, we compare against CHOLMOD [8]. CHOLMOD is widely used and is the standard sparse Cholesky implementation used in MATLAB. We compare against the most recent versions (STRUMPACK 6.3 and CHOLMOD 7), except for CHOLMOD on GPU, where we use version 4.6.0-beta, which incorporates batching [54] and outperforms more recent versions. These packages use MKL on the CPU, and cuBLAS and cuSolver on the GPU. Input matrices: We select a representative set of matrices from SuiteSparse [16]. We select symmetric positive-definite matrices for Cholesky, and leave the others for LU. For both Cholesky and LU, we select the 20 relevant matrices with the longest GPU execution time. Hardware acceleration is most needed for matrices that are time-consuming to factor. We find that among these top matrices, some have extremely similar structures, such as Long_Coup_dt0 and Long_Coup_dt6. Additionally, some matrices are reduced versions of others, such as bone010 and bone010_M. To achieve a more diverse set, we select only one matrix from each of these groups.
The resulting matrix sets have diverse structures and display a wide range of utilizations in the CPU and GPU baselines. Our methodology produced a set of matrices that contains almost all the matrices evaluated by our baselines [22,54]. Performance for Spatula and the baselines depends on matrix structure. In matrices dominated by large supernodes, such as atmosmodd, all systems achieve high utilization, and Spatula's speedups over the GPU and CPU are more muted (e.g., 7.7× over the GPU on atmosmodd). But in matrices with a more diverse mix of supernodes, GPUs and CPUs falter, whereas Spatula still achieves good utilization. This causes large speedups. For example, Spatula is 22,000× faster than the GPU and 1,600× faster than the CPU on FullChip, because Spatula achieves 6.5 TFLOP/s on this matrix while the baselines have terrible utilization (Section 3).

Architectural analysis
Utilization: Spatula achieves high PE utilization across matrices. Figure 16 shows a breakdown of activity in PEs, showing the cycles that they spend on tasks of each type and stalled (due to limited parallelism or memory access stalls). Stalls are rare (typically 5-15% of cycles), showing that Spatula's latency-hiding mechanisms and memory-aware scheduling are effective. As discussed in Section 4.3, most cycles are spent in dgemm tasks, where Spatula achieves full utilization of ALUs. gather_update tasks are the second most common type, and the other types are rare overall, but can consume significant cycles in some matrices (like G3_circuit or rajat31). Data movement: Figure 17 reports Spatula's main-memory traffic per matrix. The left of each bar shows the total data transferred (bottom) and average memory bandwidth (top); each bar is broken down by type of traffic. Loads are broken down into three categories: compulsory loads needed to read inputs, non-compulsory loads initiated by gather_update tasks, and non-compulsory loads initiated by other task types. Store traffic is split between writing results back to main memory and spilling intermediates.
Figure 17 shows that Spatula achieves high bandwidth utilization: the average is 40% of the maximum 1 TB/s, with remarkably little deviation across matrices (27% to 87%) given their diversity. This is because Spatula's memory-aware scheduling achieves similar operational intensity across matrices.
Figure 17 also shows that Spatula avoids needless data movement. This can be seen from the ratio of non-compulsory loads to store spills, which is close to 1:1 across all matrices. This means that each spilled value is read back only once most of the time, avoiding thrashing. Power consumption: Figure 18 shows a breakdown of Spatula's power consumption. On almost all matrices, more than half of total power (including main memory) goes to PEs, showing that, thanks to Spatula's memory-aware scheduling, these algorithms are compute-bound. Scheduling: Figure 19 shows CDFs of the distributions of concurrently executing supernodes for Cholesky (top) and LU (bottom). Each line reports results for a single matrix, and each point shows the fraction of time (y-axis) that Spatula spends executing at most the given number of supernodes (x-axis) concurrently.
Figure 19 shows two important points. First, different matrices need different levels of concurrency, requiring a flexible scheduler. Second, for many matrices, a lot of the time is spent processing a single large supernode. This is because, unlike GPU implementations, Spatula can factor small supernodes at high throughput, removing them as bottlenecks. Scalability: We explore the design space of Spatula implementations by sweeping the number of PEs, the primitive tile size, the number of HBM PHYs, and the cache size. Figure 20 shows the performance of each design (y-axis) as a function of area (x-axis). Spatula's design scales gracefully to both larger and smaller configurations, with linear performance along the Pareto frontier. Comparison across GPU generations: We have compared with the V100 GPU because it is built on the same technology we synthesize Spatula on, but more recent GPUs exist.
Table 5 shows the gmean performance of STRUMPACK on the NVIDIA V100, A100, and H100 GPUs, and their utilization as a percentage of peak throughput (7, 19.5, and 51 FP64 TFLOP/s, respectively). Newer GPUs improve throughput, but utilization is low across the board: the A100 improves on the V100's utilization, reaching 4.8% (likely due to its larger cache and FP64 tensor cores); meanwhile, the H100 barely outperforms the A100 and suffers the lowest utilization, 2.0%.
These results show that the latest GPUs still suffer from poor utilization, and Spatula still outperforms them by a wide margin. Moreover, if built on newer technologies, Spatula can also be scaled to provide higher throughputs. For example, Spatula as configured achieves an 11× speedup over the A100 (on LU alone), but the A100 has 2.6× more transistors than the V100. As Figure 20 shows, Spatula scales effectively with area.

Figure 1: Performing a single step of Cholesky factorization on a sparse matrix: (a) initial matrix A; (b) result of executing lines 3-6 of Listing 1; (c) result of applying the first column's outer-product update to the rest of the matrix (lines 8-10 of Listing 1).

Figure 3: Outer products can be stored in compressed format.

Figure 6: CDFs of FLOPs by supernode size for the two extreme matrices in Figure 5, showing that matrices with lower utilization have more FLOPs in smaller supernodes.

Figure 7: Performance of dense LU factorization (without pivoting) across matrix sizes on an NVIDIA V100 GPU.


Figure 11: Data-dependence graph of tasks needed to perform 9 outer-loop iterations of factorization on a tiled CSQ supernode.

Figure 13: Given the elimination tree with tiled CSQ supernodes shown in (a), (b) illustrates the many-to-many gather_update dependence structure.


Table 1: Spatula task types. Note that D is always a Tile, and it is both an input to the task and its sole output.
Figure 12: A snapshot of a PE's state. T033 is drawn from Figure 11, and the other task shown is from a different supernode. Vertical and horizontal links are part of the basic systolic array, and diagonal links are added to support dchol and dlu.

Table 2: Configuration and area of Spatula as evaluated.

Table 3: Spatula performance on sparse Cholesky and speedups over GPU and CPU.

Table 4: Spatula performance on sparse LU and speedups over GPU and CPU.

Table 5: Performance of STRUMPACK on other GPUs.