Matrix-free SBP-SAT finite difference methods and the multigrid preconditioner on GPUs

Summation-by-parts (SBP) finite difference methods are widely used in scientific applications alongside a special treatment of boundary conditions through the simultaneous approximation term (SAT) technique, which enables the valuable proof of numerical stability. Our work is motivated by multi-scale earthquake cycle simulations described by partial differential equations (PDEs) whose discretizations lead to huge systems of equations and often rely on iterative schemes and parallel implementations to make the numerical solutions tractable. In this study, we consider 2D, variable-coefficient elliptic PDEs in complex geometries discretized with the SBP-SAT method. The multigrid method is a well-known, efficient solver and preconditioner for traditional numerical discretizations, but it has not been well developed for SBP-SAT methods on HPC platforms. We propose a custom geometric-multigrid preconditioned conjugate-gradient (MGCG) method that applies SBP-preserving interpolations. We then present novel, matrix-free GPU kernels designed specifically for SBP operators, whose differences from traditional methods make this task nontrivial; the kernels perform 3× faster than SpMV while requiring only a fraction of the memory. The matrix-free GPU implementation of our MGCG method performs 5× faster than the SpMV counterpart for the largest problems considered (67 million degrees of freedom). When compared to off-the-shelf solvers in the state-of-the-art libraries PETSc and AmgX, our implementation achieves superior performance in both iteration count and overall runtime. The method presented in this work offers an attractive solver for simulations using the SBP-SAT method.


INTRODUCTION
Computational modeling of the natural world involves pervasive material and geometric complexities that are hard to understand, incorporate, and analyze. The partial differential equations (PDEs) governing many of these systems are subject to boundary and interface conditions, and all numerical methods share the fundamental challenge of how to enforce these conditions in a stable manner. Additionally, applications involving elliptic PDEs or implicit time-stepping require efficient solution strategies for linear systems of equations.
Most applications in the natural sciences are characterized by multiscale features in both space and time, which can lead to huge linear systems of equations after discretization. Our work is motivated by large-scale (~hundreds of kilometers) earthquake cycle simulations where frictional faults are idealized as geometrically complex interfaces within a 3D material volume and are characterized by much smaller-scale features (~microns) [21,28]. In contrast to single-event simulations, e.g. [49], where the computational work at each time step is a single matrix-vector product, earthquake cycle simulations must integrate with adaptive time steps through the slow periods between earthquakes and are tasked with a much more costly linear solve. For example, even with upscaled parameters so that larger grid spacing can be used, the 2D simulations in [21] generated matrices of size ~10^6, and improved resolution and 3D domains would increase the system size to ~10^9 or greater. Because iterative schemes are most often implemented for the linear solve (since direct methods require a matrix factorization that is often too large to store in memory), it is no surprise that the sparse matrix-vector product (SpMV) arises as the main computational workhorse. The matrix sparsity and condition number depend on several physical and numerical factors, including the heterogeneity of the Earth's material properties, the order of accuracy, the coordinate transformation (for irregular grids), and the mesh size. For large-scale problems, matrix-free (on-the-fly) techniques for the SpMV are fundamental when the matrix cannot be stored explicitly.
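A back-of-the-envelope calculation makes the storage argument concrete. The sketch below assumes a second-order 5-point stencil stored in CSR with 8-byte values and 4-byte indices; these are illustrative assumptions, not figures taken from the simulations cited above.

```python
# Approximate CSR memory footprint for a sparse system matrix.
# Assumed layout: 8-byte values, 4-byte column indices and row pointers.
def csr_bytes(nrows, nnz_per_row, val_bytes=8, idx_bytes=4):
    """CSR storage = values + column indices + row-pointer array."""
    nnz = nrows * nnz_per_row
    return nnz * (val_bytes + idx_bytes) + (nrows + 1) * idx_bytes

# ~10^6 unknowns (the 2D simulations in [21]) vs ~10^9 (resolved 3D target)
small = csr_bytes(10**6, 5)
large = csr_bytes(10**9, 5)
print(f"{small / 1e6:.0f} MB")   # comfortably fits in memory
print(f"{large / 1e9:.0f} GB")   # exceeds a single accelerator's memory
```

At ~10^9 unknowns even this sparsest-possible stencil matrix occupies tens of gigabytes in CSR, which is the regime where matrix-free application of the operator becomes necessary.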
In this work, we use summation-by-parts (SBP) finite difference methods [30,40,52,53], which are distinct from traditional finite difference methods in their use of specific one-sided approximations at domain boundaries that enable the highly valuable proof of stability, a necessity for numerical convergence. Weak enforcement of boundary conditions offers additional advantages over traditional methods; for example, the simultaneous-approximation-term (SAT) technique relaxes continuity requirements (of the grid and the solution) across physical or geometrical interfaces, with low communication overhead for efficient parallel algorithms [18]. For these reasons SBP-SAT methods are widely used in many areas of scientific computing, from the flow over airplane wings to biological membranes, to earthquakes and tsunamigenesis [20,36,43,46,54,60]; these studies, however, have not been developed for linear solves or were limited to small-scale simulations.
With this work, we contribute a novel iterative scheme for linear systems based on SBP-SAT discretizations, where nontrivial computations arise due to the boundary treatment. These methods are integrated into our existing, public software for simulations of earthquake sequences. Specifically, we make the following contributions:
• Since preconditioning of iterative methods is a hugely consequential step towards improving convergence rates, we develop a custom geometric multigrid preconditioned conjugate gradient (MGCG) algorithm which shows a near-constant number of iterations with increasing system size. The required iterations (and time-to-solution) are much lower compared to several off-the-shelf preconditioners offered by the PETSc library [4], a state-of-the-art library for scientific computing.
• We develop custom, matrix-free GPU kernels (specifically for SBP-SAT methods) for computations in the volume and on boundaries, which show improved performance compared to the native, matrix-explicit implementation, while requiring only a fraction of the memory.
• GPU acceleration of our resulting matrix-free, preconditioned iterative scheme shows superior performance compared to state-of-the-art methods offered by NVIDIA.
Furthermore, the ubiquity of SBP-SAT methods in modern scientific computing applications means our work has the propensity to advance scientific studies currently limited to small-scale problems. The paper is organized as follows: First, we present related (albeit limited) work on preconditioning and GPU acceleration of iterative schemes for SBP-SAT methods. Next, we provide a detailed description of our model problem, followed by the corresponding SBP-SAT discretization, which generates the linear system we focus on in this work. We include details of our code development in the Julia programming language and its HPC capabilities, and verify our solutions through rigorous spatial convergence tests to ascertain correctness. We next discuss a two-pronged approach for accelerating time-to-solution, with the developed methods customized for SBP-SAT methods. The first is a preconditioning approach, which can be thought of as a mathematical means for reducing the overall number of iterations required for the iterative scheme to converge. Second, individual iterations can be accelerated by GPU parallelism and the development of matrix-free operations. Preconditioning and parallel performance are compared against options from several state-of-the-art libraries (PETSc and NVIDIA AmgX). We conclude with a summary of our results and proposed future work.

RELATED WORK
Conjugate gradient (CG) is a standard iterative method for solving linear systems involving symmetric positive-definite (SPD) matrices, but its performance often relies fundamentally on preconditioning techniques specialized to the application, e.g. [1,25,44,59]. To our knowledge, no systematic studies of preconditioning performance for SBP-SAT methods have been done prior to this study. Multigrid (MG) is a technique for both solving and preconditioning linear systems using a hierarchy of successively coarser grids. Three key ingredients define multigrid methods, namely, interpolation operators (prolongation and restriction), smoothers, and (if used) a direct solve on the coarsest grid. MG has inherently high parallelism, efficiently reduces low-frequency error components through coarse-grid corrections, and has proven to be a highly successful preconditioner for improving convergence rates in a wide range of application problems [55]. In [50], MG methods were explored for SBP-SAT methods, and the development of SBP-preserving interpolation operators showed improved performance by modifying standard interpolation operators near boundaries. However, those studies only explored MG as a solver, not its effectiveness as a preconditioner. In addition, they applied Galerkin coarsening to produce the coarse-grid operators rather than rediscretizing the original PDE. Because part of the focus of our work is the development of matrix-free methods, defining the coarse-grid operators via Galerkin coarsening would require writing an entirely different kernel for every grid level, as well as more memory for data storage [11]. Therefore, to fully utilize the efficiency of our matrix-free methods, as well as to reduce the complexity and number of kernels needed, we focus on rediscretization when using MG as a preconditioner for CG (denoted MGCG).
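The three ingredients above (transfer operators, smoother, coarse solve) can be sketched in a minimal two-grid cycle. The Python example below is illustrative only (the paper's implementation is in Julia): it rediscretizes a 1D Poisson problem on the coarse grid, smooths with weighted Jacobi, and uses full-weighting transfer operators; these choices are assumptions for illustration, not the paper's operators.

```python
import numpy as np

def poisson1d(n, h):
    """Rediscretized 1D Laplacian on n interior points (Dirichlet ends)."""
    return (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
            - np.diag(np.ones(n - 1), -1)) / h**2

def jacobi(A, x, b, nu, omega=2/3):
    """nu sweeps of weighted Jacobi smoothing."""
    d = np.diag(A)
    for _ in range(nu):
        x = x + omega * (b - A @ x) / d
    return x

def two_grid_cycle(A, x, b, h, nu=3):
    n = A.shape[0]; nc = (n - 1) // 2
    # full-weighting restriction: coarse point i sits at fine point 2i+1
    R = np.zeros((nc, n))
    for i in range(nc):
        R[i, 2*i:2*i+3] = [0.25, 0.5, 0.25]
    P = 2 * R.T                                     # linear interpolation
    x = jacobi(A, x, b, nu)                         # pre-smoothing
    r = b - A @ x
    e = np.linalg.solve(poisson1d(nc, 2*h), R @ r)  # rediscretized coarse solve
    x = x + P @ e                                   # coarse-grid correction
    return jacobi(A, x, b, nu)                      # post-smoothing

n, h = 31, 1/32
A = poisson1d(n, h)
b = np.ones(n)
x = np.zeros(n)
for _ in range(10):
    x = two_grid_cycle(A, x, b, h)
print(np.linalg.norm(b - A @ x))  # residual after 10 cycles
```

Note the coarse operator comes from rediscretization (`poisson1d(nc, 2*h)`) rather than the Galerkin product R A P, mirroring the design choice discussed above.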
The CG iteration itself involves three components: the sparse matrix data structure (if matrix-explicit methods are used), the reduction operations to compute inner products, and the matrix-vector multiply. For matrix-explicit methods, loading the matrix from device memory often dominates SpMV performance. Even though GPUs are bandwidth rich (an order of magnitude or more above CPUs), making them well suited for SpMV [16], the SpMV still constitutes the majority of the computational load. Specifically, performance can be limited by low computational intensity, irregular memory access, and sparsity [33], and though not the focus of this work, much effort has been devoted to sparse matrix data structures such as CSR (our baseline comparison), ELL, COO, and BELLPACK (GPU only) [3,5,7,8,16,56,58].
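For reference, the CSR matrix-vector product at the heart of matrix-explicit CG can be sketched as follows (illustrative Python; the irregular gathers into x are precisely the memory-access pattern that limits SpMV performance).

```python
import numpy as np

# Minimal CSR SpMV, the baseline our matrix-free kernels are compared to.
# A teaching sketch; production kernels (e.g. cuSPARSE) are far more tuned.
def spmv_csr(vals, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        # each output row gathers its nonzeros: irregular access into x
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# 3x3 example matrix [[4,1,0],[1,4,1],[0,1,4]] in CSR form
vals    = np.array([4., 1., 1., 4., 1., 1., 4.])
col_idx = np.array([0, 1, 0, 1, 2, 1, 2])
row_ptr = np.array([0, 2, 5, 7])
print(spmv_csr(vals, col_idx, row_ptr, np.ones(3)))  # → [5. 6. 5.]
```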
To our knowledge, the only previous work on SBP-SAT methods on the GPU considered a matrix-explicit SpMV [32], or used stencil computations in seismic wave propagation but did not solve linear systems [45]. Here we develop a matrix-free implementation of the SpMV and a corresponding MGCG method to solve our model problem. Matrix-free methods bring the additional benefits of increased arithmetic intensity and a reduced memory footprint, and do not require matrix assembly (which can be cumbersome for complex applications), all of which can improve performance on modern, bandwidth-limited processors [13,24,35,42].

MODEL PROBLEM AND METHODS 3.1 Partial Differential Equations for the Solid Earth
Our work is motivated by the study of quasi-static deformation of the solid Earth over the time scales of earthquake cycles. Motion is governed by the equilibrium equation and a constitutive relationship describing the material properties. The standard assumption is that the Earth is linear elastic, defined on a sub-domain of R^3. While solutions to the 3D elasticity equations are the eventual target, 2D problems are considered in this work in order to sort out implementation details with reduced computational costs. The 2D anti-plane shear problem [37] is particularly ubiquitous in earthquake cycle benchmark problems [22], where a vertical cross-section of a 3D problem (assuming invariance in one direction) gives rise to an elliptic equation at each time step, in which only one non-zero component of the displacement vector exists and depends on two spatial variables, namely

-∇ · (μ∇u) = f in Ω,  (1)

where μ(x, y) is the spatially varying shear modulus, u is Earth's material displacement in the out-of-plane direction, and f comprises body forces. In order to enable complex fault geometries and topography, we assume that Ω is a curved quadrilateral in R^2, which enables extensions to arbitrary domains partitioned into computational blocks, e.g. [29]. As Figure 1(a) illustrates, the boundary can be partitioned into four curves ∂Ω_k, k = 1, ..., 4, where (for example) ∂Ω_3 represents Earth's surface, and the shear modulus μ(x, y) can vary in order to represent heterogeneities in the crust, for example, a shallow sedimentary basin, as illustrated.
For generality we consider both Dirichlet and Neumann boundary conditions, namely

u = g_k on ∂Ω_k, k = 1, 2, and n · μ∇u = g_k on ∂Ω_k, k = 3, 4,  (2)

where the vector n is the outward-pointing normal to the domain boundary and g_k, k = 1, ..., 4 represent boundary data.

Coordinate Transformation
In order to solve (1)-(2) with SBP-SAT methods, the domain Ω is transformed (via a conformal mapping) to the regular, square domain Ω̄ = [-1, 1] × [-1, 1] with coordinates (r, s), yielding a transformed problem of the form

-∇̄ · (C∇̄u) = Jf in Ω̄,  (3)

where ∇̄ = [∂/∂r, ∂/∂s]^T and faces k = 1, ..., 4 define the domain boundaries of Ω̄, given in Figure 1(b). J > 0 is the Jacobian of the transformation and S_J is the surface Jacobian on each face. The vector n is the outward-pointing normal to the face, and the 2 × 2 matrix C is symmetric positive definite (SPD) and encodes the variable material properties μ(x, y) and the coordinate transformation; see [23,29] for more details.

SBP-SAT Finite Difference Methods
SBP methods approximate partial derivatives using one-sided differences at all points close to the boundary, generating a matrix that approximates a partial derivative operator. In this work we focus on the second-order derivatives that appear in (3); however, the matrix-free methods we derive are applicable to any second-order PDE. We consider SBP finite-difference approximations to boundary-value problem (3), i.e. on the square computational domain Ω̄; solutions on the physical domain Ω are obtained by the inverse coordinate transformation.
In this work, we focus on SBP operators with second-order accuracy, which contain abundant complexity at domain boundaries and thus offer insight into implementation design extendable to higher-order methods. To provide background on SBP methods we first describe the 1D operators, as Kronecker products are used to form their multi-dimensional counterparts.

1D Operators.
We discretize the spatial domain −1 ≤ x ≤ 1 with N + 1 evenly spaced grid points x_i = −1 + ih, i = 0, ..., N, with grid spacing h = 2/N. A function u projected onto the computational grid is denoted by u = [u_0, u_1, ..., u_N]^T and is often taken to be the interpolant of u at the grid points. We define the grid basis vector ê_j to be a vector with value 1 at grid point j and 0 elsewhere, which allows us to extract the jth component: u_j = ê_j^T u.

Definition 3.1 (First Derivative). A matrix D_x is an SBP approximation to the first derivative operator ∂/∂x if it can be decomposed as H D_x = Q with H being SPD and Q satisfying Q + Q^T = ê_N ê_N^T − ê_0 ê_0^T. Here H is a diagonal quadrature matrix and D_x is the standard central finite difference operator in the interior, which transitions to one-sided differences at the boundaries.

Definition 3.2 (Second Derivative). A matrix D_xx^(μ) is an SBP approximation to ∂/∂x(μ ∂/∂x) if it can be decomposed as D_xx^(μ) = H^{-1}(−M^(μ) + ê_N μ_N d_N^T − ê_0 μ_0 d_0^T), where M^(μ) is SPD and d_0^T u and d_N^T u are approximations of the first derivative of u at the boundaries.
With these properties, both D_x and D_xx^(μ) mimic integration by parts in a discrete form, which enables the proof of discrete stability [39,40].
The operator D_xx^(μ) is a centered difference approximation within the interior of the domain, but includes one-sided approximations at boundary points as well. For illustrative purposes alone, if μ = 1 (a constant-coefficient case), the interior rows of the matrix reduce to the stencil [1, −2, 1]/h^2, which resembles the traditional (second-order-accurate) Laplacian operator in the domain interior.
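The first-derivative definition above can be checked numerically. The sketch below builds the standard second-order diagonal-norm SBP operators (a textbook construction, not the paper's code) and verifies both the SBP property and exactness on linear functions.

```python
import numpy as np

# Second-order 1D SBP first-derivative operator on [-1, 1]; a sketch of
# the standard diagonal-norm construction, not the paper's implementation.
N = 8
h = 2 / N
H = h * np.diag([0.5] + [1.0] * (N - 1) + [0.5])    # diagonal quadrature
Q = 0.5 * (np.diag(np.ones(N), 1) - np.diag(np.ones(N), -1))
Q[0, 0], Q[-1, -1] = -0.5, 0.5                       # boundary closure
D = np.linalg.solve(H, Q)                            # D = H^{-1} Q

# SBP property: Q + Q^T = e_N e_N^T - e_0 e_0^T
B = np.zeros((N + 1, N + 1)); B[0, 0], B[-1, -1] = -1.0, 1.0
assert np.allclose(Q + Q.T, B)

# interior rows are central differences, boundary rows one-sided,
# and D is exact for linear functions
x = np.linspace(-1, 1, N + 1)
assert np.allclose(D @ x, np.ones(N + 1))
print(D[0, :3] * h, D[1, :3] * h)   # one-sided vs central stencil weights
```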

2D SBP Operators.
The 2D computational domain is discretized using N + 1 grid points in each direction, resulting in an (N + 1) × (N + 1) grid of points, where grid point (i, j) is located at (−1 + ih, −1 + jh) for 0 ≤ i, j ≤ N with h = 2/N. Here we have assumed equal grid spacing in each direction only for notational ease; the generalization to different numbers of grid points in each dimension does not impact the construction of the method and is implemented in our code. A 2D grid function u is ordered lexicographically, and we let C_ij = diag(c_ij) define the diagonal matrix of coefficients, see [29].
In this work we imply summation whenever indices are repeated. Multi-dimensional SBP operators are obtained by applying the Kronecker product to 1D operators; for example, the 2D second-derivative operators D_ij^(c_ij) for i, j ∈ {r, s} are formed this way. Here M^(c_ij) is the sum of SPD matrices approximating the integrated second derivatives (i.e. summed over repeated indices), and an additional matrix involves the boundary derivative computations; see [23] for complete details.
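The Kronecker-product construction can be illustrated in the constant-coefficient case, where the second-order operator reduces to the familiar [1, −2, 1]/h^2 stencil on every row (shifted at the boundaries). This NumPy sketch is an illustration under that simplification, not the variable-coefficient operator of the paper.

```python
import numpy as np

# 2D operator from 1D operators via Kronecker products (constant coefficient).
N = 16
h = 2 / N
D2 = (np.diag(np.ones(N), 1) - 2 * np.eye(N + 1)
      + np.diag(np.ones(N), -1)) / h**2
D2[0, :3]   = np.array([1., -2., 1.]) / h**2     # shifted boundary rows
D2[-1, -3:] = np.array([1., -2., 1.]) / h**2
I = np.eye(N + 1)
L2d = np.kron(D2, I) + np.kron(I, D2)            # discrete Laplacian

# exact on quadratics: the Laplacian of x^2 + y^2 is 4 at every grid point
x = np.linspace(-1, 1, N + 1)
X, Y = np.meshgrid(x, x, indexing="ij")
u = (X**2 + Y**2).flatten()                      # lexicographic ordering
print(np.allclose(L2d @ u, 4.0))                 # → True
```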

SAT Penalty Terms.
SBP methods are designed to work with various impositions of boundary conditions that lead to provably stable methods, for example through weak enforcement via the simultaneous-approximation-term (SAT) technique [14], which we adopt here.
As opposed to traditional finite difference methods that "inject" boundary data by overwriting grid points with the given data, the SAT technique imposes boundary conditions weakly (through penalization), so that all grid points approximate both the PDE and the boundary conditions up to a certain level of accuracy. The combined approach is known as SBP-SAT. Where traditional methods that use injection or strong enforcement of boundary/interface conditions destroy the discrete integration-by-parts property, SAT terms preserve it and thereby enable proof of the method's stability (a necessary property for numerical convergence) [38].

PROBLEM DISCRETIZATION
The SBP-SAT discretization of (3) takes the form of a linear system in which the discrete equations are augmented by SAT vectors b^k formed from the boundary condition on each face k. To illustrate the structure of the SAT vectors, enforcing Dirichlet conditions on faces k = 1, 2 generates penalty terms in which a matrix G^k computes the weighted derivative on face k, L^k is the face-extraction operator, and H^k is the 1D quadrature matrix H in the direction parallel to face k. The penalty also involves a diagonal matrix built from the coefficient matrix C restricted to face k, together with a diagonal penalty matrix on face k whose components are taken sufficiently large to ensure discrete stability; its entries depend on c^k_min, the minimum value of the coefficients at the two points orthogonal to the boundary, and on h^k_⊥, the grid spacing orthogonal to face k [23].
In order to render the linear system SPD, we multiply on the left by (H ⊗ H) (the discrete equivalent of integrating over the domain and discretizing the weak form). This process yields the final linear system

A u = b,  (9)

where A is SPD [23]. The right-hand side vector b encodes the source term and boundary data; the matrices applied to the boundary data depend on the boundary conditions, here those given in (3). Note that A includes contributions from both the volume operators (the M^(c_ij) terms) and the SAT enforcement of boundary conditions, and differs from the traditional discrete Laplacian near all domain boundaries; see Figure 2. At Dirichlet boundaries (faces 1 and 2), the SAT contribution modifies the layer of three points normal to the face (i.e. the SAT imposition penalizes all points used in the computation of the derivative normal to the face).

DEVELOPMENT AND VERIFICATION
In this work all code is written in the Julia programming language. Julia is a dynamically typed language for scientific computing designed with high performance in mind [10]. Julia supports general-purpose GPU computing with the package CUDA.jl. By communicating with NVIDIA's compiler through the LLVM intermediate representation, CUDA.jl has been shown in previous research to achieve the same level of performance as CUDA C [9]. Aimed at addressing the "two-language" problem, Julia enables ease of implementation of complex mathematical algorithms while achieving high performance, an ideal match for computational scientists without expertise in low-level, language-based HPC. Julia has gained attention among the HPC community, with notable examples including The Climate Machine [51], a new Earth System model written purely in Julia that is capable of running on both CPUs and GPUs by utilizing KernelAbstractions.jl [17], a pure Julia device abstraction similar to Raja, Kokkos, and OCCA [6,15,41]. In addition, because a large body of researchers studying SBP methods use Julia in serial, e.g. [29,47], our developments will enable these users to gain HPC capabilities in their code with minimal overhead.

Table 1: Discretization error and convergence rate for the SBP-SAT method. The additional digits illustrate the convergence rate approaching the theoretical rate of 2.

N      Discretization error   Convergence rate
2^9    1.353609E-05           1.998341E+00
2^10   3.385302E-06           1.999455E+00
2^11   8.464360E-07           1.999811E+00
2^12   2.116192E-07           1.999930E+00
2^13   5.290578E-08           1.999973E+00
The SBP-SAT discretization used to form system (9) is rigorously verified for correctness by confirming numerical convergence towards a known, analytical solution [48]. The exact solution (denoted with an asterisk) is taken on the domain Ω with corners (x, y) = {(−0.3, 0), (0.5, −0.25), (0, 1), (1, 1.5)}, illustrated in Figure 1(a), with curved edges defined by sinusoids; the grid is formed with transfinite interpolation. The spatially variable shear modulus is chosen to form a semi-ellipse representing a shallow sedimentary basin. The basin contains compliant material with μ_in = 20, surrounded by stiffer host rock with μ_out = 32; the basin parameters (0.5, 6.25E-4, and 0.015) implicitly define a basin depth and width of 0.05. The discrete L2-error is defined by the H-norm, namely

‖u − u*‖_H = sqrt((u − u*)^T (H ⊗ H)(u − u*)),

where u* is the exact solution evaluated on the grid. This error is reported in Table 1 using a direct solve (i.e. we compute the discretization error), which shows that we have achieved convergence at the theoretical rate.
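The convergence rates in Table 1 can be reproduced directly from the successive errors, since under grid doubling the observed order is rate = log2(e_N / e_2N):

```python
import math

# Observed convergence order between successive grid doublings,
# computed from the discretization errors listed in Table 1.
errors = {2**9: 1.353609e-05, 2**10: 3.385302e-06, 2**11: 8.464360e-07,
          2**12: 2.116192e-07, 2**13: 5.290578e-08}
Ns = sorted(errors)
for coarse, fine in zip(Ns, Ns[1:]):
    rate = math.log2(errors[coarse] / errors[fine])
    print(f"N = 2^{fine.bit_length() - 1}: rate = {rate:.6f}")
```

The printed rates match the table's third column and approach the theoretical value of 2 as the grid is refined.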

PERFORMANCE: PRECONDITIONING
Iterative methods enable the solution of larger problems of the form (9) compared to a direct solve, which requires storing a matrix factorization. However, the convergence of CG depends predominantly on the condition number of the matrix A, which can be reduced (to accelerate convergence) through preconditioning techniques. The preconditioning matrix M must be SPD and fixed, and although it need not be explicitly assembled nor inverted, a good preconditioner should satisfy M ≈ A^{-1}.
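A preconditioned CG sketch illustrates how M enters only through its action on the residual and is never assembled as an inverse. Here a simple Jacobi (diagonal) preconditioner stands in for the multigrid V-cycle used in this work; illustrative Python, not the paper's Julia implementation.

```python
import numpy as np

# Standard preconditioned conjugate gradient (PCG).
# apply_M is the action of the preconditioner M ≈ A^{-1} on a vector.
def pcg(A, b, apply_M, tol=1e-10, maxit=500):
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_M(r)
    p = z.copy()
    it = 0
    while np.linalg.norm(r) > tol * np.linalg.norm(b) and it < maxit:
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        z_new = apply_M(r_new)
        beta = (r_new @ z_new) / (r @ z)   # standard PCG update
        p = z_new + beta * p
        r, z = r_new, z_new
        it += 1
    return x, it

n = 100
A = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1))
b = np.ones(n)
x, iters = pcg(A, b, lambda r: r / np.diag(A))   # Jacobi preconditioner
print(iters, np.linalg.norm(b - A @ x))
```

Swapping the `apply_M` callback for a V-cycle turns this into the MGCG method developed in this section.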
In this work we develop a custom MG preconditioner by first adopting the second-order SBP-preserving prolongation/restriction operators from [50], which maintain accuracy at domain boundaries and correctly transfer residual vectors to the coarser grids. The 2D restriction operator is constructed as the H-scaled adjoint of prolongation, using H_h and H_2h, which denote H ⊗ H with grid spacing h and 2h, respectively. The 2D prolongation operator I_{2h}^h is defined via Kronecker products of the standard 1D prolongation operator [12], modified near boundaries; see Appendix A in [50].
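The adjoint relationship between restriction and prolongation can be sketched with the standard (unmodified) 1D operators; the SBP-preserving versions of [50] additionally alter rows near the boundaries, which this illustration omits.

```python
import numpy as np

# 1D sketch: restriction as the H-scaled adjoint of linear-interpolation
# prolongation, R = H_2h^{-1} P^T H_h. Illustrative only; see [50] for the
# boundary-modified SBP-preserving operators.
Nf, Nc = 8, 4
hf, hc = 2 / Nf, 2 / Nc
P = np.zeros((Nf + 1, Nc + 1))                 # prolongation (coarse -> fine)
for i in range(Nc + 1):
    P[2*i, i] = 1.0                            # coincident points copied
for i in range(Nc):
    P[2*i + 1, i] = P[2*i + 1, i + 1] = 0.5    # midpoints averaged

Hf = hf * np.diag([0.5] + [1.0] * (Nf - 1) + [0.5])
Hc = hc * np.diag([0.5] + [1.0] * (Nc - 1) + [0.5])
R = np.linalg.solve(Hc, P.T @ Hf)              # R = Hc^{-1} P^T Hf

xc = np.linspace(-1, 1, Nc + 1)
xf = np.linspace(-1, 1, Nf + 1)
print(np.allclose(P @ xc, xf))                 # prolongation exact on linears
print(np.allclose(R @ np.ones(Nf + 1), 1.0))   # restriction preserves constants
```

The H-scaling reproduces classical full weighting in the interior while keeping the transfer operators adjoint with respect to the SBP quadrature, which is the property exploited by the preconditioner.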
One feature that differentiates our problem formulation from those in [50] is that our matrix in (9) is rendered SPD only after multiplication on the left by (H ⊗ H), which introduces additional grid information when calculating the associated residual vector. To properly transfer this grid information, we found improved performance of our preconditioner when further modifying the restriction operator to account for grid spacing. This is achieved by excluding the (H ⊗ H) term when computing the residual on the fine grid, restricting the result, and then re-introducing the grid spacing on the coarse grid. In addition, we explored both Galerkin coarsening and rediscretization to form the coarse-grid operators and found the latter to give superior performance. The pseudo-code for the custom MG method (which we apply as a preconditioner) is given in Algorithm 1.
We first compare the preconditioning performance of our custom geometric multigrid against the multigrid using Galerkin's condition from [50]. To avoid the influence of hyperparameters in smoothers, we use the Gauss-Seidel method as the smoother. This method is not well suited for GPU parallelization, so here we focus only on preconditioning performance in terms of iteration reduction. For a 2D computational grid of N = 2^k points in each direction, we use a single V-cycle multigrid of k − 1 levels with 5 smoothing steps on each level, including the coarsest grid (i.e. taking ν1 = ν2 = ν3 = 5 and one cycle in Algorithm 1), as the preconditioner. The MGCG stops when the relative residual is reduced to less than 10^-8 times the initial residual. For all tests in this work we initialize the iterative scheme with the zero vector. We report results in Table 2. Our custom geometric multigrid method achieves preconditioning results comparable to the multigrid using Galerkin's condition from [50], both having near-constant iteration counts across problem sizes. The iteration count for non-preconditioned CG almost doubles when N is doubled, indicating worse condition numbers on finer grids.
Many types of smoothers for multigrid methods can be explored for best performance. For the rest of this work, we focus on the Richardson iteration, given by x_{k+1} = x_k + ω(b − Ax_k), because it is easily accommodated by our subsequent development of matrix-free kernels for A. Here ω is chosen to satisfy the convergence criteria, and its optimal value depends on the eigenvalues of A as ω = 2/(λ_min + λ_max), where λ_max and λ_min are the largest and smallest eigenvalues of A, respectively; λ_min,N denotes the minimal eigenvalue of the linear system formed for our 2D problem with N + 1 grid points in each direction, and λ_max,N the corresponding maximum value. We compute these extremal eigenvalues with Arpack.jl, a Julia wrapper around the ARPACK library.

Algorithm 1: (n + 1)-level MG for A^h u^h = b^h, with smoothing applied ν times per level. SBP-preserving restriction and interpolation operators are applied. Grid coarsening (level l → l + 1) is done through successive doubling of the grid spacing until reaching the coarsest grid. The multigrid cycle can be performed ν_c times. r represents the residual, and e represents the solution to the residual equation used during the correction step. This algorithm is adapted from [34].

Table 2: Performance of MGCG using our custom preconditioner (denoted MGCG) against the multigrid using Galerkin's condition (denoted MGCG-Galerkin) from [50] and non-preconditioned CG. We report the total iterations to converge and the final relative residual for the three algorithms.

There are many configurations for the multigrid method, but we found that the total number of iterations required by MGCG is largely determined by the number of grid levels and smoothing steps and is less impacted by the choice of smoother itself. MG methods have many tunable parameters. For this initial study, we explored MGCG performance while varying the number of Richardson pre- and post-smoothing steps on every level between 1 and 5. We considered a single V-cycle (ν_c = 1 in Algorithm 1), including on the coarsest grid (5 grid points in each direction). All algorithms stop when the relative residual is reduced to less than 10^-6 times the initial residual.
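The Richardson smoother and its optimal damping parameter can be sketched as follows (illustrative Python on a small SPD matrix; the paper computes the extremal eigenvalues of A with Arpack.jl rather than a dense eigensolver):

```python
import numpy as np

# Richardson iteration x_{k+1} = x_k + omega (b - A x_k), with the optimal
# damping omega = 2 / (lambda_min + lambda_max).
n = 50
A = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1))
lam = np.linalg.eigvalsh(A)          # ascending eigenvalues of SPD A
omega = 2 / (lam[0] + lam[-1])       # optimal damping parameter

b = np.ones(n)
x = np.zeros(n)
r0 = np.linalg.norm(b)
for _ in range(100):
    x = x + omega * (b - A @ x)
print(np.linalg.norm(b - A @ x) / r0)   # relative residual after 100 sweeps
```

As expected of a smoother, stand-alone convergence is slow (the contraction factor is (κ − 1)/(κ + 1) per sweep); its role inside MG is to damp the oscillatory error components cheaply.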

Comparisons against an unpreconditioned CG are not generally appropriate, as most real-world applications require preconditioning to make the solution tractable. The Portable, Extensible Toolkit for Scientific Computing (PETSc) [4] is one of the most widely used parallel numerical software libraries, featuring extensive preconditioning methods, many of which can be tested by users via relatively simple command-line options. We experimented with several of PETSc's off-the-shelf algebraic multigrid preconditioned CG solvers (denoted PAMGCG). PETSc's PAMGCG is similar to our MGCG and only requires loading PETSc-formatted A and b (from which it forms the coarse-grid operators).
We tested PAMGCG with various configurations against our custom MGCG, applying the same stopping criterion (here based on the relative norm of the residual vector, reduced to 1E-6), with results provided in Tables 3 and 4. The mg_levels_ksp_type and mg_levels_pc_type entries in the tables stand for the Krylov subspace method types and preconditioner types used at each level of the multigrid in PAMGCG. When classical iterative methods are used as smoothers, mg_levels_ksp_type is set to richardson and the particular smoother (e.g. Jacobi) is set by mg_levels_pc_type. Since our MGCG uses the Richardson iteration as the multigrid smoother, we report mg_levels_ksp_type as richardson and mg_levels_pc_type as none to maintain coherence across the columns. Iterations and total time to converge are reported. We found that the Jacobi iteration is not a good choice of smoother in PAMGCG: with 1 smoothing step it takes more iterations than other configurations, and it does not converge (denoted DV) with 5 smoothing steps. Aside from this configuration, the other PETSc configurations in the table exhibit comparable performance in both the number of iterations and the convergence time, and we found that additional options within PAMGCG play relatively minor roles in performance. Our MGCG results (reported in the last rows), however, show superior performance in terms of both iteration counts and overall time.

ENVIRONMENT AND IMPLEMENTATIONS
Preconditioning reduces the total number of iterations required by CG, but time-to-solution can be further accelerated by reducing the time per iteration. We achieve these improvements by adding GPU capabilities to our MGCG method, specifically through the development of matrix-free kernels for computing the SpMV (the action of the matrix A).

Environment Configuration
For the GPU implementation, the matrix-vector product is tested on an NVIDIA V100-SXM2 GPU with 32 GB of memory and an A100-PCIE GPU with 40 GB of memory. Tests are carried out using CUDA 12.0. In addition, the host system has dual E5-2690v4 CPUs (28 cores) with 256 GB of system memory. The results for CG and MGCG are obtained on the A100 GPU.

Kernel Design
Stencil computations have proven efficient at utilizing GPU resources to achieve high performance [31,57]. In this work we implement a similar GPU kernel for our 2D problem by matching each spatial node to a GPU thread; however, our work requires specialized treatment of domain boundaries. The most computationally expensive operator is the volume operator (the M^(c_ij) terms), which differs from traditional finite difference operators in that it involves derivative approximations at domain boundaries. The use of conditional (if/else) statements in GPU kernels tends to cause warp divergence and should be avoided. We construct the matrix-free action of A, referred to as mfA!(), based on node location. The action on boundary nodes, however, has a different stencil depending on the face number and whether the node is at a corner of the domain.
To avoid race conditions at corners (while minimizing conditional statements), only the components of the volume operator involving derivatives normal to the face are computed (as they correspond to the same stencil). For example, on face 1 only the actions of M^(c_rr) and M^(c_rs) are computed at the corners, see Figure 3. The actions of the remaining components on the corner nodes are computed in the computations associated with the adjacent faces (faces 3 and 4).
At boundary nodes we must also compute the boundary condition operators, with stencils differing by face number and by whether a node is an interior node, an interior boundary node (i.e. not a corner), or a corner node. Algorithm 3 provides the pseudo-code for nodes on face 1; stencils are differentiated with superscripts corresponding to interior-boundary, northwest-corner, and southwest-corner nodes, respectively. Figure 3 further illustrates the nodes involved in each computation: black dots correspond to nodes within the 2D domain. Black circles correspond to the interior nodes that are modified by the action of the volume operator. On the western boundary (face 1), the three-node layer adjacent to face 1 is used to compute the actions of the volume and boundary operators. Blue diamonds and red stars correspond to nodes that are modified by the different components of the volume operator. Green squares correspond to nodes that are modified by the boundary operator on face 1 in order to impose the Dirichlet condition (in this case a layer of three nodes normal to the face).
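The kernel structure described above (a branch-free interior stencil plus separate boundary closures) can be mimicked in a vectorized sketch. `mfA_apply` below is a hypothetical constant-coefficient stand-in for the paper's CUDA.jl kernel mfA!(), not its actual implementation: NumPy slices play the role of per-node threads, and each face's closure is applied as a separate small update instead of per-thread branching.

```python
import numpy as np

# Matrix-free action of a constant-coefficient 2D second-derivative operator.
# Hypothetical helper for illustration; the paper's kernel handles the
# variable-coefficient SBP-SAT operator with per-face SAT stencils.
def mfA_apply(u, N, h):
    U = u.reshape(N + 1, N + 1)
    out = np.zeros_like(U)
    for axis in (0, 1):                  # second derivative in each direction
        W = np.moveaxis(U, axis, 0)
        V = np.moveaxis(out, axis, 0)    # writable view into out
        V[1:-1] += (W[2:] - 2 * W[1:-1] + W[:-2]) / h**2   # branch-free bulk
        V[0]  += (W[0]  - 2 * W[1]  + W[2])  / h**2        # face closures
        V[-1] += (W[-1] - 2 * W[-2] + W[-3]) / h**2
    return out.flatten()

N, h = 16, 2 / 16
x = np.linspace(-1, 1, N + 1)
X, Y = np.meshgrid(x, x, indexing="ij")
u = (X**2 + Y**2).flatten()
print(np.allclose(mfA_apply(u, N, h), 4.0))   # Laplacian of x^2+y^2 is 4 → True
```

No matrix is ever formed: the operator is reapplied from the stencil on every call, which is what trades matrix loads for recomputation in the GPU kernels.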

PERFORMANCE: MATRIX-FREE GPU KERNELS

8.1 Performance Comparison
With mfA!() we can carry out the matrix-vector product without explicitly storing the matrix. In this section we compare its performance against the matrix-explicit cuSPARSE SpMV implementation available through CUDA.jl. We note that this is not an exhaustive comparison against all possible sparse matrix data structures. Our goal is to establish a baseline comparison of our matrix-free implementation against the standard sparse matrix format CSR in CUDA.jl, with a focus on integration with preconditioning for improving CG performance.
We set up our benchmark as follows: we discretize the domain Ω in each direction using N+1 grid points, varying N from 2^4 to 2^13, so the matrix is of size (N+1)^2 × (N+1)^2. Figures 4 and 5 compare the performance of the matrix-free implementation against the matrix-explicit SpMV provided by cuSPARSE using the CSR format on both the A100 and V100 GPUs. Performance is measured by profiling 10,000 SpMV calculations with NVIDIA Nsight Systems, and the times shown in the figures represent the time to perform one SpMV calculation. For problem sizes large enough to occupy the GPU (N greater than 2^10), we see consistent speedup from the mfA!() kernel, with higher speedup achieved for larger problem sizes. On the A100 GPU, the speedup ranges from 3.0× to 3.1×.
On the V100 GPU, we see a similar trend, with speedup ranging from 3.1× to 3.6×. The mfA!() kernel has a low arithmetic intensity of 0.28 based on the computation of the interior points (which account for more than 99% of the total computation and data access). This puts the mfA!() kernel in the bandwidth-limited regime [19]. If we plot this on the Roofline model, as shown in Figure 6 as the left red dot, we see that our kernel achieves performance higher than what is possible for the given arithmetic intensity. If we instead calculate the arithmetic intensity under the assumption that data is read from DRAM only once (i.e., the ideal case in which the kernel incurs only compulsory cache misses), shown in Figure 6 as the right red dot, we obtain a higher arithmetic intensity of 1.85, and our achieved performance falls below the Roofline. This suggests that a large portion of our data is served from fast memory (the L1 or L2 caches), yielding performance better than what is achievable if all data came from DRAM.

To confirm this hypothesis, we use NVIDIA Nsight Compute to profile our code for the problem size N = 2^13. The profile shows a 72% L1 cache hit rate and a 57% L2 cache hit rate, indicating that the majority of our data (approximately 88%) comes from the L1 and L2 caches, and that our DRAM reads are due mostly to compulsory cache misses (i.e., when the input data is read for the first time). This explains why our code performs better than the DRAM-bounded prediction. Figure 7 shows the Roofline model generated by Nsight Compute, based on performance counter measurements of how much of the overall data traffic is served by each level of the memory hierarchy. Figure 7 confirms that the majority of our data comes from the L1 cache, followed by L2 and DRAM. It also suggests that we can further improve the performance of the mfA!() kernel by improving data reuse in the L1 cache, which would yield up to a 3.8× speedup.
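The Roofline reasoning above is a one-line computation: attainable throughput is the lesser of peak compute and arithmetic intensity times memory bandwidth. The sketch below uses the paper's two intensities (0.28 and 1.85) together with assumed, round A100 figures (roughly 9.7 TFLOP/s FP64 peak and 1555 GB/s HBM bandwidth for the 40 GB part); the exact hardware numbers are our assumption, not taken from the paper.

```python
def roofline_bound(ai_flops_per_byte, peak_gflops, bw_gbytes_per_s):
    """Attainable GFLOP/s under the Roofline model: min(peak, AI * bandwidth)."""
    return min(peak_gflops, ai_flops_per_byte * bw_gbytes_per_s)

# Arithmetic intensities from the text: 0.28 charges all traffic to DRAM;
# 1.85 assumes each datum is read from DRAM exactly once (compulsory misses).
ai_measured, ai_ideal = 0.28, 1.85

# Assumed A100 figures (illustrative, not measured in the paper):
peak_fp64_gflops = 9700.0
dram_bw_gbs = 1555.0

bound_measured = roofline_bound(ai_measured, peak_fp64_gflops, dram_bw_gbs)
bound_ideal = roofline_bound(ai_ideal, peak_fp64_gflops, dram_bw_gbs)

# Achieving more than bound_measured while staying below bound_ideal is only
# possible if much of the traffic is served from L1/L2 rather than DRAM,
# which is the paper's conclusion from the Nsight Compute profile.
print(bound_measured, bound_ideal)
```

Both intensities fall well to the left of the machine-balance point, so the kernel is bandwidth-limited either way; the gap between the two bounds is what cache reuse can recover.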
In future work, we will target improved performance of mfA!(), for example through additional memory optimization techniques to improve the L1 cache hit rate, especially on newer architectures. In the present work, however, we focus on utilizing mfA!() to solve the linear system with preconditioning.

Memory Usage Comparison
Next we compare the memory usage of mfA!() against the SpMV kernel via the built-in memory status function in CUDA.jl. CUDA.jl currently has good support for only three sparse matrix formats: CSR, CSC, and COO. In Julia, the default sparse matrix format is CSC, but in CUDA.jl the default is CSR; thus, a conversion between the two formats is necessary when converting CPU arrays to GPU arrays. However, for our problem, where the matrix is SPD, both CSR and CSC formats use exactly the same amount of memory; the only difference is the use of row pointer values (rowptr, for CSR) instead of column pointer values (colptr, for CSC), and the ordering of the nonzero values nzval. To avoid redundancy, we merge the key memory consumption results for the CSC and CSR formats into three numbers for each N. The collected data is given in Table 5.

Figure 7: Roofline model generated by Nsight Compute, based on performance counter measurements of how much of the overall data is coming from different levels of the memory hierarchy. This confirms our hypothesis that the majority of our data is coming from the L1 cache, and that further improving data reuse in L1 will yield up to 3.8× speedup.
For the matrix-free method, memory consumption is reported in Table 6. In order to perform the matrix-vector product, we need to allocate memory to store the three coefficient arrays; each requires the same amount of memory as the numerical solution and must be stored on each grid level when using geometric multigrid as a preconditioner. In addition, we must compute and store the minimum coefficient values on faces 1 and 2, as specified in (8), which we denote Ψ1 and Ψ2, respectively. These are associated with the Dirichlet boundary conditions and are significantly smaller in size, and are thus reported in KB. Adding up these contributions, we can compute the total memory size, which we provide in the last columns of Tables 5 and 6. There is a significant reduction in the memory required for the matrix-free method relative to the memory needed to store sparse matrices in CSC or CSR format. When calculating the total memory used for an SpMV operation (including writing results into output vectors), we must add the memory allocated for the input and output data, which require the same memory as the coefficients (the first column of Table 6). A simple calculation shows that the total memory required when using the SpMV kernel is a constant 4.2× that required for the matrix-free method.
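The comparison above amounts to a back-of-the-envelope memory model. The sketch below assumes an illustrative ~9 nonzeros per row and 4-byte indices; these are our assumptions, not the paper's measured values (which are in Tables 5 and 6). With these toy parameters the ratio lands near 3×; the paper's measured constant of 4.2× reflects the actual nonzero structure of the assembled SBP-SAT matrix.

```python
# Memory model for a grid of (N+1)^2 double-precision unknowns.

def csr_bytes(m, nnz, val_b=8, idx_b=4):
    # nzval + column (or row) indices + row (or column) pointers.
    # CSR and CSC totals coincide, as noted in the text.
    return nnz * val_b + nnz * idx_b + (m + 1) * idx_b

def spmv_total_bytes(N, nnz_per_row=9):
    m = (N + 1) ** 2
    matrix = csr_bytes(m, nnz_per_row * m)   # assumed ~9 nonzeros per row
    vectors = 2 * m * 8                      # input and output vectors
    return matrix + vectors

def matrix_free_total_bytes(N, n_coeff=3):
    m = (N + 1) ** 2
    coeffs = n_coeff * m * 8                 # three coefficient arrays
    vectors = 2 * m * 8                      # input and output vectors
    return coeffs + vectors                  # face minima (KB scale) neglected

ratio = spmv_total_bytes(2 ** 10) / matrix_free_total_bytes(2 ** 10)
print(round(ratio, 2))
```

The ratio is independent of N (both totals scale with m), which is consistent with the constant factor reported in the text.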

PERFORMANCE: MATRIX-FREE MGCG ON GPUS
With the matrix-free action established, we can solve system (9) with a matrix-free version of our custom MGCG method (MF-MGCG). In addition to low-level GPU kernels, Julia supports high-level vectorization for GPU computing, which we utilize extensively in our MGCG code for convenience. In this section, we compare its performance against MGCG using the cuSPARSE (matrix-explicit) SpMV (SpMV-MGCG) and also against the state-of-the-art off-the-shelf solver offered by NVIDIA, namely AmgX, the GPU-accelerated algebraic multigrid library. The solvers and preconditioners used by AmgX are specified in JSON files. We explored the sample JSON configuration files for AmgX in the source code and found that CG preconditioned by classical AMG performed best for our problem. To maintain a multigrid setup comparable to our MGCG, we modified PCG_CLASSICAL_V_JACOBI.json to use 1 and 5 smoothing steps with block Jacobi as the smoother. All algorithms stop when the relative residual is reduced to less than 10^-6 times the initial residual. We report our results in Table 7. Also included in the table are results using a direct solve (LU factorization via LAPACK in Julia), because direct solves are so often used in the earthquake cycle community for volume-based codes [22] and our developed methods offer promising alternatives. As illustrated, the GPU-accelerated iterative schemes achieve much better performance for the problem sizes tested. Table 7 shows that our MGCG method requires fewer iterations to converge than AmgX, while iterations for both remain roughly constant with increasing problem size. When we increase the smoothing steps from 1 to 5, AmgX requires fewer iterations, but its time to solve increases by roughly 3×. Because we apply rediscretization (rather than Galerkin coarsening) for MGCG, the setup time is negligible. The setup time for AmgX, however, is comparable to its solve time, which adds cost when using AmgX as a solver. Our SpMV-MGCG is roughly 2× slower than AmgX with 1 smoothing step, but our MF-MGCG is faster than AmgX, achieving up to 2× speedup at N = 2^13. Compared to our SpMV-MGCG, our MF-MGCG achieves more than 2× speedup, and the speedup is most pronounced at N = 2^13, indicating that MF-MGCG is well suited to large problems.
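The MGCG structure discussed above can be sketched on a toy problem. Everything in the sketch is a stand-in: a 1D Poisson model instead of the 2D SBP-SAT system, weighted-Jacobi smoothing and linear interpolation instead of the paper's SBP-preserving transfer operators. It does, however, use the same design choice as the paper: the coarse operator comes from rediscretization, not Galerkin coarsening, so no coarse-matrix assembly (setup) is needed.

```python
import numpy as np

def poisson(n):
    """Rediscretized 1D Laplacian on n subintervals (n-1 interior points)."""
    h = 1.0 / n
    return (2.0 * np.eye(n - 1) - np.eye(n - 1, k=1) - np.eye(n - 1, k=-1)) / h**2

def prolongation(nc):
    """Linear interpolation from nc coarse points to 2*nc+1 fine points."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j, j] += 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] += 0.5
    return P

def vcycle(A, Ac, P, R, b, nu=3, omega=2.0 / 3.0):
    """One symmetric two-grid cycle applied to b (zero initial guess)."""
    x = np.zeros_like(b)
    D = np.diag(A)
    for _ in range(nu):                                # pre-smoothing (weighted Jacobi)
        x += omega * (b - A @ x) / D
    x += P @ np.linalg.solve(Ac, R @ (b - A @ x))      # coarse-grid correction
    for _ in range(nu):                                # post-smoothing
        x += omega * (b - A @ x) / D
    return x

def pcg(A, b, precond, tol=1e-6, maxit=200):
    """Preconditioned CG; precond(r) applies M^{-1} to a residual."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    r0, it = np.linalg.norm(r), 0
    while np.linalg.norm(r) > tol * r0 and it < maxit:
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        z_new = precond(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z, it = r_new, z_new, it + 1
    return x, it

n = 256
A, Ac = poisson(n), poisson(n // 2)    # coarse operator by rediscretization
P = prolongation(n // 2 - 1)
R = 0.5 * P.T                          # full-weighting restriction
b = np.ones(n - 1)
x_mg, it_mg = pcg(A, b, lambda r: vcycle(A, Ac, P, R, r))
x_cg, it_cg = pcg(A, b, lambda r: r)   # plain CG for comparison
print(it_mg, it_cg)
```

Matched pre- and post-smoothing with a restriction proportional to the transpose of the prolongation keeps the V-cycle preconditioner symmetric positive definite, which CG requires; the multigrid-preconditioned solve converges in a handful of iterations, nearly independent of problem size, mirroring the behavior reported in Table 7.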

SUMMARY AND FUTURE WORK
In this work we present a matrix-free implementation of a multigrid preconditioned conjugate gradient method to solve 2D, variable coefficient elliptic problems discretized with an SBP-SAT method. Our custom multigrid preconditioner achieves preconditioning performance similar to the multigrid based on Galerkin's condition from previous work, and it is better suited to GPU code. The MGCG algorithm requires a nearly constant number of iterations to converge across problem sizes. We explored several comparable solvers and preconditioners in PETSc and found that our MGCG requires fewer iterations for the same convergence criteria. We developed matrix-free kernels that outperform the cuSPARSE SpMV kernels from NVIDIA (accessed via CUDA.jl) in both runtime and memory usage. We used Nsight Compute to analyze the performance of our matrix-free kernel, which offers insight into the achieved computational and memory performance and points to directions for future kernel-level optimizations on newer GPU architectures. The resulting matrix-free MGCG method achieves better performance than several off-the-shelf solvers offered by NVIDIA's AmgX when tested on the same GPU. We also demonstrate Julia's ability to leverage both high-level vectorization and low-level GPU kernels for GPU computing, achieving performance comparable to packages written in native CUDA C. This facilitates seamless integration of GPU-accelerated MGCG solvers into existing Julia code without external reliance on C solvers.
This work is a fundamental first step towards high-performance implementations for solving linear systems arising from SBP-SAT methods. Future work will target SBP-SAT methods with higher-order accuracy in 3D, as well as exploration of additional GPU kernel optimizations.

Figure 1 :
Figure 1: (a) Geometrically complex physical domain Ω with material stiffness that increases from a lower value within a shallow, ellipsoidal sedimentary basin to that of the stiffer host rock outside. (b) Ω is transformed to a regular, square computational domain via conformal mapping.

Figure 2 :
Figure 2: Sparsity pattern of the matrix with N = 5 grid points in each direction. The traditional 5-point Laplacian stencil is marked with black circles. Contributions to the matrix are separated into contributions from the volume (red +) and from the boundary enforcement (red ×), so that contributions from both (red *) cancel deviations from symmetry and render the matrix SPD.

Figure 3 :
Figure 3: Schematic of the 2D computational domain; nodes are denoted with solid black dots. mfA!() modifies interior nodes, denoted with circles. For face 1, contributions to mfA!() from coordinate transformation matrices modify the nodes marked with different shapes. Calculations by the face 1 boundary operator modify the nodes in green squares.
Figure 6 :
Figure 6: The red dot on the left represents the performance achieved by our kernel at its measured arithmetic intensity (0.28). The red dot on the right represents the same performance, but assuming data is loaded from DRAM only once (i.e., only compulsory misses), which yields a higher arithmetic intensity (1.85). The fact that our kernel achieves higher performance than predicted by the Roofline model at the measured intensity suggests that a large portion of our data is served from the caches.

Table 5 :
Table 5: Number of nonzero values (nzval) for CSC or CSR sparse matrices with different N, where the matrix size is (N+1)^2 × (N+1)^2. The matrices are SPD. Here, m represents the number of rows and nzval the number of nonzero values. The total memory size (last column) is calculated using the previous columns.