Mat2Stencil: A Modular Matrix-Based DSL for Explicit and Implicit Matrix-Free PDE Solvers on Structured Grid

Partial differential equation (PDE) solvers are extensively utilized across numerous scientific and engineering fields. However, achieving high performance and scalability often necessitates intricate and low-level programming, particularly when leveraging deterministic sparsity patterns in structured grids. In this paper, we propose an innovative domain-specific language (DSL), Mat2Stencil, with its compiler, for PDE solvers on structured grids. Mat2Stencil introduces a structured sparse matrix abstraction, facilitating modular, flexible, and easy-to-use expression of solvers across a broad spectrum, encompassing components such as Jacobi or Gauss-Seidel preconditioners, incomplete LU or Cholesky decompositions, and multigrid methods built upon them. Our DSL compiler subsequently generates matrix-free code consisting of generalized stencils through multi-stage programming. The code allows spatial loop-carried dependence in the form of quasi-affine loops, in addition to the Jacobi-style stencil’s embarrassingly parallel on spatial dimensions. We further propose a novel automatic parallelization technique for the spatially dependent loops, which offers a compile-time deterministic task partitioning for threading, calculates necessary inter-thread synchronization automatically, and generates an efficient multi-threaded implementation with fine-grained synchronization. Implementing 4 benchmarking programs, 3 of them being the pseudo-applications in NAS Parallel Benchmarks with 6.3% lines of code and 1 being matrix-free High Performance Conjugate Gradients with 16.4% lines of code, we achieve up to 1.67× and on average 1.03× performance compared to manual implementations.


INTRODUCTION
1 Numerical solutions of partial differential equations (PDEs) are crucial in a wide variety of scientific computing applications, spanning from weather prediction [Skamarock et al. 2019] to seismic imaging [Mcmechan 2006].Many of these applications employ structured grids, primarily Cartesian grids, for spatial discretization.These grids manifest as 2-dimensional or 3-dimensional arrays, with calculations executed through algorithm-driven traversal over the array elements.In the past, scientists employed general-purpose programming languages, such as FORTRAN and C/C++, to manually develop solvers for PDEs on structured grids.This process required extensive use of "for" loops and arithmetic operations, often leading to solvers encompassing thousands of lines of code.As multi-threading and single instruction multiple data (SIMD) technologies have advanced in modern computing, the manual development of such solvers is progressively becoming less practical and less effective.A natural direction to address this issue is to translate the operations on structured grids to manipulations of sparse matrices, which we term the matrix-based approach.The PDE solving procedure thus becomes multiplying sparse matrices to vectors (SpMV) and solving sparse linear systems according to the matrix.This approach thus typically employs general library routines [Intel 2023;Virtanen et al. 2020] to work on the user-provided matrices.However, there are two critical issues.First, constructing the sparse matrix for the PDE problem necessitates significant human effort due to the complexity of real-world equations and their solvers.Second, as these libraries serve general purposes, they must read the matrix and analyze it at runtime to cater to a broader range of applications, including unstructured grids.Consequently, they have limited ability to efficiently exploit structured sparsity patterns.Despite covering the domain well, the matrix-based approach suffers from unresolved programmability and performance issues.
However, matrix-free solvers that directly target the equations are typically handwritten in languages such as C/C++ or FORTRAN, making them labor-intensive to implement.Domain-specific languages (DSLs) and compilers have emerged as promising approaches to simplify programming and enhance performance.By eliminating matrices, the computations involved in solving partial differential equations (PDEs) are reduced to stencil operators, which iterate over grid points and compute new values based on their neighborhoods.A well-known class of stencil operators is the Jacobi-style stencils, which read and write different arrays at each time step and exhibit no spatial dependence, which is the dependence between different spatial locations at the same time step, as illustrated in Fig. 1a.In contrast, Seidel-style stencils read and write the same arrays, thereby introducing spatial dependence, as seen in Fig. 1b, which makes them challenging to automatically parallelize and optimize.While some DSLs, such as [Lengauer et al. 2020;Louboutin et al. 2019], have successfully simplified the programming of PDE equations and solvers by primarily targeting Jacobi-style stencils, they fail on many algorithms containing Seidel-style stencils due to limited abstractions.On the other hand, domain-specific compilers, such as [Bondhugula et al. 2017[Bondhugula et al. , 2008;;Verdoolaege and Janssens 2017], support both types of stencil operators through advanced dependence analysis techniques but lack abstractions for PDEs, necessitating significant human effort to program a solver with nested loops.
In summary, we recognize several challenges in structured grid PDE solving: ❶ a labor-intensive process to express the equation and the solver; ❷ a conflict between expressiveness and the matrix-free property; ❸ the absence of effective automatic optimization for Seidel-style stencils.
To address these challenges, we introduce a novel DSL called Mat2Stencil, designed for both explicit and implicit solving algorithms for PDEs, while ensuring its expressiveness, modularity, and performance.Our contributions are as follows: • We enable modular programming of equations and their solvers through a high-level abstraction of structured sparse matrices.It utilizes a set of simple matrices and their arithmetic operations to compose more complicated problem matrices.These pre-defined matrices reflect the various steps involved in discretizing partial differential equations (PDEs) and enable user-friendly problem construction in PDE-solving processes.• We enhance expressiveness by introducing a low-level abstraction for structured sparse matrices based on row functions.It allows users to effortlessly define their custom sparse matrices as well as implement custom algorithms against any structured sparse matrices.The design is kept simple and compile-time resolvable, to enable further lowering to matrix-free implementations.
• We compile structured sparse matrices user code to matrix-free target code by expanding the sparse matrices during the first stage of our multi-stage programming infrastructure.
In specific, we backtrack and discuss different cases of loop indices to fill the gap between compile-time (stage 1) and runtime (stage 2) control flow, generating matrix-free code consisting of regular quasi-affine loops.• We develop a novel automatic parallelization technique, SewTh (abbrev.for Sewing the Threads), based on polyhedral dependence analysis, to enable efficient multi-threaded execution of the more generalized stencil loops with spatial dependencies.This approach partitions the loop domain across threads and automatically generates fine-grained synchronization when dependencies cross thread boundaries.
Altogether, our new DSL is capable of expressing a much wider range of explicit and implicit PDE solvers.We evaluate our DSL by implementing four benchmarking programs from the NAS Parallel Benchmarks (NPB) [Bailey et al. 1991] and High Performance Conjugate Gradients (HPCG) [Dongarra et al. 2016], which includes Symmetric Gauss-Seidel, Symmetric Successive Over-Relaxation, and direct solving of multi-diagonal matrices (degenerated from Incomplete Lower-Upper decomposition) as the implicit solving kernels with spatial dependence.Our performance reaches up to 1.67× performance in NPB compared with the original FORTRAN implementation, with only 6.3% lines of code; we also finish HPCG implementation in 16.4% lines of code compared to a manually matrix-free implementation.The geometric mean of our performance over all the programs is 1.03× of the highly-optimized matrix-free human implementations.

BACKGROUND AND MOTIVATION
In this section, we discuss the current status and problems in differential equation solving, and then the background of techniques we will employ in this paper, including multi-stage programming and polyhedral analysis for automatic parallelization.

Solving Differential Equations on Structured Grids
In this subsection, we demonstrate existing approaches to solving differential equations on structured grids.We use the simple equation in Equation (1) as an example in which we want to solve  (, ). (1) To solve it numerically through Finite Difference [Strikwerda 2004], the variable  is discretized temporally to  Δ and variable  spatially to Δ (0 ≤  ≤ , Δ = 1/), in which  for timesteps and  for grid point number are both integers.We thus note  ( Δ, Δ) as  ( )    in short.If we discretize / as ( ( +1)    − ( )   )/Δ, we have the explicit numerical scheme in Equation ( 2), which is a iterative approach to compute  one by each timestep: (2) However, such explicit schemes sometimes suffer from stability problems.Instead, if we discretize / as ( ( )    −  ( −1)

𝑖
)/Δ, we then derive an implicit numerical scheme in Equation (3), which requires solving a linear system at each timestep:

𝑛
where  = Δ/(Δ) 2 . (3) We see that explicitly solving differential equations only requires straightforward computation, while an implicit method will need to solve systems of linear equations.To solve the sparse linear systems, direct methods (usually for multi-diagonal matrices) and iterative methods using Jacobi, Gauss-Seidel, Incomplete LU decomposition, etc., have been employed.Though in different patterns, the existing methods for both cases can be classified into two: one is matrix-based, generates sparse matrices, and uses general routines for sparse linear algebra; the other is matrix-free and targets the math formulas directly.
2.1.1Matrix-based Approach: General Sparse Linear Algebra.One general method to deal with the grids that PDEs are discretized onto involves sparse matrices.Regarding Equation (3), the linear system to be solved can be formulated in matrix form as follows: Users may thus generate the sparse matrix by themselves and use general sparse linear algebra routines to solve their equation.Similarly, the explicit form in Equation ( 2) is formulated as a similar sparse matrix-vector multiplication, or SpMV.Thus, by implementing those algorithms against arbitrary sparse matrices, library providers liberate domain experts from writing those complicated codes, only leaving the generating of sparse matrices.
Yet, generating the sparse matrix via hand-written codes can be annoying.With much more complex equations, the sparse matrix may require more than one page of math formulas to specify, not to mention the code.Besides the difficulty in constructing the matrices, the performance of the matrix-based approach is unsatisfying.Although general sparse routines are highly optimized, require extensive human efforts from expert programmers, and usually cannot be implemented efficiently by science domain researchers, these general sparse routines cannot compete with matrix-free approaches in structured problems, due to the lack of structural information at compile time.
2.1.2Matrix-free Approaches: Stencil by Hand or Automatically.Regarding the structured grids, it is possible to get rid of the matrices and perform computation directly, which we call "matrixfree".Like in Equation ( 2), explicit approaches primarily involve embarrassingly parallel stencil computations in a single timestep.In each iteration, they read neighbor points in the input 2D or 3D tensors with a certain pattern, go through a certain function as its computation, and yield the corresponding value in output tensor(s).There is no data dependence between different grid points if we look at only one iteration.Due to its nature of independent computation, parallelizing such types of stencil computation is straightforward.We denote such stencils as "Jacobi-style" stencils considering that Jacobi iterations match such a computational pattern without purely spatial dependence.Since the computation is highly customized across applications due to different PDEs to solve, and optimization by hand requires massive human effort, many domain-specific languages (DSLs) have been designed to address the optimization of stencil computation, as well as programmability of explicit or Jacobi iteration-based implicit solvers.
While Jacobi-style stencil computations are widely researched, many implicit solvers require operators not covered by such DSLs.Such operators, while still operating neighbors on structured grids, have dependence carried purely by spatial loops on one or more dimensions, which we recognize as Seidel-style stencils, named after the most commonly used Gauss-Seidel iteration but not limited to that particular algorithm.For example, directly solving Equation (3) requires Gaussian eliminating forward and backward, introducing sequential dependence along the  axis.More complicated equations and numerical schemes may require iterative methods such as multigrid solvers or conjugate gradients, which in turn requires routines like Gauss-Seidel iteration, incomplete lower-upper (ILU) factorization, fine-grained parallel ILU [Chow and Patel 2015], and so on.
Although the 1D case introduced here has to be executed sequentially, Seidel-style stencils with two or more dimensions are usually possible to get parallelized, but not as trivial as those Jacobi-style stencils, requiring more human efforts to optimize.For example, the block tri-diagonal pseudo-application in NAS Parallel Benchmarks (NPB-BT) is written in over 400 lines of Fortran code for solving the block tri-diagonal matrix in only one direction, and it has three directions to solve, involving over 1300 LOC.However, such routines are not yet supported by the DSLs targeting PDE solving.As the best among existing, ExaStencils is capable of implicitly solving differential equations, but only through Jacobian and forward Gauss-Seidel iterations in a multi-grid (MG) solver, which is just a particular case with limited computation patterns.
2.1.3Discussion.Due to the difficulty of manually programming Seidel-style stencil operators, the matrix-based representation is usually preferred, especially in the case of implicit solvers.For example, the High Performance Conjugate Gradients (HPCG) Benchmark [Dongarra et al. 2016], a widely recognized effort for ranking top-level supercomputers, enforces sparse matrices and forbids matrix-free approaches in its valid implementations, indicating their concern about the generality of  operator implementations.However, it turns out such prohibition limits performance drastically.It is reported that implementing the HPCG computation in a matrix-free form significantly improves the performance, by 4.67× on the New Sunway supercomputer [Zhu et al. 2021].When possible, HPC researchers still seek matrix-free approaches even for implicit approaches, e.g., the 2016 Gordon Bell Prize winner [Yang et al. 2016] designed and manually implemented a geometry-based ILU that maps the dependence to hardware-supported inter-core communication.
To this end, we seek a solution so that: ❶ the Seidel-style operators are naturally enabled, resolving the coverage issue of stencil DSLs; ❷ the construction of linear problems is easy for users, resolving the programmability issue of the matrix-based approach; ❸ the matrix-free property is retained at runtime, resolving the performance issue of the matrix-based approach.These challenges lead us to a new DSL working with sparse matrices but compiled into generalized stencils.

Multi-stage Programming
Multi-stage programming, or staging, has been a promising method for building embedded domainspecific languages.In such practices of multi-stage programming, a program usually is split into two stages: stage 1 (at compile time) executes the user code partially to generate an optimized intermediate program, and stage 2 (at run time) executes the intermediate program to get the final result.In our design, programmers express the solvers as sparse matrices and their operations on stage 1, which generates matrix-free stage-2 code.
Lightweight Modular Staging (LMS) [Rompf and Odersky 2012] is one of such frameworks and has been adopted in multiple DSL works [Ofenbeck et al. 2013;Shaikhha et al. 2018;Sujeeth et al. 2014].LMS separates the two stages through only types: Rep[T] refers to a T-typed value at the second stage.Since that LMS is a Scala library, we implement our own to build our embedded DSL in Python.
In such an approach, expressions to be delayed to the next stage are used together with normal values.The operators, including conventional ones (e.g.__mul__(a, b) for a * b) and indexed load/store (__getitem__ and __setitem__ for code like a[...]), are overloaded to make user code operating on the delayed expressions seamlessly.When the first stage runs, the codes generated by the overloaded operators for the second stage are spliced together for later compiling.The control flow structures are similarly overloaded to be able to generate control flow codes in the next stage.In particular, our backend, built on top of FreeTensor [Tang et al. 2022], transforms the Python AST to enable virtualized control flow, the details of which will be introduced in Section 5.1.A simple example is shown in Fig. 2.

Polyhedral Analysis for Automatic Parallelization and Optimization
Through multi-stage programming, we produce matrix-free code containing the generalized stencils, in the form of quasi-affine tensor programs.Such programs contain indexed for-loops and branches according to quasi-affine expressions, and access tensors, or multi-dimensional arrays, with quasiaffine indices as well.There have been many studies on optimizing such affine programs since the invention of FORTRAN, which can be summarized as polyhedral analysis.Program analysis and transformations for multithreading, vectorization, and better cache management are summarized in [Padua and Wolfe 1986] and [Allen and Kennedy 2001].As such methods can only be applied on nested loops, they can hardly be used directly on domain-specific user code, but they point out a path towards automatically parallelizing our generated code.
The most famous ones among those compilers are PPCG [Verdoolaege and Janssens 2017] and PLUTO [Bondhugula et al. 2008].While capable of parallelizing computation in many different patterns, they need more specially-designed approaches to achieve efficient stencil computations, e.g.diamond tiling in PLUTO [Bondhugula et al. 2017].However, previous stencil optimizations mainly target Jacobi-style stencils, leaving Seidel-style stencils with vanilla skewing and/or tiling.An example is shown in Fig. 3, which contains two typical methods to parallelize a 2D 9-point Gauss-Seidel loop, produced by the existing automatic approaches.The approach without tiling involves too many global synchronizations and has a poor memory locality.The approach with skewed tiling instead incurs huge starting and ending overhead.As such, they are not efficient at many of the implicit solvers our DSL wants to cover.The development of new techniques resolving the Seidel-style stencils is thus essential for us.Based on the observation mentioned above, we propose Mat2Stencil, our domain-specific language and its corresponding compiler for solving partial differential equations (PDEs) on structured grids.An overview is shown in Fig. 4. Mat2Stencil offers a concise and flexible programming interface for a wide range of PDE solvers.Our structured sparse matrix abstraction enables the modular construction of sparse matrices through linear algebra operations on both predefined and user-defined simple matrices.Furthermore, our DSL separates the algorithms from the matrices, supporting operations ranging from simple stencil operators to complicated Gauss-Seidel iterations and Incomplete LU decomposition.Users write equation solver code that composes simple sparse matrices using arithmetic operations into complicated ones and invokes the algorithms against them.The compile-time resolvable feature enables us to exploit sparsity patterns effectively.The language design will be covered in Section 4. As the first stage of the multi-stage programming frontend, we assemble matrix-free generalized stencil code in FreeTensor IR [Tang et al. 2022], by backtracking and discussing over compile-time undecidable branches.The staging techniques involved will be covered in Section 5.

OVERVIEW OF MAT2STENCIL LANGUAGE AND COMPILER
After the first stage, the backend optimization of Mat2Stencil, which we implemented as a FreeTensor extension, is responsible for optimizing and parallelizing the code before proceeding to stage 2 (or runtime).We propose a novel SewTh parallelizer (Sewing the Threads), which empowers the Mat2Stencil backend to automatically and efficiently parallelize Seidel-style stencils.Additionally, we use a PLUTO+-like [Bondhugula et al. 2016] algorithm to fuse the loops, and perform aggressive inlining of elementwise functions, to optimize memory traffic further.They will be discussed in Section 6.

MAT2STENCIL LANGUAGE DESIGN
In this section, we present the detailed design of the Mat2Stencil language.We will start from the uppermost abstraction, which is on equation solvers, and then go down to the matrices abstraction and expression of linear algebra algorithms within Mat2Stencil.Finally, we present the grammar of Mat2Stencil.

Abstraction over the Equation
In Mat2Stencil, the solving procedure of a PDE is expressed with sparse linear algebra, in which the sparse matrices involved are structured and modular.As is mentioned in Section 2.1.1,constructing the sparse matrix can be problematic for complicated real-world solvers.Instead of requiring users to generate an entire matrix to solve with twisted problem generation code, we enable the assembly of such problem matrices through linear algebra expressions evaluated at compile-time.We present how Mat2Stencil abstracts over the equation in Fig. 5 with an end-to-end example from the Scalar Penta-diagonal (SP) pseudo-application in the NAS Parallel Benchmark (NAS) before we dig into the details.Notice that instead of requiring the user to fill in the non-zeros of the matrices to solve one by one like in other matrix-based approaches, Mat2Stencil only requires users to implement individual modules, including custom structured sparse matrices like dssp4 for ℎ 4 ( 4 / * 4 ) and lambda_y, rho_n, etc. as elementwise functions, then write linear expressions to combine them, which are close to the discretized equations.
The equations to solve.One may notice that derivative operators like / are replaced by some variables like Dx.These variables are predefined matrices that match the discretized differential and help the construction of matrices to solve.We observe that the sparse matrices to solve can usually be constructed from several simple matrices, and the construction procedure is tightly related to the differential formula.
To explain this, we start with the first-order derivative.After discretization, the / differential operator is turned into a sparse matrix D, as is shown below.For simplicity, we write the expanded form of a column vector v as a semicolon-separated list: [ 0 ; ...;   ].

𝑢
Based on that, we similarly have  2   2 discretized as D (−1) D () u, which can be simpler noted as D 2 u since the actual shape of each D can always be inferred from what vector it is multiplying to.Beside that, we further define P = [0; I; 0], so that Pu = [0; u; 0], which represents padding on the boundary.Using this notation, we can easily perform the discretization in Equation ( 3) without dealing with scalars: It can be easily verified that Equation ( 5) is equivalent to Equation (4).It naturally specifies how the matrix to solve is constructed, with a number of arithmetic operations on sparse matrices.Furthermore, since that higher dimensional u will require flattening to a vector, D will be broadcasted into D  , D  , and D  , reflecting the higher-dimensional differencing operations.Similarly, the neighborhood averaging A, which has occurred in Fig. 5b in broadcasted forms, is defined as , so that Au =  0 +  1 2 ; ...; However, computing the sparse matrices via these arithmetics at runtime can be costly, especially noticing that we need multiplications between sparse matrices.Given the compile-time known structure of the structured sparse matrices, we have the opportunity to fully resolve the matrix construction at compile-time.Besides, to generate efficient matrix-free compute kernels, the matrixvector multiplication and sparse linear solvers need to cooperate with such structured matrices at compile time.Given the multi-stage programming infrastructure, the first stage (running in Python) is responsible to deal with this.The following subsections will introduce our design behind, and the staging procedure will be introduced in Section 5.

Sparse Matrices as Row Functions
In Mat2Stencil, a sparse matrix is represented as a row function, which takes a row index and produces non-zeros in the specified row.Each non-zero is a pair of a column index and the corresponding matrix value, in the same way as in the adjacency list.The returned non-zero list is then consumed by sparse linear algebra procedures that define the computation, decoupled from the matrix.Speaking in the multi-stage programming framework, the row index, non-zero column

Predefined Matrices Examples
Diff …… @spmat(lambda n: n -1) def diff(col_domain: CartesianDomain, i: IdxExpr): return [(i, -1), (i + 1, 1)] Padding …… @spmat(lambda n: n + 2) def padding(col_domain: CartesianDomain, i: return [(i -2, 1), (i -1, -4), (i, 5)] HPCG Prolong indices, and values are all staged expressions that will be resolved at runtime, while the non-zero list itself is a compile-time Python list.Some examples are shown in Table 1.A domain is assigned to both the row and column indices, representing their valid range starting from 0. Thus, the row function also takes a parameter col_domain, which contains the information on matrix size and will be described in detail in Section 4.2.1.Decorated by @spmat, which is a Python decorator function we provide, the row function is wrapped into a Mat2Stencil matrix object of type SpMat.The decorator also optionally takes a lambda expression to compute the matrix size relatively, the detail of which will be discussed in Section 4.2.2.Besides, through simple if-else sequences, the special boundary in many structured sparse matrices can be well expressed, and so can periodical sparsity patterns that occur in, for example, prolongation in multi-grid solvers.
Our row function design imitates the adjacency list, corresponding the non-zero cells in the matrix to their rows.This abstraction suits our applications well: an unknown variable in a discretized linear equation system like Equation ( 3) is related to a limited and compile-time known number of other unknowns, which are usually neighbors on the grid.Thus in the matrix representing the linear systems to be solved, the non-zeros in a row are in the columns of the neighbors.It makes the row function a natural design: the existing approach through general sparse linear algebra routines also requires users to generate sparse matrices with similar code, but for the more complex end-to-end matrices instead of our simple modular ones.
Yet, during implementing applications, we observe that expressing a conventional matrix is not sufficient to maximally ease user programming.We extend the sparse matrix in Mat2Stencil in the following aspects and expose optimization opportunities.

Multi-dimensional rows and columns.
Representing the structured problems as sparse matrices requires flattening the multi-dimensional unknowns.It adds burdens on user programming to write the flattened index of the rows and columns.Besides, such flattened access is not friendly to our backend optimization: to address the arbitrary dependence Seidel-style stencils may bring to us, the polyhedral analysis need complete information about the original multi-dimension structure to achieve good performance.
As such, we extend the sparse matrix to higher dimensions on the rows and columns.Such a "matrix" with -D row domain and -D column domain is thus a (, )-tensor in the terminology of General Relativity; yet for convenience of understanding and to match the code, we will continue to note it as a "matrix", or SpMat.A matrix object then has two properties, row_domain and col_domain Each domain object has a tuple of integers for its grid size; in our current implementation, only regular Cartesian grids are considered, but the domain type can be extended easily.An -D point in the row domain corresponds to a row in the matrix consisting of non-zeros, each with a -D point in the column domain and the value of the non-zero in the matrix.Correspondingly, a previously flattened vector becomes a multi-dimensional Vector either, with respect to the multi-dimensional space that the physical quantities span over.A domain is similarly attached as a property to each Vector object.
For example, in the prolongation matrix shown in Table 1 we are already using the multidimensional row index passed into the row function and returning multi-dimensional column indices of non-zeros in the row.If, for example, consumed by the SpMV procedure in Fig. 7, the returned column indices are later used to index the multi-dimensional Vector being multiplied against, thus performing the prolongation step mapping the coarser grid to a finer grid.

Automatically deriving row domain from column.
A structured operation usually applies to many different grids, including grids of different sizes and dimensions.We notice that a given column domain for a structured sparse matrix will decide its row domain.In specific, a SpMat is logically right-multiplied to Vectors, thus the column domain is always decided first.As such, we make the row_domain of an Mat2Stencil matrix always following its col_domain.Walking into details, SpMat does not require specific row or column domains in user codes.At each staging site that we decide its concrete shape, its col_domain will be set through a Python property setter.The setter calls a virtual function set_col_domain to set the column domain and compute the row domain.The @spmat generated SpMat optionally receives a lambda expression as its argument to map its sizes of column domain to row, which is used to implement its set_col_domain.

4.2.3
Inner value as a tensor.Multiple (but a small fixed number of) quantities are often attached to the same grid in real-world applications.In previous matrix-based approaches, these quantities are flattened along with the grid dimensions; after retaining the grid dimensions, we then need to allow the SpMat and Vector to have tensors, instead of plain scalars, as its inner values.They thus have a fixed inner_shape, which describes the shape of their inner tensor values, as an extra parameter in constructing a Mat2Stencil sparse matrix.

Linear Algebra on Mat2Stencil
Through the programming interfaces of Mat2Stencil, linear algebra routines can be easily implemented.Wrapped in a domain loop, such routines can retrieve lists of non-zeros in each row of Mat2Stencil matrices, and perform computation according to their algorithm.Those sparse linear algebra routines are able and expected to be implemented in their simplest form, i.e. directly operating on the adjacency list, without the need for manual optimizations.4.3.1 Sparse matrix-vector multiplication (SpMV).The simplest example is SpMV in Fig. 7, multiplying a sparse matrix and a vector.It first adjusts the domains of the matrix and retrieves the expected domain of the product vector accordingly, the mechanism of which has been covered in Section 4.2.2.
Looping over the domain, it reads corresponding rows through the row method of the matrix object, which is a simple wrap of the above row function.It then multiplies the value from the matrix and vector for each non-zero and sums them up to get the final result, following the math of matrix-vector multiplication.

Sparse matrix multiplication.
With both operands expressed in Mat2Stencil, it is possible to give their product still in Mat2Stencil and being resolved all at compile stage through a row-wise spgemm as is shown in Fig. 6a.
Then, as shown in Fig. 6b, the row function of the resulting matrix first walks through the row of the specified index on the left-hand side (LHS), retrieving the non-zeros.It then uses the non-zeros' column indices to index the rows of the right-hand side (RHS) matrix.All the RHS rows are merged in order, multiplied with the LHS values accordingly, and summed up to one non-zero sequence, producing the result row.All this procedure happens in stage 1 and is thus expanded to plain matrix-free codes before stage 2.
Summing up and subtracting between Mat2Stencil matrices is even simpler: they only require merging two rows from the two matrices, similar to combining the multiple rows from RHS in spgemm.We also implement them for Mat2Stencil but do not include details here.@ft.inline def symgs(A: CartesianDomain, r: Vector, x: Vector): # forward sweep for i in x.domain: 4.3.3Enabling spatial dependence.These linear algebra routines can also contain spatial dependence carried by the domain loops, according to their different algorithms, enabling the implementation of more complicated routines required by PDE solvers.An example is in Fig. 8, in which the loops simultaneously read and write on the same Vector object and introduces matrix-defined dependence.The staging procedure will then produce their matrixfree form constructed of regular nested loops, without any indirect access typical in traditional sparse linear algebra, thanks to the staged domain loop indices (will be covered in Section 5).Our analysis and optimization mentioned in Section 6 will resolve the loop-carried dependence and exploit chances to parallelize.
To support our evaluated applications, we have implemented Incomplete Lower-Upper, Symmetric Successive Over-Relaxation, and Symmetric Gauss-Seidel, each with tens of lines of code.Users with science backgrounds will be able to implement their own routines in Mat2Stencil in usually tens of lines, much easier than in traditional sparse algebra libraries.

Grammar of Mat2Stencil
After describing our language design, we thereby demonstrate the grammar of the Mat2Stencil language.As an embedded domain-specific language built upon type-based staging techniques, Mat2Stencil involves various host language (Python) functions during the staging process, noted as InType ⇝ OutType and may have side effects.Many designs mentioned above work as Python components, and are thus invisible in the grammar, including the row function, structured sparse matrix objects, and linear algebra routines.Without hurting the core idea, it also omits details about "inner value as a tensor" from Section 4.2.3.
Note that we support separate classes of statements and expressions for outside and inside domain loops.Domain loops cannot nest with each other; loading and storing the vectors only happens inside domain loops; and the branches inside domain loops, ite  must be conditioned according to integer arithmetics of integer indices  ultimately introduced by domain loops.Global scalars are useful to aggregate values globally, e.g., dot products.The outer if-else-then ite  is useful for iterative solver algorithms to control the workflow, e.g.check convergence.The aforementioned sparse linear algebra routines, invoked with our structured sparse matrices, are all taken place outside the domain loops and compose   nodes.
Then a sparse matrix will be a host language (Python) object of type: Which contains a function domain_col2row, mapping column domain to row domain, and a function row, for retrieving a row of non-zeros in the matrix symbolically.

Customizable Control Flow Virtualization
Similar to LMS, our staging infrastructure works through control flow virtualization.With a decorator, it automatically transforms a function to its virtualized version: each for-loop and ifthen-else branch is transformed into a corresponding virtual call, allowing the code generator to emit stage-2 code wrapping the loop or branch body.But unlike LMS, which directs all control flow to a central overload object defining the stage-2 language, we choose to make the virtual call point to the iterable or predicate expression itself to gain more customizability.
We demonstrate examples of the code transformation in Table 2. Walking into the details, if the iterable or predicate is of staged type (StagedIterable and StagedPredicate accordingly), the staging handler method of the staged object (foreach and if_stmt accordingly) is called to stage future codes.Otherwise, it automatically falls back to a stage-1 control flow.The logical operations, which are not overloadable per the design of Python, are virtualized similarly for StagedPredicate.This staging infrastructure has been merged into FreeTensor as the basis of its frontend.
It is natural to see that the domain loops iterating through all points in a domain have to be staged as loops in stage 2. Otherwise, the staged code size will bloat out as the problem grows, which is undesired since we will deal with large science problems.As such, we implement Staged-Iterable for the domain objects and emit the nested for-loops according to the domain dimensions in the staging handler.It also allows us to do more work in the staging handler behind the scenes to support the backtracking and discussing which we will introduce next.

Resolve undecidable branches through backtracking and discussing
As we have shown in Section 4.2, we want to enable taking branches with regard to row indices and manipulating stage-1 lists accordingly.Besides, in matrix multiplication, we need similar functionality to merge non-zero sequences according to the comparison results of column indices; more user programs also need such branches.They together raise a vital requirement on branching in stage 1 according to expressions related to the domain loop indices.Formally speaking, the   ite  node shown in Section 4.4 accepts a runtime boolean expression   but coordinates with compile-time then/else bodies as Python functions, which is the time-reversing.However, notice that possible combinations of these branches are available at stage 1, we can split the original loop domain into pieces, making control flows deterministic in a single piece.
To achieve this, we backtrack on every undecidable branch and discuss them to be true or false.As is shown in Fig. 9, we backtrack through raising an exception when an undeterministic predicate is met in the ite  if-else-then branch.Such branches must occur inside some domain loop, the stage 1 handler of which will catch the exception and retry staging the loop body twice, each with the predicate assumed to be true or false.The time-reversing branches thus become regular runtime branches.We implement the boolean expressions   as a sub-class of StagedPredicate and implement the checking and raising logic in its handler to achieve this.
It can be seen that as more branches are met, the number of cases easily grows exponentially with the number of branches.We observe that the predicates in Mat2Stencil codes for structured solving turn out to be primarily duplicated; it is reasonable since the domain dimension is limited to no more than 3 in typical usages, and the branches are mostly related to boundaries or grid coarsening so the number of independent predicates on each dimension will be limited to only a handful.As such, we further utilize the Z3 [de Moura and Bjørner 2008] SMT solver to deduplicate the cases.Every time a predicate is taken to generate the loop body for its case, we append the assumption into the Z3 solver.Once a new predicate is met, we first check its truth value through Z3; if deterministic, we will not perform backtracking and instead directly take the branch.Applying such a scheme, we are able to generate clean case-discussing code for matrix-free algorithms.
Summarizing above, the backtrack and discuss algorithm is shown in Algorithm 1.

BACKEND OPTIMIZATION
In this section, we introduce our backend techniques to parallelize and optimize the generated matrix-free code.The most significant part is SewTh, a novel scheduling algorithm to efficiently parallelize arbitrary Seidel-style stencil operators.Unlike typical polyhedral schedulers, which linearly transform nested loops, optionally tile them, and selectively parallelize, SewTh additionally schedules fine-grained point-to-point synchronizations at certain iterations instead of performing global barrier synchronization.We also implement elementwise function inlining and loop fusion with linear transformation based on the Pluto+ algorithm [Bondhugula et al. 2016].
Algorithm 1 Pseudo-code for the backtrack and discuss algorithm.raise Undeterminant(cond)

Sewing the Threads: A New Approach to Parallelize Seidel-style Stencils
As is mentioned in Section 2.3, automatic parallelization of Seidel-style Stencils has not been explored extensively by existing research.To achieve both good memory locality and low overhead starting, our core idea is to assign part of the loop to each thread and perform fine-grained synchronization to satisfy the dependence across threads.Unlike the existing polyhedral code generation approaches synchronizing with global barriers, we generate point-to-point synchronizations based on single-producer, single-consumer (SPSC) lock-free queues.They are thus on-demand and introduce significantly lower overhead.With the large blocks assigned to each thread, it is rare to perform the synchronization.Such fine-grained synchronization sews the threads together, guaranteeing the correctness of produced code.We cover the details of the work decomposition and what to synchronize, how we compute necessary synchronization from analyzed dependence and generate code for them, and how we double x [14][14], r[12][12]; for (int iy = 0; iy < 12; iy++) for (int ix = 0; ix < 12; ix++) { r[12][12]; #pragma omp parallel for for (int thr_ix = 0; thr_ix < 4; thr_ix++) for (int iy = 0; iy < 12; iy++) for (int ix = 3 * thr_ix; ix < 3 * thr_ix automatically prove the parallelism of our plan.We will use the 2D 9-point Gauss-Seidel presented in Fig. 10a as the example throughout this subsection.And we will continue to abbreviate [ 1 ,  2 , . . .,   ] as bold u. 6.1.1Enforced partitioning to threads.To exploit parallelism from the multidimensional loops with spatial dependence, we first assign certain partitioning to the loop space.We partition as few dimensions as possible in all inner loops without any skewing, while leaving the outermost loop and possibly some inner loops not partitioned.Assume the dimension sizes of the outer domain loops are described by N. Organizing the threads into a mesh, their partition sizes are N  , of length 1 less than N, selected so that components of N  are close to each other.Then we enforce the function mapping an iteration x to a thread to be T(x) =  1 ÷   0 , . . .,  +1 ÷    , . . ., in which x ∈  = {x | ∀, 0 ≤   <   } (the looping range.)Within each thread, it still takes the lexicographical execution order, so that the intra-thread dependence naturally holds.An example is shown in Fig. 10c.
It is then only the dependence crossing thread boundaries, or crossing dependence, remained to be considered.The crossing dependence can be computed from the full dependence set, which is extracted from the program with the help of polyhedral dependence analysis and Presburger arithmetics.To start with, the full dependence set computed by FreeTensor dependence analysis is noted as  = {(a ∈ , b ∈ ) | b depends on a}.For all (a, b) ∈ , b must be computed later than a in any valid schedule.The formal definition of crossing dependence is shown in Definition 6.1, which is also the Presburger arithmetics to automatically compute it.Definition 6.1.The crossing dependence, in the form of a Presburger map from thread pair to dependencies, is computed as below: Synchronizing the crossing dependence guarantees two depending iterations belonging to different threads to be executed in order.It is obvious that such synchronization will not cause deadlock: both the intra-thread execution order and inter-thread synchronization are following the lexicographical order of the multi-dimensional loop indices (also the execution order for sequential input program), making the combined synchronization graph acyclic.We propose to compact the crossing dependence as follows.A crossing dependence is considered to be redundant if there is another crossing dependence between the same thread pair with a later or same source and an earlier or same target.Non-redundant crossing dependence is retained during compaction, i.e. compacted crossing dependence.A visual illustration is shown in Fig. 11: since the sequential order guarantees a ′ happens later than a, b later than b ′ , and another dependence a ′ → b ′ exists, a → b is marked redundant and not synchronized explicitly.The formal definition (which is also the algorithm) is listed in Definition 6.2.Definition 6.2.The redundant crossing dependence is computed as: in which the comparisons are carried out in dictionary order.Then the compacted crossing dependencies are: Then we conclude the following properties we want: Theorem 6.3 (Completeness).With the sequential execution order of each thread and the compacted crossing dependence satisfied by point-to-point synchronization, all crossing dependence are guaranteed to hold.Formally speaking, we have Proof.We will prove this theorem by contradiction.Suppose for some (a We can then repeat the process and get a sequence of different dependence (a 2 , b 2 ), (a 3 , b 3 ), . . ., until reaching some (a  , b  ) ∈  + × (T  , T  ).The dependence set is finite, so we can always drain the  − × (T  , T  ) and arrive at  + × (T  , T  ), resulting in the contradiction above.□ Theorem 6.4 (Can Blindly Synchronize).For every thread pair, the sources and targets of the compacted crossing dependence are in the same order, so blindly synchronizing against the thread will be equivalent to synchronizing against the accurate iteration.Formally speaking, we have Proof.We will prove this theorem by contradiction.Suppose there exists Since  + × (T  , T  ) ∩  − × (T  , T  ) = ∅, this contradicts the assumption.□ Then the compacted crossing dependence set between each thread pair can be computed as a quasi-affine map with the help of ILP libraries, ISL [Verdoolaege 2010] in our case, allowing us to generate code against it.A "wait" block is injected before the loop body so that it waits for each source thread that the current loop iteration depends on, and similarly, a "notify" block but inversed after the loop body.According to Theorem 6.3, the compaction ensures that only synchronizing the compacted dependence will be sufficient.Also, according to Theorem 6.4 the synchronization can ignore which loop iteration the other end is and blindly synchronize with the specific thread.We are thus possible to synchronize simply through efficient atomic counters, one for each thread pair.A code generation example is shown in Fig. 10b.6.1.3Modeling the parallelism.The above work decomposition and synchronization together do not necessarily imply sufficient parallelism.It is possible that most threads are busy waiting and only some of the threads are working all the time under inappropriate parameters.We here propose a model for the parallelism lower bound of the generated code.
Existing methods have been utilizing hyperplanes to determine the time axis that enables parallelism.Recall the parallelizable bands in Fig. 3, these are an example of a valid setting of parallelizable hyperplanes: since the nested loop has 2 dimensions, the hyperplanes become 1D lines.Inspired by those methods, we design a set of thread partitioning hyperplanes corresponding to the enforced partitioning proposed above.Say we have the loop dimensions x = [ 0 ,  1 , ...,   ], 0 ≤   <   , and their inner partition sizes   1 ,   2 , ...,    .Then we define the thread partitioning hyperplanes as Such a partitioning hyperplane ensures contiguous execution on each thread, following the threadlocal lexicographic execution order as specified in Section 6.1.1.A visual example is shown in Fig. 12.Each cube cell represents a loop iteration and chessboard-colored ranges are thread partitions.The yellow cells are the loop iterations executed in parallel if the plane satisfies all loop dependence.If such a hyperplane setting satisfies all dependence, then all loop iterations in the same hyperplane and assigned to different threads will be parallelizable.Formally speaking, we check whether for any dependence pair (a, b) ∈ ,   (a) <   (b), as a sufficient condition for the parallelism.If the check passed, the hyperplanes will reveal a possible parallel execution plan, and serve as a lower bound for the parallelism of our generated parallel implementation, which is already fully parallelized except at the short pipeline start and exit.
Take the two dimensional examples in Fig. 13, in which   1 = 3, so that   ( [, ]) = 3 + .If the thread partitioning hyperplanes successfully satisfy all dependence as is in Fig. 13a, our plan is guaranteed to provide near-optimal parallelism (fully parallel except at the beginning and the end).Instead in Fig. 13b, the thread partitioning hyperplanes do not satisfy all dependence: for the brown dependence,   ( [ + 1,  − 3]) =   ( [, ]); in such circumstances, our plan will yield degraded parallelism.
There is a great chance that the loop dependence in a seidel-style stencil operator satisfies this condition.Given a large grid and usually tens of threads, the thread partitions will be much larger than the stencil pattern, the size of which is determined by the order of partial differential equation and usually a small constant.The hyperplanes can then cover all iterations in the near neighborhood and earlier than the current iteration, satisfying all possible dependence.For example, in Fig. 13b, the hyperplanes fail only after the long dependence is added.The dependence is too long such that it spreads over an entire thread partition.With larger loop spaces and stencils still in such sizes, it will be impossible.Actually, in all evaluation cases requiring Seidel-style stencils in Section 7 including micro-kernels and end-to-end solvers, the thread partitioning hyperplanes satisfy all dependence in every kernel.

Memory optimizations
In addition to basically parallelizing everything, including Jacobi and Seidel-style stencils, we introduce memory optimizations as below.While simple to work with, these optimizations are critical to many real-world applications, since the stencils involved in PDE solving are usually memory-bound, making reducing memory traffic important.
6.2.1 Aggressive inlining of pointwise operations.In Section 4.1 we have mentioned elementwise functions, which allow the user to easily build pointwise operations.Using such components will create intermediate variables that can be easily inlined and eliminated.We perform such inlining whenever possible so that the memory traffic is minimized.The inlining is automatically accomplished by recognizing all elementwise domain loops and calling the FreeTensor schedule inline.
6.2.2 Loop fusion with affine transformation.Loop fusion plays a critical role in memory intensive compute kernels, which is usually the case for the stencils in PDE solving.While previous DSLs can perform loop fusion only based on operator-level dependence information, our added support for customizable spatial loop dependence prohibits such simple approaches.Besides, the modern multigrid methods involve finer and coarser grids, and the loops on them can only be fused with strides.As such, we implement automatic loop fusion also based on polyhedral dependence analysis, which iteratively selects one axis from each of the two adjacent loop nests, performs affine transformation computed similar to the PLUTO+ [Bondhugula et al. 2016] algorithm, and fuses the two loops; the procedure is repeated until no level of loops can be fused in the loop nests.At the program level, the loop nests are iteratively fused, scanned in the forward order, and only fuse loops that do not sacrifice parallelism after fusion.The implementation is accomplished by ourselves instead of invoking PLUTO+ directly.We do not perform tiling additionally since SewTh already exposes a tiled execution order.Other than that, we also automatically vectorize sequential loops with best efforts through the corresponding FreeTensor schedule vectorize; FreeTensor determines if a loop is vectorizable also through polyhedral analysis.

EVALUATION
In this section, we evaluate our domain-specific language design and performance through endto-end benchmarking and backend optimization assessment.We set up our experiments on a dual-socket server, each socket being one Intel Xeon Gold 6126 CPU, with 12 physical cores or 24 logical cores per CPU, counting 24 physical or 48 logical together.The server has 565 GB of DDR4-2666 main memory.The server runs Ubuntu 18.04.4LTS with Linux kernel 4.15.0.We use GCC 12.2.0 to compile the programs, using g++ for C++ and gfortran for FORTRAN, including our generated codes and baseline manual implementations.All parallelism is through OpenMP, including reference codes and our implementation.

End-to-end Benchmarking
We evaluate Mat2Stencil against official manual implementations of all the three pseudo-applications in the NAS Parallel Benchmark (NPB) [Bailey et al. 1991], BT, SP, and LU, which are three solvers of a Computational Fluid Dynamics (CFD) problem, and a matrix-free variant of High Performance Conjugate Gradient (HPCG) [Dongarra et al. 2016], which is a widely adopted performance benchmark solving Poisson equation with a multigrid solver.
We use NPB 3.4.2as the baseline for the NPB evaluation and a hand-written matrix-free HPCG from [Zhu et al. 2021] for the HPCG evaluation.Existing stencil DSLs are not capable of expressing these solvers due to the presence of Seidel-style stencils in them, thus we do not include such baselines.We neither compare with the reference HPCG implementation, given it is matrix-based according to the specification and such comparison will be unfair2 .For NPB, we run the Class B, C, D, and E problems (grid size 1023 , 162 3 , 408 3 , and 1020 3 ) on all three pseudo-applications, the sizes of which are specified in the NPB Specification.For HPCG, we run on grids of size 768 3 , 1024 3 , 1280 3 , and 1536 3 .We show the computed grid points per second, which is the grid size divided by time consumption.The evaluation result is shown in Fig. 14.In the three pseudo-applications of NPB, we achieve an average (geometric mean) of 1.25×, 0.94×, and 0.93× performance, compared to the manual matrix-free implementation, respectively.In HPCG, we achieve an average (geometric mean) of 1.04× performance.The performance is competitive at all scales.We also observe that larger grids yield relatively better performance in our implementation, possibly due to that the Python-native interface overhead is less significant at a larger scale.
7.1.1Comparing the Code Length.We evaluate the human effort required to program matrixfree solvers manually or using our Mat2Stencil by comparing the lines of code for different implementations in Table 3. Mat2Stencil implementation takes 1044 lines of Python code for the majority of work and 329 lines of C++ code as a FreeTensor plugin implementing some auxiliary passes.Based on that, we implement all three NPB pseudo-applications in a total of 750 lines of code, with 485 lines among them to be shared across applications.It takes only 6.3% of the total code length compared to the total of three NAS-provided manually matrix-free implementations.Notably, even considering the lines of code for our compiler, we still involve significantly less code.In addition, the matrix-free HPCG implementation through Mat2Stencil takes 16.4% of the code length in hand-written matrix-free implementation.Mat2Stencil significantly releases the burden of programming efficient matrix-free solvers.

Evaluating Compile time.
To demonstrate the compilation cost of Mat2Stencil, we compare the end-to-end cost for all four evaluated workloads on their largest problems.The time breakdown is shown in Table 4.We observe that the stage-1 time, including the backtracking and discussing, takes a minor part of the compilation time.The majority of compile time is in the backend optimizations, which is reasonable since polyhedral schedulers are known to be costly.However, given larger and more long-running problems than the benchmarks, the compilation will take a smaller   [Bondhugula et al. 2008] on microbenchmarks to demonstrate the extra effectiveness of our optimization pipeline in such circumstances.The sequential C++ codes fed into PPCG and PLUTO are generated by the Mat2Stencil frontend, without going through our optimizations.PPCG is invoked with outer-coincidence enforced3 , which is vital for it to parallelize the test cases with loop-carried dependence.We enable diamond tiling in PLUTO to get the best performance on it.We use PLUTO but not PLUTO+ because PLUTO is better maintained and PLUTO+, as an improving fork of PLUTO, focuses on enabling negative coefficients in the linear transformations, which is not useful in our microbenchmarks and will not bring performance improvement.The microbenchmarks first include symmetric Gauss-Seidel (SymGS) and Incomplete LU (ILU) preconditioners with a 3D 27-point stencil on a 1536 3 grid.We also evaluate multiple Jacobi iterations (JacIter) with a 2D 9-point stencil on a 4096 2 grid, which only involves temporal dependence and is well-optimized by previous stencil optimizations.
The result is shown in Fig. 15.PPCG consistently performs worse than Mat2Stencil, even on Jacobi iterations; it is reasonable since PPCG only performs linear transformations.PLUTO times out during transpiling the ILU code, taking more than one hour.Other than that, Mat2Stencil outperforms PLUTO by 3.14× on SymGS, demonstrating the effectiveness of our SewTh parallelizer.Yet, Mat2Stencil achieves only 41.4% performance of PLUTO on JacIter, showing that the diamond tiling gives PLUTO outstanding performance on JacIter, but is not available for the Seidel-style stencils.However, such temporal tiling methods will not help in complicated implicit solvers, since too many loops with their own spatial dependence are attending a single timestep.It can be seen that both inlining and loop fusion helps a lot in NPB pseudo-applications.The nonlinearity in their solved Navier-Stokes equation results in a number of elementwise functions and Jacobi-style stencil operators, which will introduce noticeable memory traffic and footprint if not inlined.Besides, for BT and SP, we observe that the direct solving happens in the three distinct axes, each sweeping forward and backward.Fusing the forward and backward sweeping loops greatly reduces the size of intermediate tensors by removing two unrelated dimensions, thus significantly improving the performance; the same optimization is manually implemented in the baseline.Such loop fusion, requiring permuting and reverting the nested loops, is only possible to be automized with a smart enough loop fuser that can select axes and perform affine transforms automatically.
Instead, SewTh-parallelized HPCG already yields good performance matching the manual implementation, and further optimizations help little.The reason is that HPCG solves a simple Poisson equation through the complicated multigrid conjugate gradient with the SymGS preconditioner and thus ❶ requires no elementwise functions, and ❷ incurs most of the time cost in SymGS.

DATA-AVAILABILITY STATEMENT
The implementation of Mat2Stencil and baselines corresponded to the evaluation results are available at [Cao 2023] together with the running scripts.No external dataset is used in the evaluation.

RELATED WORK
Solving Differential Equations through Matrix-based Approaches.Numerically solving differential equations has been widely treated as sparse linear algebra computation.Sparse matrix-vector multiplication (SpMV) [Merrill and Garland 2016;Pinar and Heath 1999;Williams et al. 2010] are well-studied, covering explicit solving and Jacobi iterations.Other routines are also available in sparse linear algebra, e.g.SuperLU contains a general-purpose efficient ILU implementation [Li and Shao 2011].SciPy [Virtanen et al. 2020] is a commonly used Python library for scientific computing and covers many such routines.
While those approaches decouple the equation and solvers, they are too general to be highly optimized against structured grids.Borrowing their idea of representing differential equation solving in sparse linear algebra, we propose our Mat2Stencil based on matrix abstraction, enabling easy algorithm programming while persisting the performance benefits of matrix-free codes.It is also worth mentioning that [Augustine et al. 2019] takes a totally different approach to optimize a solver already using sparse linear routines, by automatically analyzing the pattern of the sparse matrix and reconstructing matrix-free codes with the best effort.
Solving Differential Equations with Matrix-free Approaches.The best-studied area of matrix-free approaches is Jacobi-style stencil computations, including with or without time iterations [Habich et al. 2009;Kamil et al. 2006;Nguyen et al. 2010;Williams et al. 2007].Early automatic optimization techniques include [Datta et al. 2008;Krishnamoorthy et al. 2007], which work on manually written nested loops.In order to provide easier interfaces for programming stencils, researchers have come up with many stencil libraries and DSLs, including [Huang et al. 2019;Lengauer et al. 2020;Louboutin et al. 2019;Pieper et al. 2021;Tang et al. 2011;Zhang et al. 2017].Among those, Devito [Louboutin et al. 2019] and ExaStencils [Lengauer et al. 2020] present a set of program representations that is pretty much close to the math formula of the differential equation, being the easiest ones in programming.
However, none of these works fully enables programming implicit solvers, including direct solving of multi-diagonal matrices, (Symmetric) Gauss-Seidel iteration, (Symmetric) Successive Under/Over-Relaxation, Incomplete Lower-Upper [Saad 2003], Fine-Grained Parallel Incomplete Lower-Upper [Chow and Patel 2015], etc.The great diversity of such algorithms makes it hard for DSLs to cover all of them.Matrix-free approaches on implicit solvers have been requiring manual optimizations, for example, in [Zhu et al. 2021] and [Yang et al. 2016].[Essadki et al. 2023] introduces automatic code generation for "in-place stencils", which is noted as Seidel-style stencils in this paper and helps the construction of implicit solvers, but not includes advanced high-level abstractions required for easier programming of implicit solvers.Their methodology will also partially degenerate to existing polyhedral approaches for some stencil shapes, e.g. 9 point on 2-D or 27 point on 3-D, in which case our SewTh will perform significantly better.We cannot compare with their work in evaluation due to absence of their implementation.
Multi-stage Programming.Our embedded DSL implementation is inspired by Lightweight Modular Staging (LMS) [Rompf and Odersky 2012], which represents multi-staged programming through types in Scala.We adopt their type-based design to achieve user-transparent staging.There has also been a Python implementation for it named Snek [Decker 2019] that is similar to our staging infrastructure in design, but it does not perfectly feed our needs, thus we implement our own with small tweaks to make the control flow virtualization customizable for different staged types.Other approaches, e.g. the quote-splice in Scala 3.0 [Stucki et al. 2018] and lift-annotated MetaML [Taha 1999], will not fit us well since we do not expect DSL users to get in touch with the staging details.
Tensor Programs.Our backend generates generalized stencils containing quasi-affine loops that manipulate multi-dimensional arrays, or tensors.Our optimizations originate from various previous works on these types of programs.Many studies have been taken on optimizing tensor programs since the invention of FORTRAN.Compiling optimization techniques, especially polyhedral compilation, targeting general tensor programs, can be applied automatically [Allen and Kennedy 2001;Bondhugula et al. 2016Bondhugula et al. , 2017Bondhugula et al. , 2008;;Feautrier 1992;Padua and Wolfe 1986;Verdoolaege and Janssens 2017], but the resulting performance is far from optimal for the important Seidel-style stencils.Nevertheless, compilers implementing such optimizations can be used as a backend to optimize further our solver programs generated by Mat2Stencil and we have evaluated against some of them.Among recent works, FreeTensor [Tang et al. 2022] is a language and compiler to optimize such a tensor program.FreeTensor provides a flexible Python interface to construct the tensor program, which is ideal as the target language of our multi-stage programming.It further provides a set of code-transforming schedules, as performance-tuning primitives.We use FreeTensor as our backend and our proposed optimizations are implemented as a FreeTensor plugin.

DISCUSSION
While Mat2Stencil has proposed a set of optimizations for the Seidel-style stencils involved in implicit solvers, the optimizations for conventional Jacobi-style stencils are left behind.It thus leaves huge potential on such optimizations, such as temporal blocking (when possible, it is sometimes forbidden by backward iterations), cache-aware tiling, etc. Applying such techniques to Seidel-style stencils is a challenging problem left to be solved.
Other than multithreading on CPUs, the latest advancement in accelerators, especially GPGPUs (general-purpose graphics processing units), has gained more and more attention.The SewTh does not fit into their programming model, which does not allow arbitrary inter-block synchronization.Algorithmic adjustments are essential as a result, involving techniques like multi-color reordering; incorporating such changes, new compilation techniques need to be developed to automatically parallelize Seidel-style stencils to make them efficiently run on GPGPUs.Automatic distributed code generation is also possible through polyhedral methods analyzing loop-carried dependences, which has been explored previously [Baghdadi et al. 2019].
Though our current abstraction focuses on Cartesian grids, other types, e.g., star-shaped or hexagonal ones, are possible to be supported with reconsidered predefined matrices.Speaking even further, semi-structured grids such as block-structured grids might also benefit from our method by combining the unstructured matrix-based representation at the coarse level and our sparse matrix representation for the finer grids, requiring more programming optimization designs.Besides, our cases only cover the Finite Difference method solvers and new matrices would be required to address Finite Volume and Finite Elements methods.

CONCLUSION
We propose an innovative domain-specific language (DSL), Mat2Stencil, for solving partial differential equations (PDEs) on structured grids.Mat2Stencil introduces a structured sparse matrix abstraction, facilitating modular, flexible, and easy-to-use expression of solvers across a broad spectrum of PDEs, including explicit and implicit solving algorithms.We evaluate our DSL by implementing four benchmarking programs from the NAS Parallel Benchmarks and High Performance Conjugate Gradients, achieving up to 1.67× and on average 1.03× performance compared to manually matrix-free codes, with significantly less code.
The dependence graph in an implicit solver, demonstrating a symmetric Gauss-Seidel iteration.Critical path marked red.Fig.1.The different computation patterns of explicit and implicit solving algorithms, with a 1D heat equation (Equation (1)) as an example.

Fig. 2 .
Fig. 2. A simple example for type-based multi-stage programming.
typical skewing parallel implementation.Colored bands are executed in parallel, constructing a skewed "time" axis  .Poor locality.X Y (b) A typical skewed tiling parallel implementation.Tiles with the same color are executed in parallel, while each tile is executed sequentially by one thread.High startup overhead.Fig. 3. Automatically-produced polyhedral schedules for 2D 9-point Gauss-Seidel.

Fig. 9 .
Fig. 9.The procedure of backtracking and discussing for resolving undecidable branches.
Fig.10.An example of SewTh that parallelizes a 2D 9-point Gauss-Seidel operator.Equivalent C++ code is presented instead of FreeTensor IR for better readability.
Good parallelism passing the thread partitioning hyperplanes check.
Degraded parallelism detected by the thread partitioning hyperplanes.

Fig. 13 .
Fig. 13.Two examples for checking parallelism with the thread partitioning hyperplanes.

Table 1 .
Examples of sparse matrices in Mat2Stencil.

Table 2 .
Examples of code transformation for control flow virtualization.Standalone temporary function definitions are used in real implementation instead of lambda expressions to enable multi-line bodies.

Table 3 .
Lines of code in evaluated cases under different implementations.Longer than others due to its unique extra math formulas, mentioned in Section 4.1.fraction of the total time even if the application runs only once, making Mat2Stencil practical for real applications. ‡

Table 4 .
Comparing the compile and running time (in seconds) of Mat2Stencil-based programs.The running times listed are at the maximum tested scale.
Ablation study on memory optimizations.To analyze the performance gains of the memory optimizations, we conduct ablation tests on previous benchmarking cases: CLASS D for NPB pseudoapplications, and grid size 1024 3 for HPCG.We start from SewTh-only and incrementally introduce inlining and loop fusion.Together with the manually implemented baseline, the performance numbers are shown in Fig.16.
Fig. 16.Results of ablation study on memory optimizations.ST is SewTh-only, +inl.adds inlining for elementwise ops, +fuse then adds loop fusion.Man. is manually optimized implementation.Lower is better.