Stencil Computation with Vector Outer Product

Current architectures are equipped with matrix computation units to accelerate AI and high performance computing applications. Matrix multiplication and the vector outer product are two basic instruction types. The latter is lighter since its inputs are vectors. Thus it provides more opportunities to develop flexible algorithms for problems other than dense linear algebra computing and more possibilities to optimize the implementation. Stencil computations represent a common class of nested loops in scientific and engineering applications. This paper proposes a novel stencil algorithm using vector outer products. Unlike previous work, the new algorithm arises from the stencil definition in the scatter mode and is initially expressed with formulas of vector outer products. The implementation incorporates a set of optimizations to improve the memory reference pattern, execution pipeline and data reuse by considering various algorithmic options and the data sharing between input vectors. Evaluation on a simulator shows that our design achieves a substantial speedup compared with vectorized stencil algorithms.


INTRODUCTION
Matrix registers and associated operations have recently been added to processors as powerful units for accelerating modern AI and high-performance computing applications. As General Matrix Multiplication (GEMM) is a fundamental computation pattern in these applications, many commercial hardware accelerators, such as NVIDIA Tensor Cores [44], Google TPUs [17] and Intel AMX [26], have been designed to incorporate matrix-matrix multiplication instructions that offer O(n^3) arithmetic operations over O(n^2) operands. The arithmetic density, i.e., the number of floating-point operations performed per byte, is O(n), one order of magnitude greater than that of other traditional instructions.
It is well known that GEMM can be implemented with inner products or outer products. Inner product instructions exist in some architectures, e.g., ARMv8.4. However, an inner product completes only O(n) arithmetic operations. On the contrary, the outer product performs O(n^2) calculations in a single instruction. It can also be used as the building block of other computations such as triangular solvers. Several architectures have added outer product accelerator units to their processors, such as the Scalable Matrix Extension (SME) in ARM [55] and the Math Matrix Accelerator (MMA) in IBM Power10 [25].
The outer product is fed with two vectors and produces a matrix. Each element in the output matrix is a product of the corresponding values in the two input vectors. The outer product is clearly more lightweight than matrix multiplication. It provides more opportunities to develop flexible algorithms for problems other than dense linear algebra computing, and more possibilities to optimize the instruction pipelines, data organization and data reuse among input vectors. Take ARMv9-A [4] as an example: its set of new instructions includes data movement from SVE to SME, data movement from SME to SVE and memory, vector outer product calculation, etc.
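As a concrete illustration, the element-wise definition and the accumulate-into-matrix form of the outer product can be sketched in NumPy (np.outer stands in for the hardware instruction; the vector contents are arbitrary):

```python
import numpy as np

# Two input vectors; a real SME/MMA instruction would read them
# from vector registers.
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# out[x][y] = a[x] * b[y]: an O(n^2) result from O(n) operands.
out = np.outer(a, b)

# The accumulating form, ZA += a (x) b, as computed by the SME
# FMOPA instruction, folds n^2 additions in as well.
acc = np.zeros((3, 3))
acc += np.outer(a, b)

assert out[2, 1] == a[2] * b[1]   # 3.0 * 20.0 == 60.0
```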
The stencil computation is identified as one of the thirteen Berkeley motifs [5] and represents a very common class of nested loops in scientific and engineering applications, dynamic programming, and image processing algorithms. Stencils can be classified in terms of the dimension of the space grid (1D, 2D, ...), the shape (box, star, ...) and the order.
Most of the existing work adopts the conventional gather view of the stencil definition, where each output element is calculated by summing its neighboring input elements. This perspective makes it difficult to utilize outer products for stencil computations. The scatter view of a stencil [66] describes the updating pattern of one single input element. This standpoint has been used for fine-grained in-core optimizations, e.g., improving the data reuse [12] and eliminating register spilling [51] by combining the gather and scatter views for a single dimension and a single element, respectively.
This paper proposes a novel stencil algorithm using vector outer products. Unlike previous work, the new algorithm arises from the stencil definition in the scatter mode. The algorithm for a 2D stencil is initially expressed with formulas of vector outer products through a series of extensions to the scatter mode. It is then adapted to other stencil types by identifying a key concept underlying the algorithm, and it is also analyzed theoretically in terms of this concept. The implementation incorporates a set of optimizations to improve the memory reference pattern, execution pipeline and data reuse by considering various algorithmic options and the data sharing between input vectors. This paper makes the following contributions:
• We propose a stencil algorithm using vector outer products. The theoretical aspects of the algorithm are also analyzed.
• We implement the algorithm with a set of optimizations to boost the in-core performance. We also design an automatic code generator.
• We demonstrate the effectiveness of the proposed algorithm, showing substantial performance improvement over a range of 2D and 3D stencils.
The remainder of this paper is organized as follows. Section 2 provides background on matrix accelerator units and stencil computations. Section 3 describes the stencil algorithm that utilizes vector outer product operations and offers a detailed theoretical analysis. Section 4 presents a set of optimizations on memory references and data reuse. Section 5 evaluates the performance on a simulator using various 2D and 3D stencils. Section 6 reviews related work and Section 7 summarizes the paper.

BACKGROUND

Vector and Matrix Extension
Data-level parallelism plays a vital role in boosting the performance of a single CPU. Hardware developments have produced a large variety of vector extensions to common CPU architectures, e.g., SSE, AVX, AVX2 and AVX-512 for x86, NEON and SVE for ARM, and RVV for RISC-V.
To further meet the increasing demand for high-throughput GEMM, many commercial CPUs and accelerators have incorporated special extensions to perform matrix computations. Generally, there are two architecture types. The first provides matrix-matrix multiplication units supporting various matrix sizes. Intel's Advanced Matrix Extensions [26], NVIDIA's Tensor Cores [44] and Google's Tensor Processing Units [17] are representatives.
The other type incorporates lightweight vector outer product units. IBM's Math Matrix Accelerator (MMA) [25], Apple's matrix coprocessor and ARM's Scalable Matrix Extension (SME) [55] fall into this type. SME offers a maximal vector length of 1024 bits, longer than the 128-bit vectors in MMA. Furthermore, each matrix register in MMA occupies the storage of eight vector registers. Thus SME allows more optimization opportunities with more vector and matrix registers: it is desirable to allocate more registers as input vectors while keeping additional storage for large output matrices. Therefore, in this work we focus only on the SME of ARM.
The ARM Scalable Matrix Extension (SME) introduced with ARMv9 enhances the CPU architecture's support for matrix operations. SME works with the existing Scalable Vector Extension (SVE) and provides additional matrix tile storage and outer product instructions using SVE registers as input vectors. SME defines architectural state capable of holding two-dimensional matrix tiles, a streaming SVE mode which supports execution of SVE2 instructions with a vector length that matches the tile width, and load, store, and move instructions that transfer a vector to or from a tile row or column. The total matrix register file size is fixed. Since the row number is also fixed and matrix registers require a square shape, the available number of matrix registers depends on the element data type. The fundamental computation instruction accumulates the outer product of two vectors into a matrix register.

Stencil Computation
A stencil is defined on a structured physical grid, so the variables on grid points are generally stored in arrays. Moreover, a stencil computation involves a temporal evolution of the variable associated with each grid point. Specifically, it computes the value at time t+1 using neighbor values at time t. Usually, the implementation utilizes two array copies A and B, and the stencil computation along the time dimension reuses the two data copies alternately as the input and output array. This work only considers the algorithm inside one time step from time t to t+1 and always uses B as the output array at time t+1 and A as the input array at time t.
The vectorization of stencil computations is straightforward since the data dependences are all carried by the outermost time loop. Generally, one can rely on compilers to generate code that utilizes hardware SIMD units. The matrixization of stencil computations has also been studied recently [34, 56]. Both approaches start from the conventional gather view of the stencil.
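For later reference, one gather-mode time step of the 2D9P stencil follows directly from this view (a NumPy sketch; the function and array names are illustrative):

```python
import numpy as np

def stencil_2d9p_gather(A, C):
    """One time step of a 2D9P stencil in gather mode: each interior
    output point is a weighted sum of its 3x3 neighborhood in A."""
    B = np.zeros_like(A)
    N, M = A.shape
    for i in range(1, N - 1):
        for j in range(1, M - 1):
            # Gather the 3x3 neighborhood and reduce it.
            B[i, j] = np.sum(C * A[i-1:i+2, j-1:j+2])
    return B

A = np.arange(25, dtype=float).reshape(5, 5)
C = np.full((3, 3), 1.0 / 9.0)        # simple averaging weights
B = stencil_2d9p_gather(A, C)
assert np.isclose(B[2, 2], A[1:4, 1:4].mean())
```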

DESIGN

Observations
We present several observations on the current architecture specifications, covering matrix storage, matrix accumulation and vector products. They provide insights into the algorithm design in this section and the algorithm implementation described in the next section.
Firstly, matrix register assembly is achieved either by vector-register-to-matrix-register movement or by memory loads/stores. Both approaches operate at the vector granularity, i.e., they demand n instructions to fully fill all the n^2 elements of a matrix register. There are no intra- or inter-matrix-register reorganization instructions. For example, to transpose a matrix register on current architectures, one has to move all its columns to vector registers (or memory) and then move them back as rows of another matrix register. Thus it either consumes more vector registers or incurs more memory references. On the other side, the architectures often provide a substantial set of data reorganization instructions for vectors. Consequently, assembling data with vector registers is cheap and flexible.
Secondly, the outer product instruction involves a matrix register as both input and output. Each element in the matrix register is added to the corresponding result of the outer product. Thus the outer product offers n^2 addition operations besides the n^2 multiplications. For the stencil kernels targeted in this work, the numbers of multiplication and addition operations are approximately equal. Therefore, it is desirable to map all the stencil computations to outer products and fully utilize the 2n^2 flops per instruction.
Thirdly, compared with the matrix-matrix multiplication instruction, the outer product operation only provides a theoretical O(1) arithmetic density due to the size of the output matrix. However, if the output matrix serves as an intermediate accumulator that is reused by a set of outer products, the outer product can still be viewed as having a high computation intensity: O(n^2) operations over input vectors of O(n) size.
The above three observations motivate us to reduce the movement of output elements and keep them in a matrix register as long as possible. Meanwhile, we resort to the rich set of vector reorganization operations to assemble the vectors demanded for updating the fixed output matrix.
Finally, an outer product operation creates a two-dimensional matrix from two one-dimensional vectors. The operation follows a scatter mode where a single element of an input vector is scattered to a column or a row of the output matrix. This motivates the reformulation of the stencil computation in a similar scatter mode. It also indicates that, when relating the outer product operation to the physical grid, the operation is meaningful only if the two input vectors lie along linearly independent directions.
With this view, our algorithm is not applicable to one-dimensional stencils.

Basic Formula
A stencil represents an updating pattern of neighboring points. It is often defined in a gather mode, i.e., a formula identifying the calculation of a new point B_{i,j} using its neighboring points from A. Equation (1) shows the 2D9P stencil: the new value at point (i, j) on the two-dimensional grid is calculated by multiplying its 9 neighbors with the corresponding coefficients and gathering the results with a reduction operation.
Consequently, the 2D9P stencil can be identified simply by its coefficient matrix in gather mode, C^g, as shown in Equation (2). C^g is a (2r+1) × (2r+1) matrix, where r is the stencil order. For the 2D9P stencil, we have r = 1.
In essence, the gather representation resembles an inner product of two vectors: one vector contains all 9 coefficients and the other assembles the corresponding neighbors for each output grid point. This opposes the principle of the vector outer product. First, it is impossible to assemble an output matrix of B that gives rise to reasonable vector organizations for coefficients and input values. Second, it cannot perform the addition operations on the input vector side with outer product instructions. Since the outer product is the basic target operation of this work, it is hard to develop an algorithm from this perspective.
Fortunately, a stencil can also be described in a scatter mode. The scatter mode shows how just one point of the input array updates its neighbors in the output array. Equation (3) shows the 2D9P stencil in the scatter mode: the input value A_{i,j} is multiplied with the coefficients and the results are accumulated to the corresponding neighboring points in the output array. The symbol ⊙ denotes an element-wise multiplication.
Similarly, the 2D9P stencil can be identified by its coefficient matrix in scatter mode, C^s, as shown in Equation (4). C^s can be obtained by reversing the rows and columns of the gather-mode coefficient matrix C^g, as shown in the following equation.

C^s = J_{2r+1} · C^g · J_{2r+1}
Here J_{2r+1} is the reversal matrix, whose 1 elements reside on the anti-diagonal while all other elements are zero. It is a special case of a permutation matrix, and J_3 for the 2D9P stencil is shown as follows:
J_3 = [ 0 0 1 ; 0 1 0 ; 1 0 0 ]
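The reversal relation amounts to flipping the coefficient matrix along both axes, and the equivalence of the gather and scatter views is easy to check numerically (a NumPy sketch with illustrative names; np.flip(Cg) computes J · Cg · J):

```python
import numpy as np

def step_gather(A, Cg):
    """Gather mode: each interior output gathers its 3x3 neighborhood."""
    B = np.zeros_like(A)
    for i in range(1, A.shape[0] - 1):
        for j in range(1, A.shape[1] - 1):
            B[i, j] = np.sum(Cg * A[i-1:i+2, j-1:j+2])
    return B

def step_scatter(A, Cg):
    """Scatter mode: C^s = J Cg J, i.e. Cg reversed in rows and columns;
    each input point scatters Cs-weighted updates to its neighbors."""
    Cs = np.flip(Cg)
    B = np.zeros_like(A)
    for i in range(1, A.shape[0] - 1):
        for j in range(1, A.shape[1] - 1):
            B[i-1:i+2, j-1:j+2] += Cs * A[i, j]
    return B

rng = np.random.default_rng(1)
A = rng.random((8, 8))
Cg = np.arange(9, dtype=float).reshape(3, 3)  # deliberately asymmetric
G, S = step_gather(A, Cg), step_scatter(A, Cg)
# The two views agree wherever all contributions are present.
assert np.allclose(G[2:-2, 2:-2], S[2:-2, 2:-2])
```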
Since the outer product takes vectors as input, we first limit the scope to just one column of the output matrix and of the scatter-mode coefficient matrix. The following formula picks the middle column of B and of C^s from Equation (3).
The final formula, which completes the computation of an n × n subblock of B for a 2-dimensional stencil with order r, is developed from Equation (7) by three extensions: horizontally stretching the output from a 3 × 1 vector to a 3 × n matrix, gathering all the contributions to turn the accumulation arrow into an equals sign, and vertically stretching the involved 3 × 1 vectors to n × 1 vectors with a parameterized stencil order r. Combining these extensions, we obtain the final formula, which can be processed directly with a series of vector outer product operations.
Firstly, enlarging Equation (7) horizontally is identical to extending the vector-scalar product ⊙ to a vector outer product ⊗. It is achieved by simply stretching each element of the B vector and A_{i,j} horizontally to n elements along the j dimension, as shown in Equation (8). We can also derive this formula by grouping n contiguous updates of Equation (7). This formula can be processed directly by a single vector-vector outer product instruction.
Secondly, we seek to extend the accumulation arrow in Equation (7) to an equals sign. According to the stencil definition in the scatter mode, there are four other input elements, A_{i-2,j}, A_{i-1,j}, A_{i+1,j} and A_{i+2,j}, that should be scattered to the output vector (B_{i-1,j}, B_{i,j}, B_{i+1,j})^T with corresponding coefficient vectors containing c_{0,1}, c_{1,1} and c_{2,1}. The following formula summarizes all these contributions, and we refer to this summation as the coefficient line summation S(*, 1), where (*, 1) indicates all the coefficients in the second column of C^s. The accumulation arrow in Equation (7) can then be replaced with an equals sign by summing over all coefficient lines, i.e., S(*, 0) + S(*, 1) + S(*, 2).
Finally, we adapt Equation (7) to a vector of size n by filling zeros into the coefficient vector and successive elements below B_{i,j} into the output vector. The reason is that in the 2D9P stencil, each input element of A is scattered to at most 3 output elements with one coefficient line. In addition, we adapt Equation (7) to a 2-dimensional box stencil with parameterized order r. It matches a (2r+1) × (2r+1) coefficient matrix and consequently yields 2r+1 coefficient lines with 2r+1 weights in each line. Similar to Equation (7), A_{i+l-r,j} is scattered to B_{i,j} with c_{2r-l,1} for 0 ⩽ l ⩽ 2r. The combination of the above two adaptations leads to the following formula.
To simplify the description of the final formula, we introduce the outer product coefficient matrix C^o, as shown in Equation (11). C^o expands the scatter-mode coefficient matrix C^s by adding two zero matrices of size (n-1) × (2r+1) below and above it. For example, the second column of C^o is the vector (0, ..., 0, c_{2,1}, ..., c_{0,1}, 0, ..., 0)^T. According to Equation (9), every coefficient vector that participates in the computation is a contiguous subsequence of such a column vector.
Putting it all together, the final formula is shown in Equation (12). The outer summation with index y corresponds to the 2r+1 columns of C^o, or equivalently, the coefficient lines of C^s. For each coefficient line, the inner summation with index s gathers all 2r+n related vectors from the input matrix A and from the matrix C^o.
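Equation (12) can be prototyped directly in NumPy, with np.outer standing in for the hardware outer-product instruction. The loop indices, the zero-padding layout of C^o and the helper name below follow our own derivation of the scatter mode and are one possible realization rather than a verbatim transcription of the equation:

```python
import numpy as np

def box_stencil_outer(A, Cs, i, j, n):
    """Compute the n x n subblock B[i:i+n, j:j+n] of a 2D box stencil
    (scatter-mode coefficients Cs, order r) with vector outer products,
    following the coefficient-line decomposition."""
    r = Cs.shape[0] // 2
    # Outer-product coefficient matrix C^o: pad Cs with (n-1) zero rows
    # above and below; every coefficient vector used below is a
    # length-n window of one of its columns.
    Co = np.vstack([np.zeros((n - 1, 2*r + 1)),
                    Cs,
                    np.zeros((n - 1, 2*r + 1))])
    B = np.zeros((n, n))
    for y in range(2*r + 1):            # one coefficient line per column of Cs
        for s in range(n + 2*r):        # 2r + n outer products per line
            p = i - r + s               # input row feeding this outer product
            a = A[p, j - y + r : j - y + r + n]                  # contiguous input vector
            c = Co[n - 1 + 2*r - s : 2*n - 1 + 2*r - s, y]       # coefficient vector
            B += np.outer(c, a)         # one FMOPA-like accumulation
    return B

rng = np.random.default_rng(0)
A = rng.random((12, 12))
Cs = rng.random((3, 3))                 # scatter-mode coefficients, r = 1
blk = box_stencil_outer(A, Cs, i=4, j=4, n=4)
```

Each np.outer call corresponds to one outer-product instruction accumulating into the same matrix register, 2r+n of them per coefficient line.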

Adapt to Various Stencils
The essential concept underlying the basic formula shown above is the coefficient line. The rest of this section discusses its adaptation to various stencils and presents a theoretical analysis.
High-dimensional stencils. For higher-dimensional box stencils, the coefficients in the scatter mode form a tensor, and they can similarly be grouped into a set of coefficient lines. Take the 3D27P stencil as an example, where the coefficients are c_{x,y,z} (x, y, z = 0, 1, 2): it consists of 9 coefficient lines S(x, *, z). Adapting Equation (12) to the 3D27P stencil is straightforward. The output matrix is B_{i, j:(j+n-1), k:(k+n-1)} and the input vectors of A are A_{i', j', k':(k'+n-1)}, where i-r ⩽ i' ⩽ i+r, j-r ⩽ j' ⩽ j+n+r-1 and k-r ⩽ k' ⩽ k+r. The detailed formula is similar and is thus omitted.
Star stencils. A star stencil can be regarded as a box stencil by setting the weights that are absent in the star stencil to zero. The following formula presents the scatter-mode coefficient matrix of the 2D5P star stencil:
C^s = [ 0 c_{0,1} 0 ; c_{1,0} c_{1,1} c_{1,2} ; 0 c_{2,1} 0 ]   (13)
Therefore it can directly employ Equation (12) to perform the stencil computation.
However, the first and third coefficient lines contain only one nonzero weight, so each vector outer product in these coefficient line summations degrades to a scalar-vector multiplication. Thus it does not exploit the O(n^2) arithmetic capability of vector outer products and incurs extra overhead for organizing the coefficient vectors.
The aforementioned three extensions and the final formula target the middle column of the coefficient matrix. However, the concept of the coefficient line is not restricted to a specific dimension. It is equally legal to extract the second row of C^s in Equation (13), and those extensions can similarly be applied to (B_{i,j-1}, B_{i,j}, B_{i,j+1}) += A_{i,j} ⊙ (c_{1,2}, c_{1,1}, c_{1,0}). Note that this coefficient line lies along the j dimension, so the vector extension of A_{i,j} should be along the i dimension. Equation (14) provides the first extension. The remaining derivations are similar, and the final formula is likewise omitted.
The final formula for 2D star stencils can be obtained by merging the two accumulations, i.e., S(*, 1) + S(1, *), and it can be expressed with a single equation. However, for 3- and higher-dimensional star stencils, the accumulations of all coefficient lines cannot be merged into a single formula. A detailed discussion is provided in Section 4.1.
Other Stencils. Although this work mainly discusses common stencils of star or box shape, the concept of the coefficient line is flexible enough to handle other irregular shapes. The following coefficient matrix has nonzero weights only on the main diagonal and the anti-diagonal. Thus it only requires two coefficient line summations, one along each diagonal.
For the anti-diagonal, the initial equation in the scatter mode is shown in Equation (16). Note that the elements of the output vector, i.e., B_{i-1,j+1}, B_{i,j} and B_{i+1,j-1} in the output matrix B, lie along the same direction as the anti-diagonal. The extensions are similarly applicable to Equation (16), and the final formula is omitted.

Analysis
One outer product performs O(n^2) multiply-add operations, so we expect a 1/n decrease in the instruction count. Since the vector unit also provides fused multiply-add instructions, it is reasonable to count only the multiplication operations. For 2D box stencils, the total number of multiplications is N^2 × (2r+1)^2, where N^2 is the problem size and (2r+1)^2 is the number of coefficients. With vectorization, each instruction performs n multiplications simultaneously, so the vector instruction count is divided by n, i.e., N^2 × (2r+1)^2 / n. According to Equation (12), each n × n subblock incurs (2r+1) × (2r+n) outer product operations, and there are N^2/n^2 subblocks of the output matrix in total. Thus the total number of outer product operations is N^2 (2r+1) × (2r/n + 1) / n, and the average instruction count per output vector and coefficient line decreases from 2r+1 to 2r/n + 1 compared with vectorization.
This decrease can also be derived from the perspective of the coefficient line. Each line contains 2r+1 weights. Using the outer product, Equation (12) shows that one line leads to 2r+n coefficient vectors for processing n row vectors of the output matrix B. Therefore, the 2r+n outer products of one line amortize to (2r+n)/n = 2r/n + 1 instructions per output vector, instead of one instruction per weight (2r+1 in total) under vectorization. This interpretation also applies to higher-dimensional box stencils.
For d-dimensional star stencils, the number of nonzero weights is 2·d·r + 1, while it is (2r+1)^d for box stencils. Each coefficient line of the star stencil still incurs 2r+n outer products, and there are d lines in total. The average number of outer products is d(2r+n)/n per output vector, compared with 2·d·r + 1 vectorization instructions. Thus, per coefficient line, the decrease is from 2r + 1/d to 2r/n + 1. This decrease is smaller than for box stencils because the middle weight, e.g., c_{1,1} in C^s (Equation (13)) for the 2D5P star stencil, exists in only one coefficient line.
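These closed-form counts can be sanity-checked with a few lines of arithmetic (the parameter values are illustrative):

```python
# Illustrative parameters: problem size N^2, vector length n, order r.
N, n, r = 1024, 16, 2

# 2D box stencil: vectorized baseline vs. outer-product scheme.
vec_insts = N**2 * (2*r + 1)**2 // n
outer_insts = (N**2 // n**2) * (2*r + 1) * (2*r + n)

# Per output vector (there are N^2/n of them) and per coefficient
# line, the count drops from 2r+1 to 2r/n + 1.
per_vec_line = outer_insts / (N**2 / n) / (2*r + 1)
assert per_vec_line == 2*r / n + 1
assert vec_insts / outer_insts == (2*r + 1) / (2*r / n + 1)

# d-dimensional star stencil with d coefficient lines.
d = 3
star_outer_per_vec = d * (2*r + n) / n     # outer-product scheme
star_vec_per_vec = 2*d*r + 1               # one FMA per nonzero weight
# Per line: from 2r + 1/d down to 2r/n + 1.
assert abs(star_vec_per_vec / d - (2*r + 1/d)) < 1e-12
assert star_outer_per_vec / d == 2*r / n + 1
```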

Minimal Cover with Axis-parallel Coefficient Lines
The final formula indicates that optimal execution in terms of instruction count requires a minimal set of coefficient lines covering all the nonzero weights. Note that the weights correspond to points with integer coordinates in a Euclidean space. Megiddo and Tamir [39] proved that the problem is NP-hard if the coordinates of the points are relaxed to rationals. When the covering lines are restricted to axis-parallel ones, the problem is reducible to minimum vertex cover in a graph. We first consider 2-dimensional stencils, where the graph is bipartite. In a bipartite graph G, the vertex set can be divided into two sets U and V that are disjoint, U ∩ V = ∅, and complete, U ∪ V contains every vertex of G. An edge in the edge set E of G only connects a vertex of U to a vertex of V. Thus a bipartite graph can be denoted G(U, V, E). A vertex cover of a graph is a set of vertices such that every edge has at least one endpoint in the set. For bipartite graphs there exist polynomial-time solutions, e.g., the Hungarian algorithm, based on König's theorem [30].
Converting the minimal covering lines problem of a 2-dimensional stencil to the minimum vertex cover of a bipartite graph is achieved by interpreting the coefficient matrix of the stencil as the adjacency matrix of the bipartite graph G(U, V, E). The vertex set U contains a vertex u_x if and only if there exists at least one y ∈ {0, ..., 2r} such that c_{x,y} is nonzero. Similarly, the vertex set V contains a vertex v_y if and only if there exists at least one x ∈ {0, ..., 2r} such that c_{x,y} is nonzero. Every nonzero coefficient c_{x,y} is mapped to an edge e_{x,y} that connects u_x to v_y. The minimum vertex cover S of G(U, V, E) then corresponds to a minimal set of covering lines: each u_x ∈ S implies a horizontal coefficient line and each v_y ∈ S a vertical one.
We present an example to illustrate the conversion. Equation (17) shows the scatter-mode coefficient matrix of a stencil, and the corresponding bipartite graph G is plotted in Figure 1a. The vertex set U of G consists of five nodes u_0, u_1, u_2, u_3, u_4, each corresponding to a row of the stencil coefficient matrix C^s. Similarly, the vertex set V contains v_0, v_1, v_2, v_3, v_4, each representing a column of C^s. The edge set E comprises 13 edges, each connecting a u_x node with a v_y node and corresponding to one of the 13 nonzero coefficients in C^s. Optimal solutions of the minimum vertex cover problem of this graph contain four vertices: either u_2, v_1, v_2, v_3 or v_2, u_1, u_2, u_3. The four shaded vertices in Figure 1a illustrate the first solution; the union of the edges incident to these four vertices forms the edge set E. In terms of our coefficient line cover problem, the vertex cover solution is the minimum set of coefficient lines covering all coefficients of the coefficient matrix, as depicted in Figure 1b, where each black dot corresponds to a nonzero coefficient.
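For 2-dimensional stencils the whole reduction — a maximum bipartite matching by Kuhn's augmenting paths, followed by König's construction of the cover — fits in a short routine (a plain-Python sketch; the function name and return convention are our own):

```python
import numpy as np

def min_line_cover(C):
    """Minimum set of horizontal/vertical coefficient lines covering
    all nonzeros of C, via maximum bipartite matching and Koenig's
    theorem (rows form one vertex part, columns the other)."""
    R, Q = C.shape
    adj = [[y for y in range(Q) if C[x, y] != 0] for x in range(R)]
    match_col = [-1] * Q                      # column y -> matched row x

    def augment(x, seen):
        for y in adj[x]:
            if not seen[y]:
                seen[y] = True
                if match_col[y] == -1 or augment(match_col[y], seen):
                    match_col[y] = x
                    return True
        return False

    for x in range(R):
        augment(x, [False] * Q)

    # Koenig: alternating BFS from unmatched rows; the cover consists
    # of the unreached rows plus the reached columns.
    matched_rows = set(match_col) - {-1}
    reached_r = {x for x in range(R) if x not in matched_rows}
    reached_c = set()
    frontier = list(reached_r)
    while frontier:
        nxt = []
        for x in frontier:
            for y in adj[x]:
                if y not in reached_c:
                    reached_c.add(y)
                    x2 = match_col[y]
                    if x2 != -1 and x2 not in reached_r:
                        reached_r.add(x2)
                        nxt.append(x2)
        frontier = nxt
    h_lines = [x for x in range(R) if x not in reached_r]
    v_lines = sorted(reached_c)
    return h_lines, v_lines

# 2D5P star stencil: the optimal cover is one horizontal plus one
# vertical line through the center.
C = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]])
h, v = min_line_cover(C)
assert (h, v) == ([1], [1])
```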
For higher-dimensional stencils, the conversion is similar to that of 2-dimensional stencils, and the graph is a d-partite d-uniform hypergraph. A d-partite d-uniform hypergraph H = (V, E) consists of a set of vertices V and hyperedges E, where V can be partitioned into d parts such that every hyperedge in E contains exactly one vertex from each part. Finding a minimum vertex cover of a d-partite d-uniform hypergraph is an NP-hard optimization problem for d ⩾ 3. One can employ approximation algorithms [35] to obtain sub-optimal solutions.

IMPLEMENTATION
It is straightforward to code the stencil program with the final formulas developed above. However, the program's performance is often bound by data accesses. Therefore, although the above analysis shows a reduced number of vector outer product operations, the formulas may require a larger number of memory references for output vectors, or inferior memory access patterns for input vectors. This section discusses several optimizations of memory references. We rearrange the formula to find a balance between the number of outer products and the memory reference pattern, and we reschedule the operations of the formula to obtain an efficient data reuse pattern with multi-dimensional register tiling.

Data Access Pattern
Although an optimal coefficient line cover leads to the minimal outer product instruction count, the memory data references may be inferior. Take the 2D5P star stencil as an example: for calculating S(*, 1), as shown in Equation (8), the elements of the input vector (A_{i,j}, ..., A_{i,j+n-1}) are contiguous in memory. On the contrary, Equation (14) indicates that the input vector in the S(1, *) calculation contains strided elements. Consequently S(1, *) requires vector gather loads, which are memory-inefficient.
Furthermore, the output subblocks of matrix B can be configured with the same shape for the 2D5P star stencil, i.e., n × n for both coefficient lines. But for higher-dimensional star stencils, B has to be accessed multiple times with three or more orthogonal coefficient lines. The reason is that the in-register subblock shape must contain the direction of the coefficient line. Take the 3D7P star stencil as an example: for calculating S(*, 1, 1) the matrix shape must contain the leftmost dimension specified by * in S(*, 1, 1). Therefore, it is legal to store either B_{n×1×n} or B_{n×n×1} in the matrix register, with the input vector A_{1×1×n} or A_{1×n×1}, respectively. It is then easy to see that we cannot complete the update with one matrix shape for three or more coefficient line summations.
In sum, there is a trade-off between the number of outer product instructions and the memory access pattern. Specifically, more orthogonal coefficient lines result in more inefficient data accesses for the input vectors of A and additional references to the output subblocks of B, which may outweigh the benefit of fewer vector outer products. Since star stencils can also be regarded as box stencils, as mentioned before, the nonzero weights can also be covered by parallel coefficient lines. We provide reasonable coefficient line cover options and the corresponding theoretical analysis for star stencils.
Table 1 lists two options, parallel and orthogonal, for the 2D star stencil with order r. The parallel option resembles the 2D box stencil with order r and includes 2r+1 coefficient lines S(*, y) (y = 0, ..., 2r); Figure 2(a) illustrates this for the 2D5P stencil (r = 1). The middle line S(*, r) contains 2r+1 weights and consequently generates 2r+n vector outer products. All other coefficient lines have only one nonzero weight and produce n vector outer products each. Every input vector of A consists of contiguous elements A_{1×n}. In the orthogonal option, there are only two coefficient lines, and each one leads to 2r+n vector outer products; Figure 2(b) illustrates the 2D5P stencil. The input vector for S(r, *) has the form A_{n×1}, which contains non-contiguous elements.
For 3D star stencils, the parallel and orthogonal options can easily be obtained by extending those of 2D star stencils. As shown in Table 2, the numbers of coefficient lines are 4r+1 and 3 for the parallel and orthogonal options, respectively. All input vectors of A in the parallel option are contiguous accesses, and the output matrix can keep the same shape B_{1×n×n}. In the orthogonal option, S(r, r, *) leads to strided accesses to A, and S(*, r, r) demands B_{n×1×n}, which requires an extra matrix reorganization.
The non-contiguous input vector, A_{n×1} in 2D stencils or A_{1×n×1} in 3D stencils, can be loaded by a gather instruction, which is inefficient. Since we always ensure that a subblock of B is updated both by non-contiguous vectors A_{n×1} or A_{1×n×1} and by contiguous vectors A_{1×n} or A_{1×1×n}, we can employ matrix registers to produce the non-contiguous input vectors by a matrix transpose, i.e., vector-to-matrix moves in rows and matrix-to-vector moves in columns.
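The transpose trick can be sketched by treating an n × n array as the matrix register (a NumPy sketch; the block indices are illustrative):

```python
import numpy as np

n = 4
A = np.arange(64, dtype=float).reshape(8, 8)

# Vector-to-matrix moves: fill the tile rows with contiguous loads.
tile = np.empty((n, n))
for row in range(n):
    tile[row, :] = A[2 + row, 3:3 + n]     # contiguous row vectors

# Matrix-to-vector moves: read tile columns to obtain the strided
# A_{n x 1} vectors without issuing gather loads.
col = tile[:, 1]
assert np.array_equal(col, A[2:2 + n, 4])  # same data as a strided load
```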
The orthogonal option for 3D stencils requires two output shapes, B_{1×n×n} and B_{n×1×n}, and thus has to perform redundant memory references to B. Hence the orthogonal option is expected to beat the parallel option only for stencils with large orders. To find a balanced solution, Figure 2(c) illustrates a hybrid option, whose features are listed in the last row of Table 2. This option completes the update with a single subblock shape B_{1×n×n}. Its access pattern for the input vectors is similar to that of the orthogonal option, and its output subblock access to that of the parallel option. Its number of outer products lies between those of the two options.

Multi-dimensional Register Tiling
Register tiling is a classic optimization technique for reducing loop overhead and improving in-core data reuse. For the proposed algorithm, the tiling granularity is the subblock of B. As contiguous memory accesses are more efficient than non-contiguous ones, we prefer to tile along the unit-strided dimension. Figure 3(a) illustrates the tiling for 3D stencils with a tiling factor of 4: the four contiguous blue subblocks along the k dimension are calculated and written to memory together with a set of matrix registers. Due to its low operational intensity, the stencil computation is often regarded as a memory-starving kernel. Since neighboring elements lie along all dimensions, it is desirable to exploit more in-core data reuse along other directions. Figure 3(b) shows the tiling along the i dimension with a factor of 4: an element may be scattered to all its neighbors while being loaded into a vector register only once.
According to the formula, a single outer product scatters an input vector to neighboring rows of the matrix register. It therefore already constitutes an implicit tiling: an input vector may be reused along one further dimension in both 2D and 3D stencils. We do not tile this dimension any further, as we observed no performance improvement from doing so for either 2D or 3D stencils.
Generally, loop tiling for in-core data reuse is applied only to the innermost loop, i.e., the unit-strided dimension. Given its dependence pattern, however, the stencil computation benefits from data reuse over all dimensions; for example, cache-level tiling techniques often block stencil computations along multiple space dimensions. We are not aware of any multi-dimensional register tiling method for stencil computations, the reason being that register resources are scarce compared with the large cache capacity. Fortunately, with additional matrix registers capable of holding two-dimensional subblocks, multi-dimensional register tiling becomes feasible, as Figure 3 illustrates.

Outer Product Scheduling
With multi-dimensional register tiling, several subblocks of the output are held in matrix registers. The naive scheme that updates them one after another is inefficient. First, computing a single matrix tile with a sequence of consecutive outer products creates read-after-write dependences between adjacent instructions; simply processing the matrix registers one by one does not remove these conflicts and merely relies on hardware scheduling of a long instruction stream. Second, according to Equation (12), every matrix tile uses the same set of coefficient vectors, which is hard to reuse without a carefully constructed software instruction pipeline. Take the 2D9P box stencil as an example: its three coefficient lines together require a number of coefficient vectors proportional to the tiling factor. Relying only on dynamic scheduling fails to reuse them across subblocks, so they must be reloaded for each subblock.
Furthermore, each outer product actually scatters an input vector to multiple rows of an output subblock, i.e., to all of its neighbors along one dimension for 3D box stencils. Similarly, one input vector may also be reused by adjacent output subblocks along the tiled dimension. The naive scheme fails to exploit this type of reuse as well and incurs multiple loads of the same input vector.
To address these problems, we reschedule the outer products of all in-register subblocks. Since a coefficient vector corresponds to input elements from a single plane of the input array, we load all needed input vectors within that plane and scatter them to all matrix registers. The coefficient vector storage can then be reused and refilled for the update with the next plane. For the reuse of input vectors across tiles, it is desirable to update all other neighbor output subblocks consecutively after obtaining an input vector.
In sum, our scheduling groups the outer products by their input vectors: each loaded input vector is first scattered across the tiled dimension, input vectors are assembled plane by plane, and the registers holding coefficient vectors are reused as much as possible across planes. The arrows in the tiled blue subblocks in Figure 3 illustrate this scheduling.
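The grouping can be sketched as follows (a minimal NumPy model of our own, with assumed names, restricted to a single coefficient line): all tiles stay live, and each input row is loaded once and immediately scattered by outer products into every tile whose row range it touches.

```python
import numpy as np

# tiles[t] accumulates output rows [i0 + t*m, i0 + (t+1)*m);
# c holds the 2r+1 coefficients of one coefficient line.
def scheduled_update(A, c, tiles, i0, m):
    r = len(c) // 2
    T = len(tiles)
    for p in range(i0 - r, i0 + T * m + r):  # each input row loaded once
        x = A[p, :]
        for t in range(T):                   # ...and reused by every tile
            base = i0 + t * m
            u_lo = max(0, p - base - r)
            u_hi = min(m - 1, p - base + r)
            if u_lo > u_hi:
                continue                     # row p does not touch tile t
            cv = np.array([c[p - base + r - u] for u in range(u_lo, u_hi + 1)])
            tiles[t][u_lo:u_hi + 1, :] += np.outer(cv, x)  # one scatter

A = np.arange(48.0).reshape(6, 8)
c = np.array([1.0, 2.0, 1.0])                # order-1 coefficient line
tiles = [np.zeros((2, 8)) for _ in range(2)]  # two in-register tiles
scheduled_update(A, c, tiles, 1, 2)
```

After the call, each tile row equals the weighted sum of its 2r+1 neighbor input rows, yet every input row was read from `A` exactly once.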
The final problem is the scheduling along the unit-strided dimension. One well-known problem induced by the vectorization of stencils is the data alignment conflict: contiguous vector loads place the same value at different positions of different vectors. In this work, we adopt the data reorganization method. All the needed elements of the input array are loaded into vector registers once, and all the input vectors involved in outer products are derived through inter-register assembling. Figure 3(a) presents an example where the two loaded vectors (a, b, c, d) and (e, f, g, h) are combined to obtain the needed input vector (b, c, d, e).
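This inter-register assembling is a simple shift-and-combine, sketched below (hypothetical helper name; NumPy concatenation stands in for the vector permute instruction):

```python
import numpy as np

# v0 holds elements [p, p+k), v1 holds [p+k, p+2k); the result holds
# the unaligned window [p+s, p+s+k) for 0 <= s <= k, assembled from
# two aligned loads instead of an unaligned or redundant load.
def shifted_vector(v0, v1, s):
    return np.concatenate((v0[s:], v1[:s]))

v0 = np.array([1.0, 2.0, 3.0, 4.0])   # stands for (a, b, c, d)
v1 = np.array([5.0, 6.0, 7.0, 8.0])   # stands for (e, f, g, h)
shifted = shifted_vector(v0, v1, 1)   # stands for (b, c, d, e)
```

Every shifted input vector of the tile is derived this way from vectors already resident in registers.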

Put It All Together
Algorithm 1 shows the pseudo-code for the 3D box stencil of order r. The first nested loop in Lines 1 to 3 sets the multi-dimensional tile of matrix registers to zero. Without loss of generality, we assume the updated elements form the first group of subblocks of the output array.
Following the scheduling scheme discussed above, the middle nested loop in Lines 4 to 15 assembles the coefficient vectors and input vectors and performs the outer products. The outermost loop in Line 4 iterates over all related planes. Inside each iteration, it first assembles the (2r+1)² coefficient vectors for the current plane in Lines 5 to 7; we omit the details, which follow Equations (11) and (12). All subsequent outer products in this iteration multiply only these coefficient vectors with the input elements of the plane. The inner loop in Line 8 iterates over all indices of the second dimension, and each iteration loads all needed elements of the current row into vectors. The next double-nested loop in Lines 10 and 11 assembles every involved input vector by the data reorganization of the loaded vectors in Line 12. Each input vector is scattered to all subblocks in the innermost loop in Line 13, provided the target row index falls into the valid range. Finally, the last nested loop in Lines 16 to 18 writes all subblocks back to the output array.
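The overall structure can be condensed into a 2D reference sketch (our own simplified analogue of Algorithm 1, not the generated kernel; names and shapes are assumptions): zero a tile, scatter every shifted contiguous input segment into it with one outer product per coefficient vector, then the tile holds the finished output subblock.

```python
import numpy as np

# 2D box stencil in the scatter, outer-product style. W is the
# (2r+1) x (2r+1) coefficient matrix; the tile covers output rows
# [i0, i0+m) and columns [j0, j0+k).
def box_stencil_tile(A, W, i0, j0, m, k):
    r = W.shape[0] // 2
    tile = np.zeros((m, k))                         # zeroed "matrix register"
    for p in range(i0 - r, i0 + m + r):             # input rows touching tile
        for b in range(2 * r + 1):                  # column shifts
            x = A[p, j0 + b - r : j0 + b - r + k]   # contiguous input vector
            u_lo = max(0, p - i0 - r)               # output rows x feeds
            u_hi = min(m - 1, p - i0 + r)
            cv = np.array([W[p - i0 + r - u, b] for u in range(u_lo, u_hi + 1)])
            tile[u_lo:u_hi + 1, :] += np.outer(cv, x)  # scatter by outer product
    return tile

A = np.arange(64.0).reshape(8, 8)
W = np.full((3, 3), 1.0)                            # order-1 box, unit weights
tile = box_stencil_tile(A, W, 2, 2, 3, 4)
```

With unit weights, each tile entry equals the sum of its 3 × 3 input neighborhood, matching the gather-mode definition of the box stencil.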
Algorithms for other stencil types and dimensions are similar. We also implemented a code generator in Python. It accepts the stencil type, the coefficient line option, and the register tiling factors, and automatically emits the final code. Note that the generated code keeps only the loops in Lines 4 and 8 and fully unrolls all other loops, so the branch in Line 14 is eliminated.

EVALUATION

Methodology
Our algorithm is denoted STOP (STencil with vector Outer Product). We conducted the experiments on a proprietary ARMv9-A simulator whose key parameters are configurable. The number of outer product units is set to 1. The vector length is set to 512 bits, holding 8 double-precision floating-point numbers, and the matrix register size is 8 × 8. There are 32 vector registers and 8 matrix registers. Resembling the Kunpeng 920 CPU, the memory hierarchy consists of a 64 KB L1 data cache, a 64 KB instruction cache, and a 512 KB private L2 cache. We tested box and star stencils in 2D and 3D with various orders in double precision. The compiler is Bisheng 2.1.0, which is based on clang version 12.0.0.

Results
Figure 4 shows the performance of star stencils with different coefficient line options. The parallel option obtains the best performance for order 1 in all cases. However, the orthogonal and hybrid curves are flatter than the parallel curve as the order increases, owing to the faster growth of the number of outer products for the parallel option: according to Tables 1 and 2, the count grows linearly with the order for the parallel option but stays constant for the orthogonal option. Similarly, the parallel option also incurs more data transfers than the orthogonal one as the order increases.
Figures 4a and 4b show the performance for 2D star stencils with the in-cache problem size 64² and the out-of-cache size 512², respectively. In both cases, the orthogonal option is faster for orders larger than 1. For the out-of-cache problem size, the orthogonal curve is even flatter than for the in-cache size, because the growing order has little influence on memory performance: with the order increased by 1, each output subblock requires two more boundary rows, and since the L1 cache can hold 16 rows of 512 double-precision floating-point numbers, the memory transfer volume does not rise. Thus we expect little performance change.
For 3D star stencils, the orthogonal curve in Figure 4d is not as flat as in the 2D case in Figure 4b. The reason is that the L1 cache can only hold two 64² planes plus some boundary planes, so a higher order requires more L2 cache or memory accesses. For the large problem size, the hybrid option beats the other two at high orders. The reason is similar: the other options incur additional data reorganizations of the output subblocks, and this extra overhead is significant for out-of-cache data.
Figure 5 shows the effectiveness of the multi-dimensional register tiling and instruction scheduling optimizations. Although register tiling alone appears to have limited effect in all cases, the instruction scheduling is built on top of it. All star stencils adopt the best coefficient line options indicated in Figure 4. For 2D stencils, register tiling applies only along the contiguous dimension. The register tiling factors are 8 for the parallel option (all stencils with order 1) and 4 for the orthogonal option (star stencils with higher orders). These options are also listed in brackets in Table 4, where a suffix of 8 denotes the parallel coefficient line with tiling factor 8 and a suffix of 4 denotes the orthogonal coefficient line with tiling factor 4. All box stencils employ the parallel option and are thus omitted.
For the small problem size that fits in the cache, Figures 5a and 5c show substantial improvements for all stencils. Since the instruction scheduling optimizes the data access and data reuse pattern and box stencils generate more input vectors, star stencils achieve smaller speedups than box stencils. For the large problem size, Figures 5b and 5d exhibit similar trends. A performance model is desired to determine the optimal option and enable a detailed analysis; we leave this as future work.
Figure 6 shows the performance of the compiler's automatic vectorization, DLT [22], temporal vectorization (TV) [63], and STOP on 2D and 3D stencils with order 1. It also includes two more problem sizes that fit in cache: 128² and 256² for 2D, and 16³ and 32³ for 3D. Since box stencils offer more data reuse than star stencils, STOP achieves more pronounced improvements for them on in-cache problem sizes. Table 4 lists the speedups over auto-vectorization for all stencils. STOP achieves the best performance on all 3D stencils. The TV method reduces the memory volume to as little as a quarter and thus obtains better improvements for the out-of-cache problem size on some 2D stencils. The DLT speedups are consistent with the values in [22]. TV shows limited or negative speedups for 3D stencils, as reported in [63].
Overall, the speedups of all methods increase with the order for all box and star stencils in 2D and 3D. For 2D and 3D box stencils, the speedups range from 3 to 5. The 2D star stencils achieve smaller speedups of around 1.5×, whereas the 3D star stencils obtain better improvements thanks to the reuse along the plane dimension. Regarding the optimizations, all star stencils adopt the same coefficient line options as in Figures 4a and 4c for the middle two problem sizes, and for 2D stencils the register tiling factors also keep the same values. All 3D stencils adopt multi-dimensional register tiling except the 3D star stencils on the largest problem size, which adopt the hybrid coefficient line option. With STOP, all stencils obtain their maximal speedups at the smallest problem size, except the 3D27P box stencil, which gets a speedup of 1.93×; the reason may be the small total problem size, and at 16³ it delivers a better improvement of 3.85×. For problem sizes beyond those in Table 4, we can expect similar speedups for STOP by applying a blocking scheme with comparable block sizes.
Table 3 shows the cache miss rates of different types of stencils with different problem sizes. The first three rows show the cases where the data size exceeds the cache size. For 2D box stencils, the cache miss rate decreases as the order increases: although the number of memory accesses grows quadratically with the order, the newly accessed data largely overlaps with data that may already reside in the cache. For star stencils, by contrast, the number of memory accesses grows only linearly with the order. The last three rows show the results for smaller problem sizes; they exhibit similar trends, and all the miss rates are smaller than those of the larger problem sizes. The results are also consistent with the STOP performance. For example, the cache miss rates for the orthogonal option in Figure 4b are 40%, 41%, and 51%, while those in Figure 4d are 52%, 68%, and 70%; hence the first curve is flatter.

RELATED WORK
To improve performance, tiling and vectorization are the two most powerful transformation approaches. Tiling [27,32,38,57,58,61] aims to exploit the data locality and parallelism of multiple loop nests. A very large number of tiling methods exist for the stencil computation, e.g., overlapped tiling [14,24,31,40,42,46-48], time skewing [2,3,28,49,60], diamond tiling [6,8,20,45], cache-oblivious tiling [15,16,52,54], split tiling [10,19,23,31,59,68] and hybrid tiling [18,37,53]. This work is orthogonal to tiling techniques. According to the experimental results, it is desirable to reuse data blocks over several time steps and achieve a higher speedup with in-cache data; a combination of the two techniques is our future work. Universal vectorization techniques [1,21,29,33,43,50,62,66] are extensively studied by the compiler community. For stencil computations, production compilers often simply load all the needed vectors straightforwardly from memory, which incurs the well-known data alignment conflict. The milestone approach addressing the data alignment conflict for stencil computations is the Dimension-Lifting Transpose method [22]. It places the points with read-read dependences in the same position of different vectors. However, the transpose is hard to implement in place, so the method often uses an additional array to store the transposed data.
Other sophisticated methods targeting stencil computations have also been proposed, namely the data reorganization vectorization [9,67] and the temporal vectorization [63]. The data reorganization vectorization loads each input element into a vector register only once and assembles the required vectors via inter-register data permutation instructions. This work utilizes the data reorganization technique [67] to reduce the memory references of the input array. The temporal vectorization [63] vectorizes the stencil computation in the iteration space and assembles points with different time coordinates in one vector. It requires only a small fixed number of vector reorganizations, independent of the vector length, stencil order, and dimension, and it is also applicable to Gauss-Seidel stencils.
Several works [34,56] have studied transforming the stencil computation into matrix computations. TCstencil [34] is the seminal work that adapts the stencil computation to matrix-matrix multiplications. However, its method is developed from the gather view of the stencil definition and only applies to two-dimensional stencils. Since TCstencil targets the Tensor Cores on GPUs, it is not amenable to the instruction scheduling optimization that reduces both the coefficient and input vector reorganizations. Furthermore, TCstencil does not map the entire stencil computation onto matrix units as we do, and it employs traditional vector arithmetic instructions for some calculations. Moreira et al. [41] studied convolution, a computation pattern similar to the stencil, on IBM's Matrix-Multiply Assist with outer product instructions. However, their formulation is also derived from the gather mode, and no experimental results are reported.

CONCLUSION
We have presented a new algorithm for stencil computation with vector outer products. It arises from the scatter view of the stencil definition and centers on the key concept of the coefficient line. A set of optimizations is further proposed to improve the memory reference pattern, the execution pipeline, and data reuse, including optimizing the data access pattern, multi-dimensional register tiling, and outer product instruction scheduling. Evaluation on a simulator shows that it achieves a maximal speedup of 4.71× over the vectorized algorithm.
In future work, we will design a comprehensive performance model to recommend optimization parameters tailored to each specific scenario. Furthermore, the current automatic code generation is limited to simple stencil codes driven by input parameters, such as the stencil type parameters (dimensions, shape, point count, etc.), optimization options like tiling factors and dimensions, and the coefficient line selections. We plan to further enhance the automatic code generation, specifically by integrating it with a domain-specific language.

Figure 1 :
Figure 1: Minimal Cover with Axis-parallel Coefficient Line

Figure 2 :
Figure 2: Coefficient line options for star stencils; (c) pictures a two-dimensional tiling over two space dimensions with both tiling factors equal to 2.

Figure 3 :
Figure 3: Multi-dimensional register tiling and outer product scheduling

Figure 4 :
Figure 4: Performance of star stencils with various coefficient line options

Table 1 :
Features of coefficient line options for 2D star stencils

Table 2 :
Features of coefficient line options for 3D star stencils

Table 3 :
Cache Performance