Compiling Structured Tensor Algebra

Tensor algebra is essential for data-intensive workloads in various computational domains. Computational scientists face a trade-off between the specialization degree provided by dense tensor algebra and the algorithmic efficiency that leverages the structure provided by sparse tensors. This paper presents StructTensor, a framework that symbolically computes structure at compilation time. This is enabled by Structured Tensor Unified Representation (STUR), an intermediate language that can capture tensor computations as well as their sparsity and redundancy structures. Through a mathematical view of lossless tensor computations, we show that our symbolic structure computation and the related optimizations are sound. Finally, for different tensor computation workloads and structures, we experimentally show how capturing the symbolic structure can result in outperforming state-of-the-art frameworks for both dense and sparse tensor algebra.


INTRODUCTION
Linear and tensor algebra operations are the key drivers of data-intensive computations in many domains, such as physics simulations [Martín-García 2008;Ran et al. 2020], computational chemistry [Hirata 2006;Titov et al. 2013], bioinformatics [Cichocki et al. 2009], and deep learning [Hirata 2003;Smith and Gray 2018].Due to their importance, many specialization attempts have been made throughout the entire system stack from hardware to software layers.The tensor accelerators [Hegde et al. 2019] and TPUs [Jouppi et al. 2017] provide efficient tensor processing at the hardware level.On the software level, there have been advances in providing highly-tuned kernels [Dongarra et al. 1990], as well as compilation frameworks that globally optimize tensor computations [Gareev et al. 2018].
There is a trade-off for using tensor algebra frameworks between the specialization degree and the flexibility of the structure such as sparsity (Figure 1).On the one side of the spectrum, extensive research has been done on dense tensors without leveraging any structure.Such tensors appear in deep neural networks and computational physics.As all the memory access patterns are known at compilation time, one can provide heavily-tuned implementations without any knowledge about the content of tensors.As a result, the high-performance engineer or compiler has enough reasoning power to make decisions on parallelization, vectorization, and tiling.
However, many real-world applications involve tensors that exhibit symmetries and, more generally, specific structures which can be exploited to significantly reduce computational costs.Sparse tensor algebra for instance only captures the pattern of zero/non-zero elements in tensors during runtime, and has been at the center of recent interest [Kjolstad et al. 2017;Strout et al. 2018].However, postponing the memory access patterns to the runtime hinders the specialization power of the compiler [Augustine et al. 2019].As a partial remedy, there have been efforts [Spampinato and Püschel 2016;Tang et al. 2020] to statically determine the structure of matrices during the compilation time.However, these are limited to fixed-size matrices.
To resolve the dilemma between using tensor algebra frameworks focusing on either dense or sparse tensors (Figure 1), this paper introduces StructTensor.StructTensor captures the structure of tensors symbolically at compilation time.On the one hand, the underlying compiler can use this symbolic information to specialize the code at the level of dense computations.On the other hand, the compiler can leverage this symbolic information in order to eliminate unnecessary and redundant computation.
In StructTensor, all tensor computations and structure information are translated to a single intermediate language called Structured Tensor Unified Representation (STUR).STUR propagates the structure knowledge throughout the computation at compile time.This is followed by efficient C++ code generation.
Specifically, we make the following contributions: • We present StructTensor, the first framework that supports structured computation for tensor algebra (Section 3).For a tensor  , the structure handled by StructTensor comes as a pair (  ,  ), where -  tracks the symbolic sparsity structure of the tensor, -  tracks the redundancy structure, which captures symmetries and repetition patterns.• We propose STUR, a unified intermediate representation (IR) that can express tensor computations.It allows capturing non-optimized tensors as well as their symbolic structures   ,  in a single IR (Section 4).• We show how StructTensor uses the structure information in its compilation process (Section 5).It leverages STUR and rewrites tensors in 3 steps (also see Figure 3): (1) StructTensor propagates the structure information on the syntax tree of tensor computation (Section 5.1) (2) Several optimizations are applied to the unified representation for tensor computations and their structure leading to computation over a compressed tensor (Section 5.2) (3) StructTensor generates C++ code for the tensor computation over the compressed form, as well as reconstructing the uncompressed tensor (Section 5.5) • We give a mathematical view on the problem of lossless tensor computations that we study in this paper, and show the soundness of our rewrites and structure inference in Section 6. • We experimentally evaluate StructTensor for different tensor computation workloads and structures (Section 7).We show that StructTensor can leverage the structure of tensors to generate computation over compressed tensors in order to outperform state-of-the-art dense and sparse tensor computation frameworks.

BACKGROUND
Structured Linear Algebra.There are several well-known structures for matrices, such as diagonal, symmetric, and lower/upper triangular, that considering them while doing calculations over matrices can reduce computational time.Figure 2 represents a set of linear algebra operations and well-known matrix structures.
The grammar for structured linear algebra operations and a subset of structure inference rules.
Example -Diagonal Matrices.Imagine the case of the Kronecker product between two diagonal matrices  × and  × .Typically, the Kronecker product of these matrices is defined as: where the result dimension is  × .Therefore the computational cost would be  ( 2  2 ).However, if the computation is structure-aware, one can leverage the fact that all non-zero elements are on the diagonal of matrices.Therefore, it is sufficient to perform the multiplication over only those elements, which results in the following computation: where the result is diagonal as well.As a result, the computational complexity for this Kronecker product reduces to  ().Furthermore, since the structure of the result is known (diagonal), this information can be used in further computations efficiently.Figure 2 shows a subset of inference rules to determine the output structure based on input structures.Sparsity and Redundancy Structures.Structures that distinguish zero and non-zero values are referred to as sparsity structures, such as diagonal and lower/upper triangular.Other structures like symmetric can capture redundancy patterns and are thus referred to as redundancy structures.
Because of the redundancy pattern, it is possible to reconstruct the whole matrix by storing half of the data even though they are non-zero.Knowing and propagating structures in many cases helps reduce computational costs [Spampinato and Püschel 2016].
Intricate Structures.The aforementioned structures cannot cover more complicated patterns.
Imagine the case of creating the covariance matrix for the polynomial regression degree two model. 1igure 4 shows the covariance matrix created for a polynomial degree two model on a data element with 4 features.Each number and color is associated with distinct elements, and elements with the same number have the same value.As it is represented in Figure 4, there are only 65 distinct elements even though the covariance matrix dimension is 20 × 20.The redundancy pattern in this matrix is sophisticated and cannot be captured using the existing structures.Utilizing such structure information enhances the performance of machine learning tasks that require covariance matrix creation such as training polynomial regression models.
Tensor Structures.The complicated structure of the mentioned covariance matrix can be captured in the form of a higher-order tensor structure.This covariance matrix contains all degree-2, degree-3, and degree-4 interactions between features in itself.The reason that redundancy happens is that these interactions are computed multiple times.For example, degree-2 interactions inside the  matrix form a symmetric structure.Degree-3 and degree-4 interactions have some complex forms of redundancy as well as being symmetric.These interactions cover all the elements inside the covariance matrix.By representing degree  interactions as a tensor of order-, the computation can be done more efficiently.This way, the only unique elements in degree two, three, and four tensors  2(, ), 3(, , ), 4(, , , ) This structure in the new data format forms a generalized symmetric pattern that is easy to capture and leads to efficient computation.In the following sections, it is explained how StructTensor handles sophisticated structures and generates efficient C++ code for the computations.

OVERVIEW
In this section, the overall architecture of StructTensor is described (cf. Figure 3).Input.StructTensor gets linear algebra and tensor algebra expressions and the structure of their parameters as input.These operations and structures subsume the ones that are mentioned in Figure 2. All tensor expressions, as well as their structures, are represented in the StructTensor unified representation, called STUR.Sparsity and redundancy structures are captured in a unique set and redundancy map, respectively.The unique set contains unique and non-zero elements, and the redundancy map has the mapping from redundant and non-zero elements outside the unique set to their corresponding value in the unique set.These two subsume SInfo and AInfo used in LGen [Spampinato and Püschel 2016]   Example -Polynomial Regression Degree-2.We consider the creation of a covariance matrix for polynomial regression degree-2.The covariance matrix for an element with  features is defined as: x This represents the outer product of x with itself, and thus can be represented using the vector outer product operation ⊗.The vector x is a vector resulting from the concatenation of the feature vector f with the vector resulting from the interaction of the features.In terms of linear algebra, this is expressed as: Here, ⊕ is the vector concatenation, ⊗ is the vector outer product, and  () flattens the matrix  of size  ×  into a vector of size  • .By inlining the definition of x and by using the identity for the distribution of vector concatenation and vector outer product, we have: The operations || and // correspond to matrix horizontal and vertical concatenation, respectively.StructTensor improves the performance for this computation in two levels of granularity.In a coarse-level of granularity, it detects that (f ⊗  (f ⊗ f)) and ( (f ⊗ f) ⊗ f) are computing the same elements but in a different layout.Thus, it only performs one of the computations and it would be sufficient to compute the following vector outer product terms: In a finer-level of granularity, for each of the terms 1, 2, 3, StructTensor detects a generalized symmetric structure.For 1 it detects a standard symmetric structure where it is sufficient to only keep the upper half of the matrix.For 2 and 3, there are ∼ 5× and ∼ 23× redundant elements, the patterns of which can be seen in Figure 4.After inferring such structures, StructTensor generates the following C++ code for each of these terms: Note that the nested iterations only cover a subset of the range.
In the next sections, we will thoroughly elaborate on how StructTensor works.

STUR: STRUCTURED TENSOR UNIFIED LANGUAGE
In this section, the syntax of the unified intermediate representation, STUR, is represented.We elaborately explain the grammar for the unique set, redundancy map, and compressed tensor computation in this section.Moreover, details of how structured linear algebra is represented in STUR can be found in this section.

The syntax of STUR
Grammar. Figure 5 shows STUR grammar which covers the grammar for tensor, unique set, and redundancy map computations.Each program () is made of several rules ().Each rule is in the form of an assignment from a body () to an access to a collection ().The collection could be a tensor ( ), compressed tensor (  ), unique set (  ), or redundancy map (  ), whereas the access index can be multiple index variables ( ), an index variable, or a constant value.The assignment body is represented as a sum of factor products ( ).Each factor restricts the domain of values that an index variable can have.This is achieved through a collection access or a comparison term.
Example -Simple Tensor Operation.Consider the following STUR program: This program consists of two rules, with two input tensors  1 and  2 .The first rule constructs the tensor  3 of order-2, i.e., a matrix, which is computed by performing an element-wise multiplication of the two input matrices.The second rule constructs an order-1 tensor, i.e., a vector, that contains the elements of the diagonal of the matrix  3 .This is achieved by (1) restricting the range of  and  by the comparison term  = , and (2) existentially quantifying over  by not including it in the head of the rule.Sum-of-Product Semantics.The addition and multiplication in STUR are defined based on the underlying collection; for unique sets and redundancy maps, the addition and multiplication are defined as set union and intersection, whereas for tensors and compressed tensors they are defined as real number addition and multiplication.Each rule can existentially quantify the free variables of the body by not including them in the rule head.This is known as marginalization in the AI community.All the free variables in the head should already be defined in the body (cf. Figure 6) Syntactic Sugar.To simplify the presentation, we consider the following syntactic sugar, where  corresponds to a list of arguments  1 , . . .,   : The sparsity structure of all the distinct elements of a tensor is encoded in a unique set.The grammar of STUR already captures the definition of unique sets (cf. Figure 5); a unique set is provided as a sum of products of comparison terms or access to other unique sets (and redundancy maps).Unique sets enhance performance by restricting index boundaries.Example -Chess Pattern Unique Set.The following STUR specifies the unique set of a matrix  of size  ×  with a chess-board pattern for the sparsity: 2) The first factor product specifies the elements of the even rows (that have an odd column index), and the second one specifies the ones for odd rows (with an even column index).Redundancy Maps.The remaining non-zero elements which do not appear in the unique set domain are covered by the redundancy map.A redundancy map keeps the association between these elements' indices and their corresponding indices in the unique set.Similar to the unique set, it is represented as a sum of products of comparison terms or access to other unique sets and redundancy maps.For a tensor of order-, the redundancy map has 2 index variables.The first  index variables correspond to the indices of the redundant element, whereas the second  one corresponds to the indices from the unique set.The redundancy map enables the capability of reconstructing the full matrix from the compressed version that only contains unique elements.Restricting the computations to the elements of the unique set and reconstructing the uncompressed final result once, when all the calculation is over, can significantly improve the performance.Example -Identical Row Matrix Redundancy Map.Consider a matrix  of size  × , where all rows are the same.The unique set and redundancy maps for this matrix are defined as follows: The first rule specifies that the unique elements are only in the first row ( = 0).The first term of the second rule (0 <  < ) restricts the range of redundant elements to the ones from all the rows except the first one.The last two terms specify the index of the corresponding element from the unique set by specifying the first row ( ′ = 0) and the same column as the redundant element ( ′ = ).Compressed Tensor.StructTensor leverages the structure for better performance by representing the tensor in a lossless compressed format.This is achieved by combining the original tensor where (, ) =  ( 2 ), Fig. 8. Representation of well-known matrix structures in STUR.In all cases, ⟦⟧=  and thus   and   correspond to its unique set and redundancy map, respectively.Furthermore, we assume that (, ) =  () for all cases and   (, ,  ′ ,  ′ ) := ∅ unless stated explicitly.
with the unique set in order to extract only the unique elements.The compressed tensor   for the original tensor  is defined as: () :=  () *   () Naïvely executing the computation using this formula can even make the performance worse.After performing simplifications, the code generator produces code that iterates over the domain provided by the unique set for operations and excludes all the other elements of that tensor.This way, no extra computational cost is imposed while computing the result.When the computation is done, the uncompressed tensor  is retrievable by using   .Example -Upper Triangular Matrix Compressed Tensor.A  ×  upper triangular matrix compressed tensor is calculated as follows: When  (, ) appears in a computation, the optimizer converts it to   (, ) and only uses elements with 0 ≤  ≤  < .Therefore, only half of the elements are used in the computation, which leads to a ∼ 2× speed up.

Translating Structured Linear Algebra to STUR
Operations.The representation of linear algebra operations is shown in Figure 7. Imagine the matrices  1 and  2 with dimensions  1 ×  1 and  2 ×  2 are represented in STUR as ⟦ 1 ⟧ and ⟦ 2 ⟧, respectively.The element in -th row and -th column of  1 •  2 where • is an operator is specifies that element (, ) of the output is taken from ⟦1⟧ when  <  and  <  and from ⟦2⟧ when  ≥  and  ≥ .Structures. Figure 8 shows the representation of well-known matrix structures in STUR.Structures are translated to a combination of a unique set and a redundancy map.For example, if matrix  has a symmetric structure, the upper triangular section of that is counted as the unique set.Therefore, the lower triangular region is considered as the redundant section; the redundancy map keeps the transformations from the lower triangular region to the corresponding upper region.Hence, if the representation of  in STUR is  , the unique set and redundancy map are provided by the last rule of Figure 8.When  < , according to   (, ,  ′ ,  ′ ), the values of  (, ) will be retrieved from values in  ( , ).

COMPILATION
In this section, we show how StructTensor symbolically computes and propagates the structure.Optimizing the tensor computation using the inferred structure leads to structure-aware code generation.

Structure Inference
Inference for unique sets.Figure 9 shows the inference rules for the output structure.These rules follow the assumption that input redundancy maps are empty and only capture sparsity patterns.Since all operations are translated to multiplication, addition, or projection in STUR, providing inference rules only for these operations is sufficient.Moreover, having more than two operands is handled by breaking them into subexpressions with two operands and storing the results in intermediate sets.
In the case of multiplication, the result is non-zero only if both inputs are non-zero values.Having a zero operand in multiplication makes the final result zero.Therefore, the unique set of output is calculated from the intersection of non-zero elements in both operands.Addition on the other hand is non-zero even if one of the operands is non-zero.Hence, any element in the unique set of operands can lead to a non-zero element in output.Consequently, the unique set of output is the union over the unique set of its operands.Since projection is defined by summing up all values, it follows the same rule as addition.Any non-zero element in the unique set of the input can create a non-zero element in the output.So the output unique set for projection is computed by unioning over all unique set values in that dimension.Inference for redundancy maps.The redundancy map of the output is required in order to have access to every element in the output tensor and reconstruct it.Figure 10 shows the rules to infer Special cases: output redundancy map and unique sets.Rules for the tensor outer product, addition, direct sum, and repetition are provided.For the cases where there is no rule in Figure 10, first, the redundancy maps of inputs are set to empty by combining all non-zero elements from the unique set and redundancy map.Then, rules from Figure 9 are applied to the computation to calculate their unique set.
Example -Structure Inference.Assume the following definition: Furthermore, consider the dimensions as ( 1 ,  2 ,  3 ) =  () and ( 4 ,  5 ) =  ( ), and the unique set and redundancy map of inputs are: ) * ( ′ 5 =  4 ) Then output unique set and redundancy map are: The provided set of rules in Figures 9 and 10 are for now relatively minimal and are meant to be extensible.
For example, the rule for self-multiplication can be easily extended to consider higher-degree self-multiplications.The idea is that our rewriting system will always choose the most specialized (and therefore optimizing) rule first.

Optimizations
Rule Inlining.The intermediate tensors, unique sets, and redundancy maps are materialized.Inlining these definitions can improve performance in various ways.First, one can avoid the materialization overhead.Second, the inlining is an enabling transformation that can open opportunities for further optimizations.During the inlining, the tensor index variable needs to be alpha renamed in order to avoid capturing free variables.Logical Simplifications.After inlining, the factors inside unique sets and redundancy maps might be repetitive or result in ∅ or other simple rules.Logical simplification removes the repetitive conditions.Furthermore, there could be contradicting conditions that will result in ∅ (e.g., ( ≤ ) * ( > ) lead to ∅).Multiplying by ∅ results in ∅ (i.e., ∅ *  = ∅), and it is the neutral element for addition (i.e., ∅ +  = ).Example -Structure Optimization.Consider the previous example.After inlining, the unique set and redundancy map are transformed as follows: By further simplifying the redundancy map by propagating the rules for ∅, we have:

Use case 1: structured linear algebra
In this section, we show that StructTensor can recover the output structure of linear algebra operations by using the inference and optimization rules presented earlier.Next, we provide two examples.
Example -Upper-triangular Hadamard product by Symmetric (UHS).Assume two square matrices  and  with dimension  ×  and with upper triangular and symmetric structures, respectively.The element-wise multiplication is represented in STUR as follows: (, ) :=  (, ) *  (, ) The unique sets and redundancy maps of the input matrices are represented as follows:

Code Generation
As the final step, StructTensor generates C++ code for the optimized STUR expression.To generate loop nests, first, the index variables are ordered following the same syntactic order as the input tensor expression2 .Afterward, StructTensor computes the range of each index variable based on the output unique set.The range for each index variable is computed as the maximum of all its lower bounds and the minimum of all its upper bounds.
Knowing the lower and upper bound of variables is enough to generate the loop nests.By taking the same steps over the redundancy map, boundaries can be determined, and the final result can be reconstructed.If there is a chain of computations, first all of them are done in the compressed space, and then the final result is reconstructed based on the output structure.
To generate code for a program that is written in the form of a sum of products, our code generator iterates over each multiplication, and generates loop nests per multiplication.In the deepest loop nest, the value of the result is updated by the value of the tensor expression appearing in the body of the rule.
StructTensor hoists loop-invariant expressions outside the loops whenever possible.However, there is no data-layout optimization applied [Chou et al. 2018;Kandemir et al. 1999;Schleich et al. 2022]; an  dimensional tensor is stored in an  dimensional C++ array.Further optimizations such as multithreading and vectorization are also left for the future.Example -Vector Self-Outer-Product.The outer product of a vector with size  by itself in STUR is defined as follows.
(, ) 10, output structure is: The structure is fed to the code generator, and compressed computational loop nests are generated based on the unique set of .Since all the elements in matrix  should be accessible in the end, reconstruction code is generated based on the redundancy map of the output.Figure 11 shows the computation and reconstruction code snippets for this program.

SOUNDNESS OF STRUCTURE INFERENCE
We present a simple and more mathematical view of what unique sets and redundancy maps compute.We then derive several properties that they satisfy.Furthermore, we state a soundness theorem for our structure inference, formally anchoring the fact that they do not lose information and compute the same as the original tensor.
6.1 Abstract view on the reduction problem Abstractly, we can view the problem as follows.Let T  1 ,...,  (R) be the real vector-space of  1 ×. ..×  real tensors.Given a fixed  ∈ T  1 ,...,  (R), we study the problem of finding a pair of linear maps (  ,   ) : T  1 ,...,  (R) → T  1 ,...,  (R) such that the following simple equation holds We call such a pair (  ,   ) a solution to the reduction problem for  .The intuition is that   represents a projection,   ( ) represents a compressed version of  , and   ensures that this compression is lossless.Note that there are a lot of solutions to this problem, e.g.  could simply be the identity, and we are naturally interested in solutions that approximately minimize the number of non-zeros elements of the compressed tensor   ( ).Finding optimal solutions is also feasible but would make us lose time overall for the final computations we are interested in.Note that, as   ,   are linear, they can be represented by  1 × . . .×   ×  1 × . . .×   tensors.In this work, we will restrict to the cases where   ,   can be represented with tensors valued in {0, 1} and both satisfy that, when seen as matrices, the sum of the elements on each row is at most 1.Such matrices are called partial injective relations.We further ask   to be a projection.In such a case,   is an orthogonal projection and therefore verifies   •   =   (an orthogonal projection is automatically a partial injective relation).Any partial injective relation  satisfies the nice property that  ( ⊙ ) =  () ⊙  () for all , , where ⊙ is the Hadamard product.In fact, this is a characterization of partial injective relations.
We indistinguishably use tensors and the linear maps they represent, assuming fixed bases of the underlying vector spaces at play.From the abstract setting, the tensors   ,  ,  we infer and use in our unified IR are derived as follows: In other words,   is the support of the compressed tensor   =   ( ), and   is, up to a small optimization, the support of the tensor representation of   .From Equation 3 and the definitions of   ,  ,  (Equation 4), we obtain several properties that are key in reconstructing  and for optimisations.
Proposition.The following Properties are valid equations, where  is the list of free variables in each case: (1)   () *   (,  ′ ) = ∅.This optimisation ensures that unique elements should not be mapped to any other element by   .(2)   () =   () *  () is the equation defining the compressed tensor in our IR.
(3)   () *   (,  ′ ) = ∅.Elements that are stored in the compressed format should only exist in the unique set.
(5)  () =   () +   (,  ′ ) *   ( ′ ).This is the reconstruction process of the tensor  given its compressed format   , its unique set   , and its redundancy map   .This captures the key lossless property of the contraction.
Proof sketch.Note the following extra elementary properties that we will use in the proofs. ⊙ ⌊⌋ =  for any  and  () ⊙  =  ⊙  () =  () ⊙  () for any ,  and orthogonal projection  (which is a restatement of a well-known characterization of orthogonal projections in terms of inner products). (2) (3) is easily obtained from (1) and the fact that   commutes with ⌊−⌋. (4)

Soundness results
Given a tensor  , we write I  : (  ,   ) → (  ,  ,  ) for the mapping sending a pair of linear maps verifying Equation 3 to the tensors defined by Equation 4. We say that (  ,  ,  ) implements the solution (  ,   ) to the reduction problem for  .
In this work, we never explicitly construct   and   , but we do inductively define implementations (  ,  ,  ) of solutions to the reduction problem for  , by induction on the structure of  .The first natural question is therefore whether the inferred tuple (  ,  ,  ) is sound, that is whether is obtained as an implementation of a solution for the reduction problem for  .This is a sufficient condition, as by Property 5 above we can reconstruct  from such a tuple (  ,  ,  ).Theorem.Let ,  be tensors.Assume given pairs (  ,   ) and (  ,   ) that are solutions to the reduction problem for  and  , respectively.Let (  ,   ,   ) and (  ,   ,   ) be their respective implementations.Further, assume that  is given by the premise of an inference rule from Figure 10.Then, there exists a solution (  ,   ) for the reduction problem for  such that its implementation (  ,  ,  ) is given by the conclusion of the inference rule corresponding to  's definition.Proof sketch.We sketch the proof for some of the important cases.A key rewrite we use in the optimization phase is inlining definitions.This operation is sound in our language.

EXPERIMENTAL RESULTS
In this section, we experimentally evaluate StructTensor by considering several tensor processing kernels over different tensor structures.We study the following questions: • How does leveraging the redundancy structure affect the run-time performance and in which cases StructTensor performs better than the state-of-the-art dense tensor algebra frameworks?• How advantageous is symbolic sparsity over dynamic sparsity?• How practical is StructTensor in real-world applications?Is it worthwhile to perform the computation over a compressed tensor and then decompress the result in practice?

Experimental Setup
We use a server with operating system 18.04.5 LTS Ubuntu equipped with 10 cores 2.2 GHz Intel Xeon Silver 4210 CPU, and 220 GB of main memory.C++ code is compiled with GCC 7.5.0 using C++17 and -O3, -pthread, -mavx2, -ffast-math, and -ftree-vectorize flags.All experiments are run on a single thread, and the average run-time of five runs is reported.As the competitors, we use the latest version of TACO3 [Kjolstad et al. 2017], Numpy 1.23.4 [Harris et al. 2020], PyTorch 1.12.1 [Paszke et al. 2019], and TensorFlow 2.10.0 [Abadi et al. 2015] with and without the XLA backend.

Redundancy Structure
We consider six structured matrix kernels that correspond to the task of creating and addition of covariance matrices for three machine learning models: linear regression, polynomial regression degree-2, and polynomial regression degree-3.As all the matrices only involve redundancy structure, we need to use the dense representation in all systems.As the competitors, we consider TACO Polynomial Regression Degree 3 -Addition Fig. 12.The run time comparison of the covariance matrix creation and addition for linear regression and polynomial regression degree-2 and degree-3.For linear regression dense frameworks show better performance, as the redundancy structure does not pay off.For models with higher redundancy, StructTensor significantly outperforms the competitors thanks to avoiding redundant computation.(fully dense format), PyTorch, TensorFlow (w/ and w/o the XLA backend), NumPy, and a naïve implementation without leveraging the redundancy structure.Figure 12 shows the run time comparison.The missing numbers mean that the process was killed due to a segmentation fault.For example, TACO can only handle the polynomial regression degree-2 up to 200 features because it gets killed for more than that.For the kernels with more redundancy (i.e., polynomial regression degree-2 and degree-3), StructTensor outperforms all the competitors from one to two orders of magnitude.However, in linear regression StructTensor only removes half of the computation and because of a lack of data-layout optimizations, loop tiling, and vectorization, performs worse than dense tensor frameworks such as PyTorch and TensorFlow.

Sparsity Structure
To evaluate the effectiveness of StructTensor for the sparsity structure, we consider three tensor kernels: Tensor Times Matrix (TTM), Tensor Hadamard Product (THP), and Matricized Tensor Times Khatri-Rao Product (MTTKRP).StructTensor is evaluated against TACO with three different formats: fully dense, fully sparse, and smart.In the smart version, we use the most efficient sparse format for TACO based on the sparsity of input and output tensors at each dimension.Table 1 shows the definition of these kernels, the different input structures we considered, their representation in STUR, and the data format at each dimension for TACO (Smart). 4For StructTensor we consider an additional naïve version that does not leverage the symbolic structure.
Figure 13 represents the run time of each implementation on each kernel.In all kernels i) TACO dense is performing similarly to the naïve implementation, ii) TACO smart is outperforming other data format selections for TACO as expected, and iii) StructTensor outperforms (in 7 out of 9 experiments) or performs on par with TACO smart (despite having a better data layout and generating a more optimized code).

End-to-end Experiments
We evaluated StructTensor on an in-database machine learning task.We used StructTensor to create a covariance matrix for linear and polynomial regression degree-2.We consider the following real-world datasets: Retailer [Khamis et al. 2020]   and full version.StructTensor is compared with a naïve implementation that does not leverage the redundancy structure.
We use the idea of a semi-ring covariance matrix data-structure [Nikolic and Olteanu 2018;Shaikhha et al. 2022] for both implementations.This data structure pushes joins after aggregates, which heavily improves both memory consumption and run time.In a semi-ring covariance matrix, different degrees of interaction are decoupled and stored separately.This means that for polynomial regression degree-2 (cf.Section 3), we store degree-2 (1), degree-3 (2), and degree-4 (3) interactions separately.Furthermore, degree-3 interactions are calculated twice in both (f ⊗  (f ⊗ f)) and ( (f ⊗ f) ⊗ f).As the naïve version does not use the structure information, it cannot detect such coarse-grained redundancy information.
Figure 14 represents the run time of StructTensor in comparison with the naïve version.StructTensor produces a structure-aware code that reduces the computations by avoiding redundant computation.Hence, there should be a reconstruction phase that rebuilds the final result and put all elements that have not been computed in their corresponding positions based on the redundancy map.However, the reconstruction time is negligible in comparison with the rest of the computation, as it can be seen in Figure 14.

RELATED WORK
Dense Tensor Algebra.Polyhedral frameworks such as isl [Verdoolaege 2010] and CLooG [Bastoul 2004] provide efficient loop nest code generation for dense and affine tensor algebra computations.Tensor Comprehension [Vasilache et al. 2018] provides a polyhedral-based DSL supporting generalized Einstein notation that leads to optimized Cuda code generation for dense deep learning computation.TensorFlow [Abadi et al. 2015] provides dense tensor operation for large-scale machine learning applications.5All these works provide efficient kernels for dense linear and tensor algebra but for structured tensors, still, they do unnecessary and/or redundant computations.Sparse Tensor Algebra.The sparse polyhedral framework [Strout et al. 2018] extends the ability of polyhedral compilation to support sparse tensor algebra as well.TACO [Kjolstad et al. 2017] handles sparse and dense computation over tensor algebra.However, unlike StructTensor, none of these work support redundancy-aware computation.Moreover, sparsity is handled in run-time that leads to irregular memory access that are hard to optimize for the compiler [Tang et al. 2020  Specialized Sparse Linear Algebra.[Augustine et al. 2019] take a different approach by breaking the irregular sparsity patterns into sub-computations with regular structures so they can remove indirect access and provide vectorization to the linear algebra code.EGGS [Tang et al. 2020] further specializes the computation to a sparsity pattern and creates the expression tree of the result by unrolling the entire computing.Performing common-subexpression over the expression tree can partially remove redundancies, but cannot detect symmetric-style patterns.StructTensor infers the redundancy patterns at compilation time to specialize the generated code.Structured Linear Algebra.
LGen [Spampinato and Püschel 2016] proposes a polyhedral-based technique for code generation of small-scale structured linear algebra.The unique elements are provided in SInfo, and the redundancy information is kept in AInfo.However, it does not support higher-order tensor computations and is limited to fixed-size small-scale matrices.To the best of our knowledge, there is no previous work supporting the structure for higher-order tensors.
Declarative Data Processing Languages.STUR is closely connected to logic languages such as Datalog and Prolog.There are two main differences.First, STUR does not allow recursive definitions.Second, in addition to sets, STUR allows for aggregations over tensors.Dyna [Eisner and Filardo 2010] extends Datalog by adding to support for maps to real numbers rather than only supporting sets, which are maps to boolean values.FAQ [Khamis et al. 2016] is the main source of inspiration for STUR, which allows for a combination of different semi-rings.The main focus for FAQ has been on efficient algorithms for evaluating sparse tensor contractions appearing in database query engines [Khamis et al. 2020] (e.g., worst-case optimal joins).STUR provides additional constructs for arithmetic operations over the indices and restricting the ranges, which are crucial for efficient structured tensor computations.

CONCLUSION & OUTLOOK
In this paper, we presented StructTensor, a compiler for structured tensor algebra.We considered two classes of structures: (1) sparsity patterns, and (2) redundancy structures.We proposed STUR, a unified IR that is expressive enough for tensor computation and captures both forms of structures.
We have shown the soundness of transformations and inference rules.Finally, the experimental results show that StructTensor outperforms the state-of-the-art tensor processing libraries.
For the future, we see two clear directions for improvement.First, StructTensor mainly focuses on structure-specific optimizations, and apart from code motion, does not do anything smart for loop optimizations.It would be interesting to use polyhedral-based optimizations (e.g., using CLooG [Bastoul 2004]) to generate efficient loop nests.Second, for storing tensors, we mainly use  − arrays, and even the compressed tensors still allocate the memory for the entire uncompressed tensor.Using better layouts can significantly reduce the memory pressure, and would possibly improve data locality.

Fig. 4 .
Fig. 4. The covariance matrix of Polynomial Regression degree-2.Its redundancy pattern is shown on the left and its unique elements in compressed format are on the right.

Fig. 5 .
Fig. 5. Grammar of STUR.The meta variables  and  range over the name of index variables and tensors, respectively.

Fig. 7 .
Fig. 7. Representation of linear algebra operations in STUR.Here, • and ⊗ are overloaded; they correspond to matrix multiplication/Kronecker product for matrices and inner/outer product for vectors, respectively.
Fig. 9. Inference rules for unique set.Here we assume the redundancy maps of inputs are empty.

Fig. 11 .
Fig. 11.Code snippets for linear regression covariance matrix creation compressed computation (left) and result matrix reconstruction (right).

Fig. 13 .
Fig. 13.Tensor operation kernels run time comparison using different input and output structures.The naïve version of StructTensor and the dense format of TACO show similar performance.In most cases, the symbolic sparsity employed in StructTensor shows significantly better performance in comparison with the best sparse format used in TACO.

Fig. 14 .
Fig. 14.Run time of in-database machine learning over Favorita and Retailer datasets.Here, we compute the covariance matrix of linear and polynomial regression degree-2 over the join of multiple relations.

Table 1 .
List of tensor kernels for StructTensor evaluation.TACO allows specifying if each dimension is Sparse () or Dense ().TACO (Smart) specifies the most efficient storage format.
with 13 continuous features and Favorita [Favorita 2017] with 6 continuous features.Both of them have a small (consisting of 25% of the data elements) ].