Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions

Optimizing deep neural network (DNN) execution is important but becomes increasingly difficult as DNN complexity grows. Existing DNN compilers cannot effectively exploit optimization opportunities across operator boundaries, leaving room for improvement. To address this challenge, we present Souffle, an open-source compiler that optimizes DNN inference across operator boundaries. Souffle creates a global tensor dependency graph using tensor expressions, traces data flow and tensor information, and partitions the computation graph into subprograms based on dataflow analysis and resource constraints. Within a subprogram, Souffle performs local optimization via semantic-preserving transformations, finds an optimized program schedule, and improves instruction-level parallelism and data reuse. We evaluated Souffle using six representative DNN models on an NVIDIA A100 GPU. Experimental results show that Souffle consistently outperforms six state-of-the-art DNN optimizers, delivering a geometric mean speedup of up to 3.7× over TensorRT and 7.8× over TensorFlow XLA.


Introduction
No day goes by without hearing about the success of deep neural networks (DNNs). Indeed, advanced DNNs have demonstrated breakthrough effectiveness in solving a wide range of tasks, from drug discovery [11,16] and self-driving cars [28] to e-commerce [26,59].
A DNN model is typically expressed as a computational graph using deep learning (DL) programming frameworks like TensorFlow [2] and PyTorch [41]. By separating the expression of the computational graph from the implementation of low-level operators, DL frameworks abstract away the hardware complexity and have become the standard method for writing DNN code. However, using high-level programming abstractions presents significant challenges for low-level performance optimization, especially during model inference when deploying a trained model in a production environment where the response time is crucial [17,21].
Efforts have been made to perform optimizations across operator boundaries to increase parallelism, decrease memory access traffic, or utilize memory bandwidth more efficiently. One promising approach is operator/kernel fusion, which merges multiple operators into a single kernel to enable analysis and optimizations across operators. This line of research includes works using hand-crafted rules [37], loop analysis [56], or just-in-time compilation [58] to guide and perform fusion. Typically, these methods use a bottom-up strategy: they first perform operator fusion in the graph representation to merge multiple operators into a partition, and then generate an optimized kernel for each partition. However, a key challenge is determining the optimal boundaries of partitions, i.e., which operators should be fused together.
Despite the success of bottom-up approaches to operator/kernel fusion, optimization opportunities can still be overlooked. One such issue arises from separating the operator fusion and code generation stages. This can place operators into the wrong kernels, leading to extra memory access overhead and preventing otherwise possible optimizations. As we will show in the paper, state-of-the-art kernel fusion methods can miss important optimization opportunities, leaving much room for improvement.
We present Souffle, a novel top-down approach for optimizing inference across DNN operator boundaries. Unlike bottom-up strategies, Souffle first processes the whole computation graph as a single, merged kernel and then divides the graph into partitions, i.e., subprograms, through a global analysis from the top level, considering data reuse in shared memory/registers and the generated instructions when determining partitioning boundaries. Each partition is organized into a kernel. Afterwards, at the bottom level, Souffle performs a series of semantic-preserving transformations for each subprogram to simplify the tensor expressions and eliminate redundant memory accesses in the corresponding kernel. To this end, Souffle introduces two new mechanisms: a tensor-expression-based global analysis to identify critical partitioning points, and a semantic-preserving transformation approach that uses affine transformations to simplify the tensor expressions of each subprogram. Compared with existing bottom-up fusion approaches, the benefit of our top-down approach is that it determines kernel boundaries globally, by considering the generated code of the kernels.

Tensor-expression-based global analysis. Souffle conducts global dependence analysis on a tensor dependency graph generated from the entire DNN model. It utilizes tensor expressions (TEs) [12] to encode dataflow information of operators and tensors. By mapping higher-level operators to simpler TEs, Souffle performs data-flow analysis and code optimization on these TEs, reducing the complexity of analysis and optimization and resulting in better code. TEs offer concise semantics, allowing us to translate the task of analyzing and optimizing complex operators into the more manageable problem of analyzing and optimizing simpler mathematical expressions. For instance, a softmax operator can be represented by two TEs with simpler data dependence relationships: one is a one-relies-on-many TE (reduction), and the other is a
one-relies-on-one TE (element-wise). Since Souffle's analysis is conducted on the TEs without making any assumptions about low-level library calls, it can optimize across complex operators, even when the operators have complex data dependencies like many-to-many, where other methods fail to do so.

Semantic-preserving transformation. After the top-level stage, the computation graph has been divided into multiple subprograms, with each subprogram mapped to a kernel. However, each subprogram contains a large number of TEs, which would introduce many redundant memory accesses across these TEs. Therefore, Souffle applies affine transformations to combine multiple TEs into a single TE, thus eliminating the redundant memory accesses. This process is performed within the subprograms and relies on the TE-based global analysis. The transformation is fully automated and flexible, as the tensor expression precisely describes the mathematical computation of operators in a simple form.

Putting it all together. Souffle first conducts data-flow analysis on the tensor dependency graph of the entire DNN model using TEs. This analysis captures essential information such as tensor shapes and live ranges across operator boundaries, allowing for precise element-wise analysis to infer data dependence. Souffle then partitions the TEs into subprograms using compiler heuristics and conducts local optimization within each subprogram using semantic-preserving mathematical transformations to reduce memory accesses. The optimized subprogram schedule is found by considering the computation characteristics of the subprogram's TEs. With precise dependence information at the TE level, Souffle can optimize memory access latency by reusing tensor buffers and improve instruction-level parallelism by overlapping memory load and arithmetic instructions. Since Souffle's code optimizations are based on subprograms of fused operators rather than individual operators, the optimization boundary of operators is
eliminated.

Evaluation. We have implemented a working prototype of Souffle on TVM. We evaluate Souffle on six DNN models with diverse and representative model architectures and compare it against six state-of-the-art DL optimizing and kernel fusion frameworks, including XLA [27], Ansor [57], TensorRT [38], Rammer [33], Apollo [56], and the MLIR-based IREE compiler [1]. Our evaluation, performed on an NVIDIA A100 GPU, shows that Souffle outperforms existing solutions, delivering a geometric mean speedup of up to 3.7× and 7.8× over TensorRT and XLA, respectively. Souffle is highly flexible and can fuse operators where state-of-the-art kernel fusion strategies fail. It is compatible with TensorFlow and ONNX [13] models, and can be integrated with general DL compilers like TVM [12,57].

Contributions. This paper makes the following contributions:
• It presents a new top-down approach for identifying and exploiting optimization opportunities across operator boundaries (Sec. 5);
• It shows how to effectively leverage the global analysis to perform local optimization at the kernel level represented as tensor expressions (Sec. 6);
• It demonstrates how low-level tensor expressions can be employed to perform instruction optimizations beyond operator boundaries (Sec. 6.5).

Working Example
As a motivating example, consider optimizing a standard BERT model [14] on an NVIDIA A100 GPU. This model is based on the Transformer architecture [10,49] and uses FP16 for inference. Fig. 1 depicts how TensorRT and Apollo map operators of a simplified sub-computation graph from BERT into kernels. This subgraph contains representative DNN operators like general matrix multiplication (GEMM), reshape, permutation, element-wise arithmetic operators like add or exp, and reduction operators like reduce_sum. How the compiler maps these operators to individual kernels significantly impacts performance.

Performance Evaluation
We measure the resulting kernels using NVIDIA Nsight Compute [39]. Table 1 shows that neither TensorRT nor Apollo provides an optimal mapping for the evaluated DNN. The subgraphs created by TensorRT and Apollo in Fig. 1 load 16.52MB and 27.78MB of data from global memory, giving execution times of 62.34 and 179.1, respectively. A better strategy, which is the one chosen by our approach, is to refine and map the subgraph into a single kernel. This strategy reduces the number of bytes loaded from global memory to 8.87MB with an execution time of 57.7, translating to 1.1× and 3.1× faster running time than TensorRT and Apollo, respectively. We want to highlight that TensorRT has been specifically tuned for Transformer-based models with closed-source, hand-optimized low-level operator implementations (like GEMM). Therefore, we consider the Souffle improvement over TensorRT on BERT to be significant, given that Souffle does not have access to some of the NVIDIA-optimized operator implementations used by TensorRT. Furthermore, as we will show later in Sec. 8, Souffle also significantly outperforms other DNN compilers, including XLA and IREE, on this DNN model.

Missed Opportunities
After closely examining the profiling data and the kernel fusion outcomes, we identified several optimization opportunities that TensorRT and Apollo miss:

Failure to optimize between memory- and compute-intensive kernels. As depicted in Fig. 1, part of BERT requires performing element-wise memory operators, e.g., reshape and permutation (element-wise memory operators 2 and 3 in Fig. 1). TensorRT and Apollo leverage manually crafted rules to fuse adjacent element-wise memory operators together, but both fail to further optimize between the fused operators and their preceding compute operators, e.g., the GEMM operators in Fig. 1. Souffle performs optimization between memory- and compute-intensive kernels, and eventually eliminates all element-wise memory operators. In summary, manually crafted rules cannot cover a diverse set of computation patterns and miss the optimization opportunity in this case.

Suboptimal fusion strategy for reduction operators. Both strategies choose to map the GEMM and the reduction operator to separate kernels, which requires storing the entire tensor data that reduction operators rely on to expensive global memory before the reduction occurs. Souffle aggressively fuses reduction operators with adjacent computation-intensive operators, such as R0-2 (R for reduction operator) with GEMM0 and GEMM1, as shown in Fig. 1(c). This is achieved through a two-phase reduction approach: performing a partial reduction in a per-block fashion and using atomicAdd for the global reduction. As a result, the entire tensor data can be kept on-chip, with only the partial result stored in global memory. A global synchronization (e.g., grid synchronization in CUDA [40]) is inserted to synchronize running blocks, as shown in Fig. 1(c). This optimization applies to all reduction operators in Fig. 1. Moreover, Souffle can cache the output of arithmetic operator 1 on-chip for reuse in arithmetic operator 2.
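The two-phase reduction can be sketched in plain Python (our illustration, not Souffle's code; the block size and the atomicAdd stand-in are assumptions):

```python
# Sketch of two-phase reduction that keeps per-block data on-chip.

def two_phase_reduce(data, block_size):
    """Phase 1: each 'thread block' reduces its slice locally
    (on-chip in the real kernel).  Phase 2: the partial results are
    combined via an atomicAdd-style accumulation in global memory."""
    partials = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        partials.append(sum(block))          # per-block partial reduction
    total = 0
    for p in partials:                       # stands in for atomicAdd
        total += p
    return total

print(two_phase_reduce(list(range(10)), 4))  # 45
```

Only the partial sums ever leave the "block"; the full input slice stays local, mirroring how the fused kernel keeps tensor data on-chip.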
Poor optimization across computation-intensive kernels. Like many other DNN frameworks, TensorRT and Apollo try to fuse multiple computation-intensive operators of the same type, but fail to capitalize on opportunities across different types of operators. Consider Fig. 1(d), which shows how two dependent GEMM operators execute asynchronous memory copies and tensor core computations when they are grouped into kernels under two different strategies. The first is to map the GEMM operators into two separate kernels, as TensorRT and Apollo do not consider fusing compute-intensive operators; the second is to map them to a single kernel. TensorRT and Apollo use the former, and Souffle uses the latter. By putting the two GEMM operators into one single kernel (right part of Fig. 1(d)), Souffle allows the pipelined execution of loading W3 of GEMM3 while computing GEMM2. Souffle is designed to exploit such cross-operator pipeline optimizations.

Our Insights
Based on the observations outlined earlier, there is a need to analyze DNN models to fuse operators, perform automatic transformations on tensor operations, and optimize within a fused kernel. A more effective kernel fusion strategy makes it possible to extract crucial tensor information such as live ranges and tensor data reuse. This information can then be used to analyze fine-grained dependencies at the element-wise level, leading to better kernel-level optimization.

Preliminaries
Souffle utilizes TVM's tensor expression (TE) [12] as an intermediate representation for analysis and optimization. A TE specifies how output elements are computed from input tensors. In the TE program shown in Figure 2, TE0 is an example TE for GEMM, where the rk parameter defines the reduction axis (i.e., the tensor dimension that is traversed to perform the reduction), with a reduction index ranging from 0 to 64. The output tensor O0 is computed using the compute operation, which specifies the computation to be performed on each data element and the output shape.
The iteration variables i and j correspond to the output shape dimensions, and their iteration domains can be inferred naturally. Essentially, TE uses a pure functional language [47] to describe tensor computation, allowing each output tensor element to be computed individually. Note that our optimizations also apply to other DSLs like antares [34] and tensor comprehension [48] with similarly concise semantics and expressiveness. We choose TVM due to its popularity and established toolchain.
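As an illustration (ours, not taken from the paper's code), the semantics of TE0 can be mimicked in plain Python, with a hypothetical gemm_te standing in for TVM's te.compute form:

```python
# Each output element O0[i, j] is a reduction over rk in [0, 64),
# just as the TE's compute expression states.

K = 64  # reduction extent of rk, matching TE0

def gemm_te(A, B, M, N):
    """O0[i, j] = sum over rk of A[i, rk] * B[rk, j]."""
    return [[sum(A[i][rk] * B[rk][j] for rk in range(K))
             for j in range(N)]
            for i in range(M)]

A = [[1] * K for _ in range(2)]   # 2 x 64, all ones
B = [[1] * 3 for _ in range(K)]   # 64 x 3, all ones
print(gemm_te(A, B, 2, 3))        # every element equals 64
```

The iteration variables i and j range over the output shape, while rk is the reduction variable with a fixed, statically known domain; this is exactly the information Souffle's later dependence analysis exploits.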

Overview of Souffle
Souffle is our open-source framework for DNN code optimization. It is designed to overcome the three limitations of existing DNN inference frameworks identified in Section 2: it enhances data reuse, optimizes reduction operations, and enables optimization across operator boundaries. Currently, it supports TensorFlow models and optimizes DNN inference on a single NVIDIA GPU, but many of our analyses and optimizations can be applied to AMD GPUs and other accelerators. Fig. 2 shows an overview of the Souffle workflow, which takes a model as input and uses TVM to lower the model down to TEs, on which we perform analysis and optimization.

TE lowering. For a DNN model, Souffle first lowers each operator to its corresponding TEs to form a TE program. Fig. 2 shows that the five operators are lowered to five TEs, marked TE0 to TE4.

Global computation graph analysis. The lowered TE program is passed to the Souffle analysis module. At the element-wise level, Souffle analyzes the fine-grained dependencies between the output and input tensors of each TE, as described in Sec. 5. Fig. 2 shows the analytical results, including the element-wise data dependency and computational complexity, for the five TEs. At the tensor level, it finds that O0 is accessed by TE1 and TE3, which reveals a data reuse opportunity across multiple TEs.
Resource-aware program partitioning. Souffle analyzes the tensor dependency graph and uses Ansor [57] to schedule compute-intensive TEs. It partitions the input TE program into multiple subprograms based on resource usage and transforms each subprogram into a computation kernel. For example, in Fig. 2, if the number of blocks of TE4 exceeds the max-blocks-per-wave limit, Souffle partitions the TE program into two subprograms. The first subprogram includes TE0, TE1, TE2, and TE3, while the second includes TE4.

TE transformation. The subprograms, together with the data-flow analysis and tensor information, are sent to the TE transformation module to generate semantic-preserving but optimized TEs. In Fig. 2, the computation of TE1 and TE3 is scheduled into the inner loops of TE0 and TE2, respectively. TE scheduling and transformation are explained in Sec. 6.
Joint optimization and code generation. The transformed TE subprograms are fed to a schedule optimizer (Ansor [57] in our case) to generate a schedule for each TE subprogram.
Next, Souffle composes the schedules within a subprogram into one single function, represented in TensorIR [15], for joint optimization of instructions and data reuse within the subprogram. Finally, the optimized subprogram is passed to the back-end code generator to produce CUDA kernels. In Fig. 2, ldg2s stands for loading from global memory to shared memory, wmma_16x16 stands for warp matrix multiply-accumulate, and sts2g stands for storing from shared memory to global memory. Souffle wraps each TE's corresponding code in an if statement to match the launch dimensions and inserts global sync primitives (grid.sync() in this example) to synchronize data across thread blocks. SO0 is cached in shared memory and reused across operator boundaries (TE1 and TE3 in this working example). We describe these procedures in Sec. 6.5.

Global Computation Graph Analysis
Souffle applies two levels of analysis on the TE tensor dependency graph. The first identifies data reuse opportunities, and the second extracts the element-wise data dependence.

Identifying data reuse opportunities
Tensors are often reused in both temporal and spatial dimensions, providing opportunities for exploiting data reuse to eliminate expensive memory operations. As discussed in Section 2.3, there are two types of tensor reuse in our working example, shown in Fig. 1(a) and Fig. 1(b). First, the three GEMM0 operators share the same input tensor, which can be reused spatially. Fusing the three GEMM0 operators into one kernel would allow us to load the input once and reuse it three times across the GEMM0 operators. Such reuse opportunities are common in DNNs, including recurrent neural networks [44], convolutional neural networks [29,45,55], and Transformer models [31,49]. Spatial data reuse optimizations apply to tensors that are consumed by operators that have no data dependencies; these operators are horizontally transformed as described in Sec. 6.1. The second type of reuse opportunity manifests in the temporal dimension. Temporal data reuse opportunities apply to tensors that are used more than once by operators that have data dependencies, and guide the tensor reuse optimization described in Sec. 6.5. Consider again our working example in Fig. 1: the result of element-wise arithmetic operator 1 (termed A1) is used by two dependent operators, R1 and A2. Once again, accesses to global memory can be eliminated if we cache the computation output of A1 in registers/shared memory. Souffle identifies these data reuse opportunities from the TE tensor dependency graph at the tensor level by first traversing the computation graph to gather all the tensors accessed by more than one TE. For each such tensor T, it records the set of operators S(T) = {op_0, ..., op_n} that share T, to enable the code optimizations described in Section 6.
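A minimal sketch of this tensor-level scan, with hypothetical TE and tensor names mirroring the working example (O0 read by TE1 and TE3):

```python
# Traverse the TE graph and record, for each tensor T, the set S(T)
# of TEs that read it; tensors read by more than one TE are reuse
# candidates.

from collections import defaultdict

def find_shared_tensors(te_inputs):
    """te_inputs maps a TE name to the list of tensors it reads."""
    readers = defaultdict(set)
    for te, tensors in te_inputs.items():
        for t in tensors:
            readers[t].add(te)
    return {t: tes for t, tes in readers.items() if len(tes) > 1}

graph = {"TE1": ["O0"], "TE2": ["O1"], "TE3": ["O0", "O2"]}
print(find_shared_tensors(graph))  # {'O0': {'TE1', 'TE3'}}
```

The real analysis also distinguishes whether the sharing TEs are data-dependent (temporal reuse) or independent (spatial reuse), which this sketch omits.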

Intra-TE element-wise dependency analysis
Souffle captures element-wise data dependence from output to input tensors within a TE by defining the iteration space as the output shape, and the data space as the domain of the iteration and reduction variables for each input tensor. This simplifies the element-wise dataflow from input to output tensors, as we only need to record the relevant input tensor elements for an element of the output tensor. This information also enables reduction operator fusion at the TE transformation stage, which other optimization tools, such as TensorRT and Apollo, do not support.
Our key observation is that intra-TE data dependence falls into two categories. First, for a TE without a reduction axis (see also Section 3), each output tensor element relies on only one input tensor element (termed one-relies-on-one). Second, for a TE with a reduction axis, each output element relies on all the elements along the reduction dimensions of the input tensors (termed one-relies-on-many). With this observation, we can greatly simplify the dependence analysis process compared to the source-code- or operator-level analysis that other kernel fusion approaches rely on.
One-relies-on-one TEs. Souffle adopts quasi-affine maps [7,36] to represent the element-wise dependency of a one-relies-on-one TE. The mapping from output to input can be expressed in the form f(o) = A·o + b, where o is the index vector of the output tensor, A is a constant matrix from Z^(m×n), and b is a constant vector from Z^m. Here, n is the output tensor's dimension and m is the corresponding input tensor's dimension. Note that multiple indices of the output tensor may rely on the same index of the input tensor.

One-relies-on-many TEs. For a one-relies-on-many TE, Souffle extracts the region of the input tensor that is accessed by combining the iteration space and the input tensor's index function. As the iteration domain of the reduction axes is constant, the mapping can be expressed in the form M = {[o_0, ..., o_n] ↦ {[i_0, ..., i_m], [r_0, ..., r_k]}}, where [r_0, ..., r_k] is the set of reduction variables with their ranges. For instance, in relation R0 of our working example, the reduction variable ranges from 0 to 64. We stress that the element-wise dependency of compute-intensive operators like GEMM and convolution can be easily analyzed, as the tensor expression explicitly gives the reduction axes.
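A small illustration (ours, not from the paper) of a quasi-affine map f(o) = A·o + b, using a 2-D transpose as the one-relies-on-one TE:

```python
# A quasi-affine map from output indices to input indices.  For a
# transpose, A is the permutation matrix [[0, 1], [1, 0]] and b is zero:
# output element (i, j) reads input element (j, i).

def affine_map(A, b):
    def f(o):
        return tuple(sum(A[r][c] * o[c] for c in range(len(o))) + b[r]
                     for r in range(len(A)))
    return f

transpose = affine_map([[0, 1], [1, 0]], [0, 0])
print(transpose((3, 5)))  # (5, 3): output (3, 5) reads input (5, 3)
```

Because such maps are closed under composition, chains of one-relies-on-one TEs can later be collapsed into a single map, which is what the vertical transformation in Sec. 6.2 exploits.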
TEs with one-relies-on-one dependency are transformed as described in Sec 6.2, and TEs with one-relies-on-many dependency are scheduled as described in Sec 6.3 and Sec 6.4.

TE characterization
Souffle classifies a TE as memory-intensive (e.g., reduce_sum) or computation-intensive (e.g., GEMM) by estimating its compute-memory ratio. The ratio is computed by dividing the number of arithmetic instructions by the number of memory accesses; it thus represents the number of arithmetic operations per tensor element read or written. In this work, the classification threshold is empirically set to 3: a ratio less than the threshold indicates that the TE is memory-intensive.
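A sketch of this classification rule; the instruction counts below are illustrative, not taken from the paper:

```python
# Classify a TE by its compute-memory ratio against the paper's
# empirical threshold of 3.

THRESHOLD = 3

def classify(arith_ops, mem_accesses):
    ratio = arith_ops / mem_accesses
    return "compute-intensive" if ratio >= THRESHOLD else "memory-intensive"

# A GEMM-like TE: ~2*M*N*K arithmetic ops vs. M*K + K*N + M*N accesses.
print(classify(arith_ops=2 * 64 ** 3, mem_accesses=3 * 64 ** 2))
# A reduce_sum-like TE: roughly one add per element read.
print(classify(arith_ops=4096, mem_accesses=4096 + 64))
```

For square GEMM the ratio grows with the matrix size, so GEMM is essentially always compute-intensive, while pure reductions hover near a ratio of 1.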

TE Program Partitioning
Souffle tries to generate large kernels to maximize data reuse and eliminate extra kernel launches. However, using global synchronization imposes a constraint: the thread block count cannot exceed the maximum block count per wave. If this constraint cannot be satisfied, Souffle partitions the TE program into multiple TE subprograms. In Souffle, a TE subprogram serves as the fundamental unit for high-level TE transformation, middle-end schedule optimization, and back-end code generation. It can include several operators mapped to one GPU kernel.
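A simplified sketch of this partitioning step (greedy and per-TE; the real analysis considers the fused kernel's launch configuration and other resource usage). The limit value is a hypothetical example:

```python
# Greedily grow a subprogram until a TE's thread-block count exceeds the
# max-blocks-per-wave limit; such a TE gets its own kernel, since global
# (grid) synchronization requires all blocks to be resident at once.

MAX_BLOCKS_PER_WAVE = 108  # e.g. one block per SM on an A100 (illustrative)

def partition(te_blocks):
    """te_blocks: list of (te_name, block_count) in topological order."""
    subprograms, current = [], []
    for te, blocks in te_blocks:
        if blocks > MAX_BLOCKS_PER_WAVE:
            if current:
                subprograms.append(current)
            subprograms.append([te])     # oversized TE gets its own kernel
            current = []
        else:
            current.append(te)
    if current:
        subprograms.append(current)
    return subprograms

tes = [("TE0", 32), ("TE1", 32), ("TE2", 32), ("TE3", 32), ("TE4", 256)]
print(partition(tes))  # [['TE0', 'TE1', 'TE2', 'TE3'], ['TE4']]
```

This reproduces the Fig. 2 example: TE4's block count exceeds the limit, so TE0-TE3 form one subprogram and TE4 another.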

Semantic-preserving TE Transformations
After Souffle has collected the reuse and dependence information as described in Section 5, it looks for opportunities to automatically transform the TEs to improve performance within each TE subprogram. Souffle supports both horizontal and vertical TE transformations and transforms TEs within the same subprogram. Horizontal fusion fuses branches of operators into a single kernel [33,52]; horizontal transformation in Souffle is similar and is applied to branches of independent TEs. Vertical transformation is similar to vertical fusion [37,58] and is applied to multiple consecutive TEs with a one-relies-on-one data dependence. We stress that our horizontal and vertical transformations are more flexible than the fusion strategies used by IREE and Rammer [33], which will not fuse one-relies-on-many operators into one kernel. Furthermore, the semantic-preserving transformation ensures the preservation of arithmetic operations (e.g., add, exp) while satisfying data dependence requirements. In contrast, some transformations used by other DNN optimization approaches may not preserve the semantics. For example, TASO [20] optimizes the DNN graph by subgraph substitution and might replace an add with a concat+convolution.

Horizontal transformation for independent TEs
Souffle tries to transform multiple independent TEs into a single TE to increase parallelism. Souffle first compares the output tensor shapes of the independent TEs and tries to concatenate them into a single TE. Since each TE can only produce one output tensor, Souffle concatenates the output tensors from multiple independent TEs into one; it adds predicates based on the region of the output and rewrites the TE. Subsequently, Souffle assesses whether these TEs consume the same input tensor; here the tensor reuse opportunity discussed in Sec 5.1 is examined. Consequently, the shared tensor only needs to be loaded once within the generated kernel, so both the number of kernels and the number of global memory transactions can be reduced. Figure 3 gives an example of the horizontal TE transformation, where both TEs share the same reduction variable. The outputs of the first and second TEs have shapes (4, 16) and (2, 16) and can be concatenated along the first axis into a single tensor with shape (6, 16). If the outputs of the independent TEs cannot be concatenated, Souffle adds an if_else statement to select the corresponding input tensor for the concatenated TEs, similar to Rammer [33].
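The shape arithmetic of this example can be sketched as follows, with hypothetical element functions te_a and te_b standing in for the two TEs' compute bodies:

```python
# Two independent TEs with outputs of shape (4, 16) and (2, 16) are
# rewritten as one TE producing a (6, 16) tensor; a predicate added
# during the rewrite selects which original computation to run.

def fused_te(i, j, te_a, te_b):
    """Compute one element of the concatenated (6, 16) output."""
    if i < 4:                 # predicate on the output region
        return te_a(i, j)     # rows 0..3 come from the first TE
    return te_b(i - 4, j)     # rows 4..5 come from the second TE

out = [[fused_te(i, j, lambda i, j: i + j, lambda i, j: i * j)
        for j in range(16)] for i in range(6)]
print(out[0][5], out[5][5])  # 5 (= 0+5) and 5 (= 1*5)
```

One kernel now covers both computations, and any input tensor shared by te_a and te_b only needs to be loaded once.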

Vertical transformation for one-relies-on-one TEs
Souffle vertically transforms TEs with a one-relies-on-one data dependence into one TE to reduce the number of generated kernels and reuse data in registers. This is enabled by the quasi-affine map representation (Sec 5.2). To this end, Souffle first transforms all one-relies-on-one TEs by applying the index mapping function from the child TEs to their parent TEs. It then applies the index mapping functions and creates a single semantic-preserving TE. Assume we have n TEs, TE_0, ..., TE_(n-1), where the output of TE_i is the input of TE_(i+1), and the index mapping function of TE_i is f_i. The transformed TE's mapping function from TE_(i+1) to TE_i can then be computed as the composition f_i ∘ f_(i+1). For the example in Figure 4, the index mapping functions of the three TEs can be composed into a single mathematically equivalent function, effectively reducing the number of TEs by 3×. Using the method described above, Souffle iteratively refines multiple one-relies-on-one TEs from a set of consecutive TEs until no further refinements can be found. It then applies a schedule from the compute-intensive parent TE to attach the memory-intensive one-relies-on-one TEs to the compute-intensive TE, as described in the next subsection. Compared to the hand-crafted transformation rules used by TensorRT, Apollo, and Ansor, our semantic-preserving transformation has better generalization ability.
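The composition step can be sketched as follows, with two hypothetical one-relies-on-one maps (a transpose and an index shift) standing in for the TEs in Figure 4:

```python
# Compose the index maps of consecutive one-relies-on-one TEs into a
# single map, so the chain collapses to one TE.

def compose(*maps):
    def composed(o):
        for m in maps:          # apply the maps in order
            o = m(o)
        return o
    return composed

transpose = lambda o: (o[1], o[0])        # a permutation map
shift     = lambda o: (o[0] + 1, o[1])    # e.g. a padded/offset access
single_map = compose(transpose, shift)    # one map instead of the chain
print(single_map((2, 7)))  # transpose -> (7, 2), then shift -> (8, 2)
```

Because the composition of quasi-affine maps is again quasi-affine, the collapsed TE remains analyzable by the same machinery as the originals.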

Schedule TEs
Souffle uses Ansor to generate optimized schedules for compute- and memory-intensive TEs. Note that we have already generated a schedule for compute-intensive TEs during TE program partitioning in Sec. 5.4. Souffle propagates the compute-intensive producer's schedule to memory-intensive TEs to maximize data reuse. For one-relies-on-one TEs, Souffle first schedules them based on their compute-intensive TE's tile size, then safely inlines them with their producer's compute statement. For one-relies-on-many TEs, Souffle reduces locally to reuse the producer's data in shared memory/registers, then uses atomic primitives to reduce across thread blocks.

Merging TEs Schedule
After scheduling the TEs, Souffle merges the schedules of TEs within a subprogram to create a holistic function using TensorIR [15]. It adds predicates if the launch dimensions of the TEs differ and inserts global sync primitives between TEs with a one-relies-on-many dependency. Finally, it performs several optimizations, described in the next section.

Optimizations within a Subprogram
Souffle supports two types of optimizations within a TE subprogram. The first is instruction-level optimization, which aims to overlap asynchronous GPU memory loads with arithmetic instructions. Note that this pipelined execution is scheduled across TEs; without global data dependency analysis, the optimization cannot be done. The second is to reuse tensor buffers across TEs (and potentially across multiple original operators). Souffle performs subprogram-level optimization after the TE schedules within a subprogram have been generated by the underlying scheduler (Ansor in this work). It utilizes the global computation dependency analysis results (Sec. 5) to apply the two optimizations.

Instruction-level optimization. Souffle regroups instructions within a fused subprogram containing multiple original operators to execute memory and arithmetic instructions in parallel. This is accomplished by scheduling load/store and computation instructions for pipelined execution across operator boundaries. For instance, in the BERT model discussed in Section 2.1 and Figure 1(d), the Souffle-generated schedule issues the NVIDIA instructions LDGSTS.E.BYPASS.128 and HMMA.16818.F16 in parallel, where the former copies 128 bits from GPU global memory to shared memory, and the latter computes GEMM on NVIDIA tensor cores.

Tensor reuse optimization. Souffle maximizes tensor buffer reuse across TEs with a simple software-managed cache, using a Least Recently Used (LRU) policy to replace tensor buffers (e.g., matrices and vectors) in shared memory at runtime. It scans instructions linearly; when shared memory is exhausted, it spills a buffer from shared memory to global memory and adds a memory barrier.
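A minimal sketch of such a software-managed LRU cache over shared-memory buffers (the capacity, buffer sizes, and spill hook are illustrative; the real pass operates on TensorIR buffers at compile time):

```python
# An LRU cache over named tensor buffers with a fixed shared-memory
# budget; evicted buffers are "spilled" to global memory.

from collections import OrderedDict

class SharedMemCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.buffers = OrderedDict()          # name -> size, in LRU order
        self.spilled = []                     # buffers evicted to global mem

    def access(self, name, size):
        if name in self.buffers:              # hit: reuse from shared memory
            self.buffers.move_to_end(name)
            return "hit"
        while sum(self.buffers.values()) + size > self.capacity:
            victim, _ = self.buffers.popitem(last=False)
            self.spilled.append(victim)       # spill + memory barrier here
        self.buffers[name] = size
        return "miss"

cache = SharedMemCache(capacity_bytes=96)
print(cache.access("A1", 64), cache.access("A1", 64), cache.access("B", 64))
# miss hit miss   (B evicts A1)
```

In the working example, the output of A1 stays resident for its second reader, turning what would be a global-memory round trip into a shared-memory hit.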

Implementation Details
We implemented Souffle with 10K lines of C++ and 1K lines of Python code. We integrated Souffle with TVM [12], using Ansor [57] as its schedule optimizer. However, Souffle can work with other schedulers compatible with TEs. Souffle supports element-wise operators, broadcasts, reductions (including reduce_sum, GEMM, and Conv), reorganization operators like reshape, and shuffle operators like transpose. Souffle does not support non-linear-algebra operators like TopK or Conditional. We use direct convolution, which is the default implementation of Ansor. We use the base model from [32].

Experimental Setup

DNN workloads. We evaluated Souffle on the representative and diverse DNN workloads in Table 2. These include natural language processing (BERT [14]), computer vision (Swin-Transformer [31], Swin-trans. for short), and knowledge discovery (MMoE [32]), which implements the latest mixture-of-experts DNN. They also include classic convolutional and recurrent networks like ResNeXt [55] and LSTM [18]. We use single-precision floating-point (FP32) for all operators, except for GEMM, for which we use half-precision floating-point (FP16) to use the tensor cores. We target model inference and set the batch size to one.

Competing Baselines
We compare Souffle against six strong baselines:

XLA (TensorFlow v2.10). The TensorFlow XLA compiler can fuse DNN operators like point-wise and reduction operators and performs optimizations on the fused operators. Unlike Souffle, which performs analysis and optimizations on TEs, XLA performs analysis on its high-level operators (HLO).

TensorRT (v8.2). This GPU-vendor-specific framework optimizes the inference of DNNs on NVIDIA GPUs [38].
Ansor (TVM v0.8). This state-of-the-art DNN optimizer builds upon TVM. It uses auto-tuning techniques to find good tensor schedules without requiring hand-crafted templates. Rammer (v0.4). This DNN compiler, also known as NNFusion [33], generates a spatial-temporal schedule at compile time to minimize scheduling overhead and exploit hardware parallelism through inter- and intra-operator co-scheduling.
Apollo. This represents the state-of-the-art fusion framework for inference optimization [56]. Apollo considers both memory- and compute-bound tensor operators for kernel fusion and uses hand-crafted rules to exploit parallelism between independent tensor operators. IREE (released on 30 Dec. 2022). The intermediate representation execution environment (IREE) builds upon the LLVM MLIR project [1,30]. IREE lowers DNN models to MLIR dialects to optimize model inference. It uses the linalg dialect to perform operator fusion, which supports affine loop fusion and global analysis.

Performance Report
We use NVIDIA Nsight Compute to profile each DNN model's computation latency and record performance metrics. We found that the variance across different runs is less than 2%, so we report only the geometric mean.
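For reference, the geometric mean we report can be computed as below; the run times are illustrative values, not measured results.

```python
import math

def geomean(xs):
    """Geometric mean: exp of the arithmetic mean of the logs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Illustrative latencies (ms) from repeated profiling runs; with <2%
# variance the geometric mean is a stable summary statistic.
runs = [1.22, 1.23, 1.21, 1.22]
print(round(geomean(runs), 3))
```

Unlike the arithmetic mean, the geometric mean is invariant under normalization to any baseline, which is why it is the standard way to aggregate speedups across workloads.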

Experimental Results
In this section, we first present the overall results of Souffle and the competing approaches, showing that Souffle outperforms all other schemes across the evaluated DNN models (Section 8.1). We then quantify the contribution of individual optimizations to the performance improvement for each DNN workload (Sections 8.2 and 8.3), compare Souffle with alternative schemes on selected workloads (Section 8.4), and discuss the negligible compilation overhead introduced by Souffle (Section 8.5).

Overall Performance
Table 3 gives the end-to-end execution time (in ms) of each DNN model running on an A100 GPU. Note that some compilers failed to compile and execute certain DNNs. Overall, Souffle outperforms the competing methods across all DNNs. Souffle builds upon TVM's Ansor, yet it significantly boosts the performance of the native TVM + Ansor implementation, giving an average speedup of 3.9× (up to 8.5×) over Ansor. Furthermore, it improves over NVIDIA's tuned TensorRT by 3.7× on average (up to 7.9×), with similar improvements over Rammer, Apollo, and IREE. The results demonstrate that Souffle delivers consistent and robust performance improvements across DNN workloads. Kernel- or operator-fusion techniques such as XLA, Rammer, and Apollo can surpass Ansor in certain scenarios, highlighting the importance of kernel fusion. However, these approaches can only merge a limited set of operators and lack efficient instruction-level optimizations across some operators, resulting in redundant computations.
Rammer relies on hand-crafted rules for operator fusion and can only merge sibling operators in the computation graph. It does not perform element-wise data dependence analysis or reuse tensor buffers, limiting its ability to optimize operators with shared input-output buffers. Similarly, XLA maps some computation-intensive operators (e.g., GEMM) to a BLAS library call and cannot merge such operators with others. XLA relies on hand-crafted rules for operator fusion and cannot optimize some computation patterns, such as merging two consecutive reduction operators in the BERT model.
Apollo also relies on loop-fusion rules and can only merge two reductions with the same tile size. Moreover, it does not support schedules with global synchronization, further restricting the scheduler's optimizations. IREE only supports producer-consumer types of fusion with parametric tile-and-fuse optimizations. By contrast, the horizontal and vertical transformations supported by Souffle are more flexible and can fuse operator patterns unsupported by IREE. As such, IREE cannot fuse computation-intensive operators (e.g., batch_matmul) to reduce GPU global memory accesses.
Compared to other kernel fusion techniques, Souffle identifies more data-reuse opportunities by operating on TEs, which have simple and well-defined semantics and do not rely on inflexible fusion rules. Souffle can exploit data reuse across operators with different-shaped buffers and perform instruction-level optimizations for unfused operators. These advantages are why Souffle outperforms the competing baselines.

Performance Breakdown
We conducted a series of experiments to evaluate the performance benefits of Souffle's optimizations. We gradually activated them, starting from the TVM + Ansor generated code (V0) and then adding TE horizontal transformation (V1), TE vertical transformation (V2), global synchronization via the global synchronization API (V3), and subprogram-level optimization (V4), as described in Secs. 6.1, 6.2, 6.4, and 6.5, respectively. Table 4 reports the impact of individual optimizations on inference-time reduction for each DNN model. Our horizontal and vertical TE transformation schemes benefit all DNN workloads by increasing SIMD parallelism and reducing memory accesses. Transformer-based BERT and Swin-Trans. also benefit from global synchronization and subprogram-level optimization, which enable overlapping loads with tensor-core arithmetic instructions as well as tensor buffer reuse.

Analysis of Performance Advantages
We identified two reasons for Souffle's improved performance over TensorRT and Apollo: fewer GPU kernel calls and fewer GPU memory data transfers. We use a micro-benchmark taken from EfficientNet to illustrate the performance contribution of the two optimizations. The sub-module is the building block of EfficientNet and is repeated many times with different input sizes (marked M0 to M9). This sub-module's pattern is common in many DNN models, and existing DNN frameworks fail to optimize it well. Fig. 5 shows four versions: (5a) unfused, generating one kernel per TE; (5b) fused with Ansor's fusion; (5c) Souffle's global sync, generating the whole sub-module as a single kernel but without any data reuse; and (5d) Souffle with data reuse. Fig. 6 shows the normalized speedup of the four versions, with the horizontal axis listing the different sub-modules. Global sync achieves a 1.31× speedup on average over Ansor's unfused version, with the performance improvement coming from fewer kernel calls and lightweight CUDA grid synchronization. Enabling data reuse further improves the average speedup from 1.31× to 1.84×. Souffle's reduced GPU kernel calls and increased data reuse can both significantly improve performance. However, it is non-trivial to separate their individual contributions for end-to-end models, as TE transformation and global synchronization may each reduce both kernel calls and memory accesses. We report the reduction in GPU kernel calls and GPU memory data transfers below.
Reducing GPU kernel calls. GPU kernel calls can be expensive: it takes around 2 µs to launch a kernel on an NVIDIA A100 GPU. Consider again the performance of TensorRT and Souffle when optimizing BERT. As in Sec. 2, we classify the computation kernels in BERT into compute-intensive (like GEMM) and memory-intensive kernels (like softmax). We then measure the execution latency of each kernel. Souffle is more flexible in fusing operators, which reduces the number of kernels and the kernel-invocation overhead compared to TensorRT. For example, TensorRT maps a BERT layer to 10 kernels, while Souffle can partition one layer into two kernels and perform instruction-level optimization. Souffle reduces the memory-intensive kernel latency of one BERT layer from 31.0 µs (in TensorRT) to 25.5 µs by buffering intermediate results in fast shared memory and GPU registers.
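A back-of-envelope calculation illustrates how kernel count translates into launch overhead. The per-layer kernel counts and the roughly 2 µs launch cost come from the discussion above; the 12-layer depth is an assumption (BERT-base).

```python
LAUNCH_US = 2.0   # approximate per-kernel launch cost on an A100 (assumption)
LAYERS = 12       # BERT-base encoder depth (assumption)

tensorrt_kernels = 10 * LAYERS   # ~10 kernels per BERT layer (TensorRT)
souffle_kernels = 2 * LAYERS     # 2 kernels per layer (Souffle)

saved_us = (tensorrt_kernels - souffle_kernels) * LAUNCH_US
print(saved_us)  # launch overhead avoided per inference, in microseconds
```

At batch size one, where each kernel runs for only a few microseconds, saving on the order of a hundred microseconds of launch overhead per inference is a meaningful fraction of the end-to-end latency.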
We also examined IREE's fusion performance on BERT. IREE misses two optimization opportunities: it fuses neither the GEMM and softmax operators nor several GEMM operators with each other. IREE launches 180 kernels and takes 2.22 ms to execute, while Souffle launches 24 kernels and takes 1.22 ms.

Case Study on LSTM
Following the discussion in Sec. 8.3, we conducted a case study on the LSTM model to reveal new optimization opportunities offered by Souffle, which achieves a performance improvement of 4.3× over TensorRT and 2.2× over Rammer. We compare Souffle with Rammer, the most performant baseline, as discussed in Sec. 8.3. Fig. 7 shows the fusion strategies used by Rammer and Souffle for an LSTM with 10 cells (listed vertically in Fig. 7). Each cell has its dedicated weight tensors (marked W and U in Fig. 7), hidden state (h) and output (o). In each time step t, the i-th cell performs a general matrix-vector multiplication (GEMV for short) using its weight tensors (W_i and U_i), its hidden state (h_i) and the output (o_{i-1}) from the (i-1)-th cell; it then updates its hidden state (h_i) and generates the output (o_i) for the current time step. Fig. 7 shows the fully unrolled time-step loop. The LSTM operators along each anti-diagonal are independent, i.e., no data dependency exists among them. Both Souffle and Rammer exploit this optimization opportunity, i.e., wavefront parallelism, and fuse the GEMV computations into different blocks of one kernel. With its TE-based global analysis, Souffle discovers that the weight tensors (W and U) of each LSTM cell are reused across all time steps (temporal reuse). It uses global synchronization and generates one kernel for the entire model, as shown in the right part of Fig. 7. The Rammer version, on the other hand, must load the weight tensors in every wavefront, resulting in a longer execution time than Souffle. We measured GPU global memory data transfer and pipeline utilization for the optimized LSTM. As Table 6 shows, the Souffle-optimized code reduces memory loads by orders of magnitude compared to Rammer's version (21 vs. 1911) and increases pipeline utilization for both the load/store unit (LSU) and the fused multiply-add unit (FMA).
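The wavefront grouping can be sketched as follows. This is a simplified model of the dependence structure only: cell i at time step t depends on its own state from step t-1 and on the output of cell i-1 at step t, so all (cell, step) pairs with the same i + t are independent. The real kernel maps each wavefront's GEMVs onto thread blocks with a grid synchronization between wavefronts.

```python
def wavefront_schedule(num_cells, num_steps):
    """Group LSTM (cell, time-step) GEMVs into wavefronts: the GEMV of
    cell i at step t depends on (i, t-1) and (i-1, t), so all pairs with
    the same i + t can run as blocks of a single fused kernel."""
    waves = {}
    for i in range(num_cells):
        for t in range(num_steps):
            waves.setdefault(i + t, []).append((i, t))
    return [waves[w] for w in sorted(waves)]

sched = wavefront_schedule(num_cells=10, num_steps=10)
print(len(sched))     # number of sequential wavefronts for a 10x10 grid
print(len(sched[9]))  # the widest wavefront: 10 independent GEMVs
```

Because one kernel covers all wavefronts, the per-cell weights loaded into shared memory for the first wavefront can stay resident for every later wavefront, which is exactly the temporal reuse Rammer's per-wavefront kernels cannot exploit.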

Compilation Overhead
Souffle employs Ansor and TVM for schedule search and code generation. The compilation overhead of Souffle + Ansor mainly comes from the time required to search for program schedules using the native Ansor implementation. The additional overhead introduced by Souffle involves two-level dependence analysis, model splitting, schedule tuning, and global optimization. Our measurements on the six evaluated models indicate that Souffle adds at most 63 seconds of overhead on top of Ansor, which is negligible compared to the hours Ansor requires for schedule search. This overhead can be further reduced by using a faster schedule optimizer like Roller [60], which is orthogonal to Souffle.

Discussion
Expressive power of TE. Souffle relies on the expressive power of tensor expressions, which currently do not cover all DNN operators; e.g., resize is not supported. Souffle maps these TE-unsupported operators to a computation kernel and uses the back-end operator library implementation, without fusing them with other operators. Given the active developer community of TVM, we expect this limitation to be addressed by future TVM releases. Cost model for TE program partitioning. Souffle extracts tensor information by compiling the raw TE program. This can be improved by building a cost model [53] to estimate occupancy from the TE program. Slowdown. Performance slowdowns can occur when Souffle extends the schedule from compute-intensive TEs to memory-intensive reduction TEs (discussed in Sec. 6.3). This introduces synchronization between blocks, potentially hampering parallelism for reduction TEs. A potential remedy is a cost model that decides whether fusing these TEs is beneficial.

Related work
Loop and kernel fusion. Loop fusion is commonly used to improve the performance of CPU programs [3,8,9,23]. Recent research has also used kernel fusion to optimize GPU programs by reducing data traffic to/from off-chip memory [42,50]. Various domain-specific kernel-fusion policies have been proposed for workloads such as data-center applications [54], mesh computing [6], machine-learning workloads [4] and image processing [35]. Souffle builds on these efforts, leveraging loop fusion to optimize DNN inference through compiler transformations. Operator fusion in DNN compilers. Operator fusion can enhance performance by improving data reuse and reducing off-chip data traffic. To seek out fusion opportunities, DNNFusion classifies operators and defines rules based on their classification [37]. AStitch [58] and Rammer [33] fuse independent operators to leverage inter-operator parallelism, while DeepCuts [22] uses rules to fuse kernels based on GPU hardware parameters. Apollo [56] adopts a partition-based approach to search for fusion opportunities within sub-graphs. However, these approaches rely on hand-crafted rules requiring extensive engineering effort and may miss optimization opportunities, as discussed in Sec. 2.
Jeong et al. [19] proposed a dynamic programming algorithm to decide whether to pin or fuse an activation map on DNN accelerators with a global buffer. DNNFusion [37] classifies operators based on the element-wise mappings from input to output, but it cannot fuse many-to-many operators with other many-to-many operators (like GEMM and Softmax), while Souffle can, further reducing kernel-launch overhead. Furthermore, DNNFusion lacks a global analysis of tensor-reuse opportunities and may miss the temporal and spatial data reuse that, as shown in Sec. 8.2, is critical to performance. Souffle improves upon previous operator-fusion techniques by utilizing control- and data-flow analysis on the tensor dependency graph to partition TEs into subprograms. TEs have clear and simple relations, and they can be combined to represent numerous DNN operators. Souffle leverages the well-defined semantics of TEs to perform precise data-dependence analysis for instruction scheduling and data-reuse optimization. Additionally, Souffle applies semantic-preserving transformations to refine TEs. Its optimizations generalize better because TEs can be combined to represent more complex operators. Global analysis and fusion optimization. TensorFlow XLA [27] and MLIR [43] also conduct global analysis on the input program graph. XLA applies profitability analysis on its high-level operations intermediate representation before deciding on tiling and fusion. However, XLA relies on hand-crafted heuristics to fuse operators, which can be challenging for high-level operators. For instance, XLA's fusion heuristic cannot fuse the two consecutive reduction operators in the BERT model. Moreover, as XLA operates at the operator level and maps some operators to low-level library calls, it cannot optimize across libraries. In contrast, Souffle takes a different approach by lowering high-level operators into lower-level tensor expressions (TEs), which have concise semantics. Operating on TEs rather than high-level operators enables Souffle to optimize flexibly across operator boundaries. Unlike XLA, Souffle can merge GEMM and Softmax operators and optimize across reduction operators.
The MLIR -affine-loop-fusion pass uses a slicing-based method to identify producer-consumer and sibling fusion opportunities. Souffle instead implements a lightweight, specialized global analysis on TEs, which can be easily integrated into DNN inference engines. Moreover, Souffle offers more optimization opportunities than fusion alone; for example, it enables joint optimization across multiple compute-intensive TEs in a TE subprogram and supports horizontal and vertical transformations for sibling fusion. Optimizing individual operators. Numerous compiler-based approaches exist to optimize individual operators, including TVM [12,51], XLA [27], Tiramisu [5], and TACO [25]. These compilers often represent operators in high-level forms such as TEs or linear algebra, enabling aggressive optimization through domain-specific knowledge without complex analysis. Souffle is orthogonal to these techniques.

Conclusion
We have presented Souffle, a top-down, compiler-based approach for improving DNN inference. Souffle identifies optimization opportunities across DNN operators by performing data-flow analysis on the entire tensor dependence graph built from tensor expressions. It groups tensor expressions into subprograms and performs local optimization through semantics-preserving transformations, instruction scheduling, and tensor buffer reuse. We evaluated Souffle on six DNN models using an NVIDIA A100 GPU and compared it to six state-of-the-art DNN optimizing frameworks. Souffle outperformed them all, with a speedup of up to 7.9× over TensorRT.

Figure 1 .
Figure 1. How TensorRT (a), Apollo (b) and Souffle (c) map a BERT computation graph into kernels. The Souffle optimization leads to fewer GPU memory accesses and faster execution time than TensorRT and Apollo.

Fig. 1(a) and (b) show the suboptimal kernel fusion strategy employed by TensorRT and Apollo for reduction operators.

Figure 4 .
Figure 4. Example of vertical TE transformation.

Figure 7 .
Figure 7. How Rammer (a) and Souffle (b) map an LSTM graph into computation kernels.
Selection of partitioning points. We only consider compute-intensive operators as candidate partitioning points. Compute-intensive TEs typically use much more shared memory and registers than memory-intensive TEs. Excessive use of shared memory and registers drives per-block resource usage up, limiting the maximum number of blocks per wave and making global synchronization infeasible. Souffle transforms memory-intensive TEs and reuses their compute-intensive producer TE's schedule to achieve better data reuse (Sec. 6). Getting required resources. Souffle obtains the kernel launch dimensions and the register/shared-memory occupancy from the TE schedule produced by the schedule optimizer (Ansor in this work). Partitioning algorithm. Souffle ensures that resource constraints are satisfied during TE program partitioning using an analytical model. Given a GPU with a total of R registers (or bytes of shared memory), Souffle extracts the maximal launch dimension B and the maximal register/shared-memory occupancy r over all compute-intensive TEs in the current TE subprogram. It then checks whether the constraint B * r < R is satisfied for all selected TEs within the subprogram. Souffle uses a greedy algorithm to partition the TE program, starting with an empty subprogram P_0 and using breadth-first search (BFS) to add each TE t_j to the current subprogram P_k. If adding t_j to P_k violates the constraint, Souffle creates a new subprogram P_{k+1} containing t_j and repeats the process until all TEs have been allocated to a subprogram.
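The greedy BFS partitioning can be sketched in Python as follows. The TE names, launch dimensions, per-block register counts, and the budget are all hypothetical; the real implementation reads these figures from Ansor-generated schedules.

```python
from collections import deque

def partition(tes, deps, budget):
    """tes: {name: (launch_dim, regs_per_block)}; deps: {name: [predecessors]}.
    Visit TEs in BFS order over the dependency graph and greedily grow a
    subprogram; start a new one whenever max(launch_dim) * max(regs_per_block)
    over the trial subprogram would reach the resource budget."""
    roots = [n for n in tes if not deps.get(n)]
    order, seen, queue = [], set(roots), deque(roots)
    while queue:                       # BFS traversal of the TE graph
        n = queue.popleft()
        order.append(n)
        for m in tes:                  # successors: TEs that depend on n
            if n in deps.get(m, []) and m not in seen:
                seen.add(m)
                queue.append(m)
    subprograms, cur = [], []
    for n in order:
        trial = cur + [n]
        dim = max(tes[t][0] for t in trial)
        regs = max(tes[t][1] for t in trial)
        if cur and dim * regs >= budget:   # constraint violated: split here
            subprograms.append(cur)
            cur = [n]
        else:
            cur = trial
    subprograms.append(cur)
    return subprograms

# Hypothetical subprogram: two GEMMs around a softmax, with a made-up budget.
tes = {"gemm1": (108, 256), "softmax": (108, 64), "gemm2": (216, 256)}
deps = {"softmax": ["gemm1"], "gemm2": ["softmax"]}
print(partition(tes, deps, budget=40_000))
```

Here gemm1 and softmax fit together under the budget, but adding gemm2 would push the product of the maxima past it, so the sketch starts a second subprogram at gemm2, mirroring the constraint check described above.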

Table 2 .
DNN models and datasets used in our evaluation.

Table 3 .
End-to-end model runtime (ms); lower is better.

Table 4 .
Execution time (ms) with Souffle's individual optimizations.

Table 5 .
The number of GPU kernel calls and global memory data transfer size (M bytes) of the resulting code.

Table 5
Souffle reduces the number of kernels from 120 and 240 (generated by TensorRT and Apollo, respectively) to 24. Similar kernel-call reductions are observed in other DNN workloads. Operator fusion is one of the key features of XLA. Nonetheless, XLA leverages libraries such as cuBLAS to execute compute-intensive operators. Consequently, it faces limitations in fusing compute-intensive operators with memory-intensive counterparts, hindering the potential reduction in kernel count. For instance, XLA generates 6 custom calls to invoke cuBLAS to run the GEMM operators of one BERT layer, while Souffle seamlessly propagates the schedule of compute-intensive TEs to memory-intensive TEs and generates a single kernel. Reducing GPU memory data transfers. GPU global-memory data transfers are known to be expensive, so it is desirable to reduce the amount of data transferred from global memory. To do so, Souffle maximizes tensor buffer reuse through TE program partitioning (Sec. 5.4) and TE transformation (Sec. 6). Table 5 also compares the amount of GPU global-memory data transfer measured by Nsight Compute for TensorRT, Apollo, and Souffle. The Souffle-generated code incurs significantly fewer data transfers than TensorRT and Apollo. For example, on BERT, Souffle reduces memory transactions from 361.8M and 880.5M bytes (loaded by TensorRT and Apollo, respectively) to 226.8M bytes.

Table 6 .
GPU performance counter values for LSTM optimized by Rammer and Souffle.
Reusing dynamic-shaped tensors. Certain DNN operators have tensor shapes that are unknown at compile time. Our current implementation does not support reusing tensors of dynamic shapes. To address this, we can generate multiple versions of a kernel and choose the appropriate one based on shape information available at execution time. Fusion in DL training. DL compilers like TensorFlow XLA also enable operator fusion in training (forward inference and backward parameter updates). Our TE-based transformations can be integrated into DL compilers to accelerate forward and backward passes during training. However, intermediate tensors must be kept in global memory during DL training for gradient-based optimizers like Adam [24], restricting operator-fusion opportunities. Our main focus is optimizing model inference after DNN training; support for TE transformation in DL training is left for future work.