TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs

Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems. Since the computation pattern is sparse and irregular, specialized high-performance kernels are required. Existing GPU libraries offer two dataflow types for sparse convolution. The gather-GEMM-scatter dataflow is easy to implement but not optimal in performance, while the dataflows with overlapped computation and memory access (e.g., implicit GEMM) are highly performant but have very high engineering costs. In this paper, we introduce TorchSparse++, a new GPU library that achieves the best of both worlds. We create a highly efficient Sparse Kernel Generator that generates performant sparse convolution kernels at less than one-tenth of the engineering cost of the current state-of-the-art system. On top of this, we design the Sparse Autotuner, which extends the design space of existing sparse convolution libraries and searches for the best dataflow configurations for training and inference workloads. Consequently, TorchSparse++ achieves 2.9x, 3.3x, 2.2x and 1.7x measured end-to-end speedup on an NVIDIA A100 GPU over state-of-the-art MinkowskiEngine, SpConv 1.2, TorchSparse and SpConv v2 in inference; and is 1.2-1.3x faster than SpConv v2 in mixed precision training across seven representative autonomous driving benchmarks. It also seamlessly supports graph convolutions, achieving 2.6-7.6x faster inference speed compared with state-of-the-art graph deep learning libraries.


INTRODUCTION
Sparse convolution [12, 18] plays a crucial role in a variety of cutting-edge applications, including augmented/virtual reality (AR/VR), autonomous driving, and recommendation systems. For instance, in advanced driver assistance systems (ADAS) and autonomous driving technology, data is collected from 3D sensors in the form of 3D point clouds. These point clouds are exceptionally sparse, with up to 99.99% spatial sparsity. In such cases, employing dense 3D convolutions for point cloud processing becomes inefficient. Likewise, social media graphs, like those found on platforms such as Twitter, exhibit even greater sparsity. As an illustration, the adjacency matrix of Twitter's social graph contains only a minuscule fraction, approximately 0.000214%, of the possible connections [56]. Therefore, there is an urgent need for efficient inference and training systems for these sparse workloads.
Sparse convolution modifies the definition of regular convolution by only performing computation at a sparse set of output locations rather than over the entire feature map. It is arguably the most important building block for almost all state-of-the-art 3D perception models (e.g., 3D semantic segmentation [10, 31, 41], 3D object detection [1, 6, 8, 17, 50, 51, 53, 58], 3D reconstruction [9], multi-sensor fusion [7, 27, 30], end-to-end navigation [29]). It also exhibits a computation pattern similar to (relational) graph convolutions [19, 36]. Despite achieving dominant accuracy, the sparse and irregular nature of sparse convolution makes it hard to process on GPUs, and there is no vendor library support. Dedicated libraries [18, 21, 40, 49, 50] with specialized high-performance kernels, or even specialized hardware accelerators [14, 15, 28], are required for sparse convolution. As a result, many industrial driving-assistance solutions still prefer pillar-based models [25], which flatten LiDAR points onto the BEV space and process them with a 2D CNN. These approaches cannot take full advantage of the 3D geometry in LiDAR data and tend to have much worse accuracy.

Several pioneering implementations of sparse convolution have adopted different dataflows for this operator. For instance, SparseConvNet [18] and SpConv v1 [50] use the vanilla gather-GEMM-scatter dataflow. TorchSparse [40] improves upon this paradigm by fusing memory operations and adaptively grouping computations into batches to improve device utilization. Dataflows based on gather-scatter can be implemented using vendor libraries with relative ease. However, they are fundamentally restricted in performance due to the inability to overlap memory access and computation. MinkowskiEngine [12] proposes the fetch-on-demand dataflow, which is further optimized by PCEngine [21]. Recently, SpConv v2 [49, 50] has adapted the implicit GEMM dataflow for dense convolution to the sparse domain, achieving state-of-the-art performance on real-world workloads. Nevertheless, implicit GEMM, the best representative of these memory-computation overlapped dataflows, is extremely hard to implement: the metaprogrammer for SpConv v2 has more than 40k lines of code, making it hard for the community to further improve upon it.
To address the significant challenge of achieving both ease of implementation and state-of-the-art performance, we present TorchSparse++ (Figure 1), a high-performance GPU library that combines the best of both worlds through the Sparse Kernel Generator and the Sparse Autotuner. Tackling a fundamentally sparse and dynamic workload, we propose a general method to adapt existing tensor compilers that are optimized for dense and static workloads, unlocking their potential to generate kernels that can deal with sparsity and variable workload shapes. On top of the generated kernels, we further extend the design space of existing point cloud libraries. We design a Sparse Autotuner to efficiently search for the best dataflow configurations through group-based tuning for a diverse set of workloads within the enlarged design space. The results of our Sparse Autotuner challenge the conventional design wisdom of using the amount of computation, DRAM access, or even the total runtime of computation kernels as indicators of end-to-end performance.

Figure 2: Sparse convolution (Equation 1) on $\Delta^2(3)$: computation is performed only on nonzero inputs.

BACKGROUND AND MOTIVATION
Without loss of generality, we use point cloud workloads to illustrate the computation pattern of sparse convolution. A point cloud sparse tensor can be defined as an unordered set of points with features $\{(p_k, x_k)\}$, where $p_k \in \mathbb{Z}^D$ is the quantized coordinate of the $k$-th point in the $D$-dimensional space and $x_k \in \mathbb{R}^C$ is its $C$-dimensional feature vector. Coordinate quantization is done through $p_k = \lfloor p_k^{(raw)} / v \rfloor$, where $v$ is the voxel size vector. A unique (deduplication) operation is further applied to all quantized coordinates. For example, in CenterPoint [53], the point clouds on Waymo [38] are quantized using $v = [0.1\,\text{m}, 0.1\,\text{m}, 0.15\,\text{m}]$. This means that we will only keep one point within each $0.1\,\text{m} \times 0.1\,\text{m} \times 0.15\,\text{m}$ grid.
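As a concrete illustration, the quantization step can be sketched in a few lines of NumPy. This is a minimal sketch under the assumption that an arbitrary point is kept per voxel; `voxelize` and its signature are illustrative, not the library's API.

```python
import numpy as np

def voxelize(points_raw: np.ndarray, voxel_size: np.ndarray):
    """Quantize raw point coordinates and deduplicate them.

    points_raw: (N, D) float array of raw coordinates p_k^(raw).
    voxel_size: (D,) voxel size vector v.
    Returns integer coordinates (M, D) with one point kept per voxel,
    plus the indices of the kept points.
    """
    # p_k = floor(p_k_raw / v): map each point to its voxel grid cell.
    coords = np.floor(points_raw / voxel_size).astype(np.int32)
    # "Unique" operation: keep a single point per occupied voxel.
    _, kept = np.unique(coords, axis=0, return_index=True)
    return coords[kept], kept

# Example with the CenterPoint/Waymo setting: v = [0.1m, 0.1m, 0.15m].
pts = np.random.rand(100000, 3) * np.array([100.0, 100.0, 6.0])
coords, kept = voxelize(pts, np.array([0.1, 0.1, 0.15]))
```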

Definition of Sparse Convolution
Following the notations in [40], we define the $D$-dimensional neighborhood with kernel size $K$ as $\Delta^D(K)$ (e.g., $\Delta^2(3) = \{-1, 0, 1\}^2$, illustrated in Figure 2). Sparse convolution on the $j$-th output point is defined as:

$$x_j^{out} = \sum_{\delta \in \Delta^D(K)} \sum_{k} \mathbb{1}\,(p_k = s \cdot p_j + \delta)\; x_k^{in} \, W_\delta \qquad (1)$$

where $\mathbb{1}(\cdot)$ is a binary indicator, $s$ is the stride, and $W_\delta \in \mathbb{R}^{C_{in} \times C_{out}}$ corresponds to the weight matrix for kernel offset $\delta \in \Delta^D(K)$.
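To make the definition concrete, a direct (and deliberately unoptimized) reference implementation of Equation 1 might look as follows. This is a minimal sketch: the coordinate hash table, function name, and dictionary-of-weights layout are our own illustrative choices, not TorchSparse++ data structures.

```python
import itertools
import numpy as np

def sparse_conv_reference(coords_in, feats_in, coords_out, weights, K=3, stride=1):
    """Naive reference for Equation 1 with D = 3.

    coords_in:  (N_in, 3) int array of input coordinates p_k.
    feats_in:   (N_in, C_in) array of features x_k^in.
    coords_out: (N_out, 3) int array of output coordinates p_j.
    weights:    dict mapping each offset delta (a 3-tuple) to a
                (C_in, C_out) weight matrix W_delta.
    """
    c_out = next(iter(weights.values())).shape[1]
    out = np.zeros((len(coords_out), c_out), dtype=feats_in.dtype)
    # Hash input coordinates for O(1) neighbor lookups.
    table = {tuple(map(int, p)): k for k, p in enumerate(coords_in)}
    r = K // 2
    for delta in itertools.product(range(-r, r + 1), repeat=3):
        W = weights[delta]
        for j, q in enumerate(coords_out):
            # Indicator 1(p_k = s * p_j + delta): accumulate only if the
            # neighbor actually exists in the sparse input.
            k = table.get(tuple(int(c) for c in stride * q + np.asarray(delta)))
            if k is not None:
                out[j] += feats_in[k] @ W
    return out
```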

Sparse Convolution Dataflows on GPUs
Current implementations of sparse convolution on GPUs can be categorized into three distinct dataflows (Figure 3). The first is the gather-GEMM-scatter approach, which is weight-stationary and was inspired by early explicit im2col attempts [23] at convolution implementation. The second is the fetch-on-demand approach, a kernel-fused version of gather-GEMM-scatter. Finally, the implicit GEMM approach is an output-stationary alternative inspired by its dense counterpart [11].

Gather-GEMM-Scatter Dataflow.
Early sparse convolution implementations utilized a gather-GEMM-scatter dataflow [18, 50]. This dataflow is weight-stationary and features an outer host loop over the $K^D$ kernel offsets. For each offset $\delta \in \Delta^D(K)$, we compute maps $M_\delta = \{(p_i, q_j) \mid q_j = p_i + \delta\}$, as shown in Figure 4. We gather all input features $x_i^{in}$ appearing in $M_\delta$, resulting in a $|M_\delta| \times C_{in}$ matrix in DRAM, and multiply it by the weight $W_\delta \in \mathbb{R}^{C_{in} \times C_{out}}$. Finally, we scatter the results back to the output positions according to $M_\delta$. For example, if $M_{-1,-1} = \{(p_0, q_1), (p_4, q_5)\}$, we gather $x_0^{in}$ and $x_4^{in}$, multiply them by $W_{-1,-1}$, and scatter the results back to $x_1^{out}$ and $x_5^{out}$. A variant of this dataflow [40] reduces both computation and data movement time by fusing and reordering memory accesses and grouping computation for different weights.
Gather-GEMM-scatter is straightforward to implement. Following feature gathering, computation for each offset $\delta$ involves a dense matrix multiplication, which can be handled by existing vendor libraries like cuBLAS and cuDNN. Only the scatter and gather operations need to be optimized in CUDA. However, this dataflow is fundamentally inefficient due to the lack of overlap between computation and memory access, as illustrated in Figure 3a,b. It is thus impossible to hide data orchestration latency with pipelining.
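A minimal NumPy sketch of this dataflow is shown below; the map construction is assumed to be done beforehand (real systems build $M_\delta$ on the GPU), and `gather_gemm_scatter` is an illustrative name rather than a library function.

```python
import numpy as np

def gather_gemm_scatter(feats_in, maps, weights, n_out):
    """Weight-stationary gather-GEMM-scatter dataflow.

    maps:    dict mapping offset delta -> (|M_delta|, 2) int array of
             (input index, output index) pairs, i.e. M_delta in Figure 4.
    weights: dict mapping offset delta -> (C_in, C_out) matrix W_delta.
    """
    c_out = next(iter(weights.values())).shape[1]
    out = np.zeros((n_out, c_out), dtype=feats_in.dtype)
    for delta, m in maps.items():          # host loop over K^D offsets
        if len(m) == 0:
            continue
        buf = feats_in[m[:, 0]]            # gather: (|M_delta|, C_in) DRAM buffer
        partial = buf @ weights[delta]     # dense GEMM (cuBLAS in real systems)
        np.add.at(out, m[:, 1], partial)   # scatter-accumulate to outputs
    return out
```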
Fetch-On-Demand Dataflow.
The gather-GEMM-scatter implementation requires three separate CUDA kernel calls in each host loop iteration over $\delta$. An alternative fetch-on-demand dataflow [12, 50] (named by [40]) merges the gather, matrix multiplication, and scatter kernel calls into a single CUDA kernel. Instead of materializing the $|M_\delta| \times C_{in}$ gather buffer in DRAM, it fetches $\{x_i^{in} \mid (p_i, q_j) \in M_\delta\}$ on demand into the L1 shared memory, performs the matrix multiplication in on-chip storage, and directly scatters the partial sums (residing in the register file) to the corresponding outputs $\{x_j^{out} \mid (p_i, q_j) \in M_\delta\}$ without first instantiating them in a DRAM scatter buffer. Hong et al. [21] further improve the vanilla fetch-on-demand dataflow by introducing block fusion, where the sequential host loop over $\delta$ is converted to a parallel thread block dimension. As such, the computation for all $\delta$s is merged into a single kernel. Similar to gather-GEMM-scatter (without the adaptive grouping in [40]), the fetch-on-demand dataflow has zero redundant computation. It further overlaps computation with memory access and saves DRAM writes to gather and scatter buffers.
However, it cannot save any DRAM writes to the final output tensor, which means $\sum_\delta |M_\delta| \times C_{out}$ write-back traffic, 4-10× larger than the theoretical optimum in real workloads, since each point typically has 4-10 neighbors. Furthermore, the block-fused fetch-on-demand dataflow [21] suffers from write-back contention between different threads. For example, both $W_{-1,0}$ and $W_{-1,1}$ in Figure 4 may attempt to write back to $x_3^{out}$. Therefore, it is necessary to introduce atomic operations to serialize all DRAM writes to the same location. Since the gather and scatter operations are now combined into the GEMM, the entire computation kernel in the fetch-on-demand dataflow must be implemented in CUDA. This is more complex than the gather-GEMM-scatter approach.
Implicit GEMM Dataflow.
Similar to fetch-on-demand, implicit GEMM overlaps computation with memory access (Figure 3). This allows us to hide the memory latency through pipelining. Like im2col in 2D convolution, an implicit GEMM implementation is output-stationary, so it achieves the theoretical minimum DRAM write-back traffic. However, despite having lower DRAM traffic than fetch-on-demand, implicit GEMM incurs non-negligible redundant computation. As shown in Figure 5, we assume that each warp contains four threads. All GPU threads within a warp execute in lockstep: whenever one thread has a non-empty neighbor at weight $\delta$, all threads in the warp will either perform computation or waste cycles for that weight. This leads to 34 redundant MACs in Figure 5, which is even more than the 22 effective MACs in this example.
To address this issue, SpConv v2 excludes unsorted implicit GEMM from its design space and utilizes bitmask sorting to minimize computation overhead. Following the approach taken by DSTC [45], each output point is assigned a $K^D$-dimensional bitmask that indicates the presence of its neighbors. These bitmasks are treated as numbers and sorted, and the order of computation for different outputs is adjusted accordingly. For instance, warp 0 calculates $x_{0-4}^{out}$ in Figure 5, but it calculates $x_{4,5,0,2}^{out}$ in Figure 6b instead. Thanks to sorting, the computation overhead is reduced from 34 MACs to 26 MACs. In practical applications, sorting can reduce redundant computation by up to 3×, but it remains unclear whether this reduction translates into proportional speedups.
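The bitmask sorting step can be sketched as follows. This is a simplified illustration of the reordering idea on an $N_{out} \times K^D$ neighbor-index matrix in which $-1$ marks an empty neighbor (the output-stationary map formalized in Section 3.1), not SpConv v2's actual kernel.

```python
import numpy as np

def sort_outputs_by_bitmask(M):
    """Reorder outputs so that points with similar neighbor patterns
    are processed by the same warp.

    M: (N_out, K^D) int map matrix; M[i, j] = index of the j-th neighbor
       of output i, or -1 if that neighbor is empty.
    Returns the permuted map and the permutation.
    """
    kd = M.shape[1]
    # Pack the K^D-dimensional presence pattern into one integer bitmask.
    bits = (M >= 0).astype(np.int64)
    mask = np.zeros(len(M), dtype=np.int64)
    for j in range(kd):
        mask = (mask << 1) | bits[:, j]
    # Treat bitmasks as numbers and sort (descending, stable).
    order = np.argsort(-mask, kind="stable")
    return M[order], order
```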

Motivation
As mentioned above, gather-GEMM-scatter is easy to implement but has poor performance. The more performant dataflows with overlapped computation and memory access cannot be implemented with the help of vendor libraries. Implementing the state-of-the-art implicit GEMM dataflow alone is a daunting task, as demonstrated by the SpConv v2 authors, who had to painstakingly re-implement the entire CUTLASS framework from scratch with a custom Python-based template metaprogrammer [48]. The resulting code base has over 40,000 lines of code, which increases the risk of errors for developers. It also makes it challenging for the community to explore a wider design space for sparse point cloud convolution kernels, hindering further performance improvements.
Therefore, in TorchSparse++, we first demonstrate in Section 3 that highly efficient dataflows with overlapped computation and memory access can be generated at a relatively low engineering complexity (comparable to implementing gather-GEMM-scatter). With the efficient kernel generator as a cornerstone, we further showcase in Section 4 that the design space for sparse point cloud convolution can be significantly extended, and that within this vast space there exist solutions that are up to 1.7× faster in inference and 1.3× faster in training than the incumbent state of the art. Tackling a fundamentally sparse workload, we also challenge traditional thinking on dense GPU kernel design. Our research reveals that typical first-order performance indicators, such as total computation, DRAM access, or even the total runtime of all sparse convolution computation kernels, cannot accurately reflect the end-to-end runtime of sparse point cloud workloads. This is because sparse workloads require expensive mapping operations. On top of this observation, we will further demonstrate that end-to-end optimal dataflows could sometimes choose configurations with up to 6× computation overhead and 4× larger DRAM footprint.

Figure 7: We introduce the Sparse Kernel Generator, a code generator that integrates on-chip MMA subroutines from [4] directly at the source code level, unlocking the potential of using dense, fixed-shape tensor compilers to generate programs for sparse, dynamic-shape workloads. Gray: constant code; red: fixed metaprogramming template; blue: generated automatically by an existing tensor compiler for each tile size.

SPARSE KERNEL GENERATOR
In this section, we introduce the Sparse Kernel Generator, a metaprogrammer that can efficiently generate sparse convolution GPU kernels. Existing metaprogrammers, such as TVM [4], are designed to generate optimized GPU computing schedules for dense, fixed-shape workloads. However, point cloud workloads are naturally sparse and have dynamic shapes.

Dense to Sparse Adaptation
Leveraging the information from Section 2, we establish the relationship between sparse convolution and dense GEMM kernels, as summarized in Table 1. We show that the fetch-on-demand and implicit GEMM dataflows, with their overlapped memory access and computation, can be seen as generalized GEMM kernels with sparse DRAM loading and write-back iterators. Take implicit GEMM as an example: we start from its equivalent-sized dense GEMM workload in Section 2.2.3. We notice that position $(i, j)$ in $A_{im2col\text{-}in}$ is mapped to position $(M_{i,\, j / C_{in}},\, j \,\%\, C_{in})$ in $X^{in}$. Here, $M \in \mathbb{Z}^{N_{out} \times |\Delta^D(K)|}$ is the output-stationary representation of the maps defined in Section 2.2.1. For the $i$-th output point, if its $j$-th neighbor is non-empty, then $M_{i,j}$ is the index of this neighbor; otherwise $M_{i,j} = -1$. For example, in Figure 5, $M_{2,3} = 1$ since the fourth neighbor of $x_2^{out}$ is $x_1^{in}$ (here we assume indices start from 0). By introducing this one level of indirect addressing, we can easily transition from a dense GEMM to a sparse implicit GEMM when loading data from DRAM to L1 shared SRAM. Since the DRAM→L1 memory access to the second operand (the weights) is dense, one can reuse the CUDA code segment for second-operand loading in dense GEMM. Based on this formulation, as in Figure 7, a sparse convolution kernel can then be decomposed into three parts. The gray code is always constant. The blue code depends on the tile sizes and can be automatically generated by existing compilers [4]. The red code cannot be generated by existing dense tensor compilers due to sparsity, but it can be generated from a fixed template that only takes tiling sizes as input parameters. Consequently, we only need to manually implement the short red code template and a TensorIR [13] template that outputs the blue on-chip MMA subroutine, which together take only hundreds of lines of code (orders of magnitude cheaper than the SpConv v2 code generator).
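The indirection can be made concrete with a short sketch: a plain (scalar) GEMM loop in which only the addressing of the first operand differs from dense GEMM. This is a readability-oriented Python stand-in for the tiled CUDA template in Figure 7; names such as `implicit_gemm_load` are illustrative.

```python
import numpy as np

def implicit_gemm_load(M, X_in, i, j, c_in):
    """Load A_im2col-in[i, j] through one level of indirect addressing.

    In dense GEMM we would read A[i, j] directly; here position (i, j)
    maps to (M[i, j // c_in], j % c_in) in the input feature tensor.
    """
    nbr = M[i, j // c_in]          # which neighbor supplies this column block
    if nbr < 0:                    # empty neighbor: contributes zero
        return 0.0
    return X_in[nbr, j % c_in]

def implicit_gemm(M, X_in, W_flat):
    """Reference implicit GEMM: out = A_im2col-in @ W_flat.

    W_flat: (K^D * C_in, C_out) flattened weights. Loading it is dense,
    so that code path is unchanged from a dense GEMM.
    """
    n_out, kd = M.shape
    c_in = X_in.shape[1]
    out = np.zeros((n_out, W_flat.shape[1]), dtype=X_in.dtype)
    for i in range(n_out):
        for j in range(kd * c_in):                 # the long k loop
            out[i] += implicit_gemm_load(M, X_in, i, j, c_in) * W_flat[j]
    return out
```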
For simplicity, we did not visualize performance optimization techniques such as double buffering and pipelining in Figure 7. However, these techniques do not impact the design of our code generator. Similar analysis and code transformation can also be applied to the fetch-on-demand dataflow.

Static to Dynamic Adaptation
Thanks to the adaptation described in Section 3.1, we can now easily implement sparse convolutions in dataflows with overlapped computation and memory access. However, the simplicity of the code generator comes at the cost of a reduced design space. Our Sparse Kernel Generator only allows the tiling sizes to be tuned, while leaving most dimensions of the tensor program design space fixed (e.g., the order and split of the loop nests). Fortunately, we argue that such a reduced design space does not compromise performance. We present an idealized experiment in Figure 8. We manually traverse all possible tile sizes for different layers in MinkUNet [12] on SemanticKITTI [2] and apply compile-time constant folding to maximize performance. We benchmark the resulting sparse kernel with the lowest latency against cuBLAS, which runs an equivalent-sized GEMM problem due to its lack of sparsity support. It turns out that we can achieve >100% cuBLAS utilization on average by tuning tile sizes alone. Notably, for the last workload, the equivalent-sized dense GEMM problem runs at ≈90% device utilization on RTX 3090. If we ignore redundant computation (Figure 5), it is safe to assert that extending the design space beyond tile sizes will not significantly improve the final performance on this workload.
Despite achieving encouraging results in the idealized experiment, it remains challenging to transfer this performance to real systems. Unlike dense workloads, each sparse point cloud sample has a different shape in terms of the number of points. Precompiling constant-folded kernels for all possible workloads, as is done by TVM and TensorRT in the dense domain, is impossible for us. Naively unfolding the constants in fixed-shape kernels and reverting them back to workload shape parameters degrades performance by up to 1.7×, which totally undermines the good results achieved in Figure 8. Worse still, the first red instruction in Figure 7 now requires an explicit boundary check in flexible-shape kernels, which brings up to 1.35× performance overhead as well.
To this end, we present two simple yet effective strategies to address these two performance roadblocks.
We first pinpoint that the slow addressing of $X^{in}$ is the reason why constant unfolding ruins performance. Unlike in dense GEMM, accessing $X^{in}$ requires two inefficient division and modulo operations with $C_{in}$ as an operand, which are necessary just for addressing. This hurts efficiency since $C_{in}$ is now a runtime value stored in the register file and the resulting address computation has a latency no shorter than an L1 access on GPUs. Worse still, accesses to $X^{in}$ are located at the innermost level of the long $k$ loop (of length $|\Delta^D(K)| \times C_{in}$, ranging from 1728 to 6912 in Figure 8). Fortunately, we notice that most of the addressing computation is independent of the innermost loop variable ldA in Figure 7. Therefore, it is possible for us to lift these loop invariants out of the loop. For real tiling sizes with LD_A_THR = 4 and 8, this reduces the addressing cost by at least 4-8×. We further analyze the template and perform loop invariant hoisting wherever possible. Ablation studies in Section 6.2 show that addressing simplification fully closes the up-to-1.7× constant unfolding overhead.
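The effect of loop invariant hoisting can be illustrated on the addressing logic alone. The following Python stand-in for the CUDA template is a sketch under the assumption that the LD_A_THR consecutive loads of one thread fall within a single $C_{in}$ column block, which is what makes the division and modulo loop-invariant.

```python
# Naive dynamic-shape addressing: the division and modulo by the runtime
# value c_in execute on every iteration of the innermost ldA loop.
def address_naive(M, i, j_base, ld_a_thr, c_in):
    addrs = []
    for ldA in range(ld_a_thr):
        j = j_base + ldA
        addrs.append((M[i][j // c_in], j % c_in))  # two slow integer ops per load
    return addrs

# Hoisted version: assuming the ld_a_thr loads stay inside one c_in column
# block, j // c_in (hence the neighbor row) and j_base % c_in are invariant
# across the ldA loop, so the expensive ops are computed once and reused.
def address_hoisted(M, i, j_base, ld_a_thr, c_in):
    row = M[i][j_base // c_in]   # loop invariant, hoisted out of the ldA loop
    col0 = j_base % c_in
    return [(row, col0 + ldA) for ldA in range(ld_a_thr)]
```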
Likewise, among all boundary checks in the dynamic-shape kernel, the one guarding map accesses within the innermost ldA loop is the most time-consuming. Although loop invariant hoisting does not apply in this case, we can solve this issue by padding the first dimension of map to a multiple of cta_M, as sketched below. With this simple modification, no boundary check on map accesses in Figure 7 is required, since every access is guaranteed to stay within bounds. With that reduced control flow overhead, we close the final 1.14-1.35× performance gap between fixed- and dynamic-shape kernels.
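Padding itself is a one-line preprocessing step; a sketch (with an illustrative `pad_map` helper) follows. We assume the output buffer is padded correspondingly so that the extra rows are harmless.

```python
import numpy as np

def pad_map(M, cta_m, fill=-1):
    """Pad the first dimension of the map to a multiple of cta_m so that
    every thread block reads in-bounds rows and the per-access boundary
    check in the innermost ldA loop can be removed.  Padded rows are all
    -1 (empty neighbors), so their loads contribute zero."""
    n = M.shape[0]
    n_pad = (-n) % cta_m
    if n_pad == 0:
        return M
    return np.vstack([M, np.full((n_pad, M.shape[1]), fill, dtype=M.dtype)])
```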

SPARSE AUTOTUNER
Based on the simple yet powerful Sparse Kernel Generator, we present the Sparse Autotuner. It first significantly enlarges the design space of existing libraries (illustrated in Figure 9) and then applies group-based configuration tuning across this enlarged space.

Design Space Augmentation
Thanks to the simplicity of the Sparse Kernel Generator, we can easily expand our design space. Since the generator can produce fetch-on-demand kernels, we can effortlessly incorporate this dataflow in our designs. Besides, the number of splits (Figure 10) is an important tunable dimension of the implicit GEMM dataflow that was previously overlooked. Similar to the SplitK technique [24] in dense GEMM kernel design, one can split the sequential $k$ loop in Figure 7 into $s$ parts. By doing so, each split (whose $k$ loop is now $s\times$ shorter) can compute in parallel and write to a separate DRAM buffer. These partial sums are later reduced by a summation kernel to produce the final result. We also reorder the computation within each split following Figure 6, which involves argsorting the $s$ individual bitmasks and reordering the map accordingly. For example, after reordering, the first row calculates part of $x_0^{out}$ and $x_3^{out}$, while the full feature of $x_0^{out}$ is calculated in the 1st, 4th and 6th rows by two thread blocks collaboratively. As such, there are more common zero neighbors within each thread block, and the redundant computation is further reduced from 26 MACs in Figure 6 to 22 MACs in Figure 10. When integrating support for arbitrary-split implicit GEMM, we notice that it is beneficial to reorder the map in an offline manner, for a reason similar to Section 3.2.
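In spirit, the mask-split technique mirrors SplitK; the following sketch (our own simplification, omitting the bitmask-based reordering and assuming $s$ divides $K^D$) shows the parallel partial-sum buffers and the final reduction.

```python
import numpy as np

def split_implicit_gemm(M, X_in, W_flat, s):
    """Split the sequential k loop into s parts that compute in parallel.

    Each split covers K^D / s kernel offsets, writes its partial sums to
    a separate DRAM buffer, and a final reduction sums the buffers.
    Assumes s divides K^D for brevity.
    """
    n_out, kd = M.shape
    c_in = X_in.shape[1]
    c_out = W_flat.shape[1]
    seg = kd // s
    partial = np.zeros((s, n_out, c_out), dtype=X_in.dtype)
    for p in range(s):                        # parallel across splits on GPU
        for j in range(p * seg, (p + 1) * seg):
            nbr = M[:, j]
            valid = nbr >= 0                  # skip empty neighbors
            W_j = W_flat[j * c_in:(j + 1) * c_in]
            partial[p, valid] += X_in[nbr[valid]] @ W_j
    return partial.sum(axis=0)                # summation (reduction) kernel
```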
Figure 12: Group-based autotuning: layers using the same maps are assigned to the same group. After group partition, we exhaustively traverse all choices in our design space in a group-by-group manner and select the group configuration that leads to the lowest end-to-end latency.

Figure 13: Reusing the tuner in Figure 12: (a) binding fwd-dgrad (for low-end devices); (b) binding dgrad-wgrad (for high-end devices).

Conventionally, dense GPU kernel design is often guided by first-order performance approximations (e.g., amount of computation and DRAM footprint). Following these proxies, it seems reasonable to eliminate split = 0 (the unsorted implicit GEMM in Figure 5) due to its large redundant computation. Splits larger than 2 should also be eliminated since they incur much larger DRAM write-back traffic. In fact, such premature optimizations lead to the restricted design space of SpConv v2. However, we argue in Figure 11 that it is beneficial to have a larger design space that includes many first-order suboptimal solutions. On the one hand, the redundant computation in both segmentation and detection workloads keeps dropping until $s = 5$. The difference in computation overhead between $s = 2$ and $s = 4$ can still be up to 1.2× for detection and 1.3× for segmentation. Thus, for devices with limited parallelism, it is beneficial to increase the number of splits despite the increased DRAM traffic. On the other hand, when running detection workloads on devices with high parallelism, the 2.4-2.9× computation overhead of the unsorted dataflow in Figure 5 is completely acceptable. We will demonstrate in Table 3 and Table 4 that kernels for detection do not run faster despite having ∼2× lower computation overhead on RTX 3090, which has an ample 71 TFLOPS FP16 peak throughput.

Group-Based Configuration Tuning
To this end, we designed a sparse and dynamic-shape kernel generator with minimal help from dense and fixed-shape tensor compilers. By doing so, we obtain high-performance sparse convolution kernels with different dataflows (e.g., fetch-on-demand and implicit GEMM) and augment the design space of implicit GEMM itself by introducing an arbitrary number of mask splits. However, no dataflow is perfect for all workloads. As discussed in Section 2, fetch-on-demand has zero redundant computation but suffers from large DRAM scattering traffic, while implicit GEMM has the exact opposite property. Similarly, there is no single set of parameters that works for every dataflow. For example, the number of splits $s$ in implicit GEMM reflects the tradeoff between redundant computation and control flow overhead (e.g., sorting $s$ individual bitmasks and reordering the maps). Therefore, the enlarged design space necessitates an autotuning system that can automatically determine the optimal dataflow and dataflow-specific parameters for different workloads.
To determine the optimal dataflow for different layers, we divide all layers into groups (illustrated in Figure 12). All layers within each group use the same input-output mappings (maps) and are forced to execute the same dataflow. This is because different dataflows require different map structures. Implementations such as gather-GEMM-scatter and fetch-on-demand require the maps to be stored in a weight-stationary order, represented as $M_\delta = \{(p_i, q_j) \mid q_j = p_i + \delta,\ q_j \in P^{out}\}$, which makes it difficult to infer all the neighbors of an output point (required by implicit GEMM). On the other hand, the implicit GEMM implementation stores the maps in an output-stationary order, represented by the matrix $M$ defined in Section 3.1, which makes it difficult to infer all the inputs that use the same weight (required by the other two dataflows). Generating maps for all dataflows but using only one of them at runtime would incur significant overhead (comparable to the latency of up to 3-4 sparse convolution layers within each group!). Therefore, allowing intra-group heterogeneous dataflow selection is not desirable. After group partition, we apply a group-level exhaustive search on a random subset of the target workload (e.g., 100 scenes on the Waymo dataset). Since the execution time of each group is independent of the others, we tune the dataflow parameters in a greedy manner: we iterate over all possible choices for the $i$-th group based on the optimally-tuned configurations for the 1st to $(i-1)$-th groups, using default parameters for all subsequent groups. This approach effectively reduces the tuner complexity from exponential to linear and allows us to complete tuning within 2 minutes for most workloads. Considering that the tuned schedule can be reused for millions of scenes in real-world ADAS applications during inference, the cost is clearly justifiable.
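The greedy group-by-group search can be summarized as follows. This is a schematic sketch: `benchmark_end_to_end`, the configuration list, and the group structure are illustrative placeholders for the tuner's actual interfaces.

```python
def tune_groups(groups, candidate_configs, benchmark_end_to_end):
    """Greedy group-based autotuning (linear in the number of choices).

    groups: ordered list of layer groups sharing the same maps/dataflow.
    candidate_configs: dataflow choices, e.g. implicit GEMM with s splits
        or fetch-on-demand, each with its tile sizes.
    benchmark_end_to_end(assignment) -> measured latency on a small
        calibration subset (e.g. 100 scenes).
    """
    # Start from default parameters everywhere.
    assignment = {g: candidate_configs[0] for g in groups}
    for g in groups:  # tune group i with groups 1..i-1 fixed at their optima
        best_cfg, best_lat = None, float("inf")
        for cfg in candidate_configs:
            assignment[g] = cfg
            lat = benchmark_end_to_end(assignment)
            if lat < best_lat:
                best_cfg, best_lat = cfg, lat
        assignment[g] = best_cfg  # freeze before moving to the next group
    return assignment
```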
We further extend the Sparse Autotuner to support training workloads. The most straightforward design assumes that the back-propagation kernels (i.e., dgrad for feature map gradient calculation and wgrad for weight gradient calculation) share the same dataflow parameters as the forward kernel. However, as analyzed in Section 6.1, such a design incurs up to 10% performance regression in end-to-end training. Naively decoupling the tuning process for training workloads leads to an unacceptable $O(n^3)$ tuning complexity, with $n$ being the size of our design space. To address this complexity issue, we partially bind the dataflow parameters for forward, dgrad, and wgrad kernels. We propose two binding schemes: the workload-pattern-oriented scheme binds the dataflow parameters for forward and dgrad kernels while allowing wgrad kernels to be tuned separately, reducing the tuning complexity to $O(n^2)$ and minimizing the total latency of all sparse convolution kernels. We also propose the sparse-mapping-oriented scheme, which binds dgrad and wgrad kernels together since they share the same maps, minimizing the overhead of map computation. Similar to our observations in inference kernel autotuning, high-parallelism devices (e.g., A100) are far less sensitive to redundant computation than to mapping overhead, while low-parallelism devices (e.g., 2080 Ti) behave in the exact opposite way. This explains our design choice to use scheme 1 for low-end devices and scheme 2 for more powerful GPUs. As a final remark, we further notice in Figure 13 that the tuning time can be reduced from $O(n^2)$ to $O(n)$ if we reuse the group-based tuner in Figure 12 twice and skip different parts of the kernels with dummy initializations during tuning.

Results
Inference. We compare our results with the baseline designs, including MinkowskiEngine, SpConv 1.2.1, TorchSparse and SpConv 2.3.5, in Figure 14. All evaluations are done with unit batch size. TorchSparse++ consistently outperforms all baseline systems on GPUs of all architectures under three numerical precisions by a large margin. On cloud Ampere GPUs (A100 and 3090), it achieves 2.9-3.7×, 3.2-3.3×, 2.0-2.2× and 1.4-1.7× measured end-to-end speedup over the state-of-the-art MinkowskiEngine, SpConv 1.2.1, TorchSparse and SpConv 2.3.5, respectively. We also compare TorchSparse++ with SpConv 2.3.5 on NVIDIA Jetson Orin, an edge GPU platform widely deployed on real-world autonomous vehicles. Our TorchSparse++ is 1.25× faster than SpConv 2.3.5 on average, while achieving a consistent 1.3-1.4× speedup across all detection workloads, which are the most time-critical in real ADAS applications. In addition, TorchSparse++ is competitive on legacy GPU architectures (Turing and Pascal), achieving at least 1.4× speedup over MinkowskiEngine. Notably, recent advances in point cloud transformers [32, 39, 43] often claim superior accuracy-latency tradeoffs over sparse convolutional backbones implemented with the SpConv v2 backend. With the much faster TorchSparse++ backend, and assuming that the 2D part is deployed with TensorRT, the 3-frame CenterPoint model on Waymo is 1.5× faster than FlatFormer [32] with higher accuracy on Orin.
Training. We also compare the training performance of TorchSparse++ and existing systems on A100 and 2080 Ti GPUs in Figure 15. We run the forward and backward passes of all workloads with a batch size of 2 in mixed-precision training (i.e., all gradients are calculated in FP16 precision), except for MinkowskiEngine, which does not support FP16. We make sure that all workloads evaluated in Figure 15 reach the same accuracy with the TorchSparse++ backend as with TorchSparse (for segmentation workloads) and SpConv 2.3.5 (for detection workloads) in FP32 precision. Given that A100 FP16 tensor core arithmetic has 16× higher throughput than FP32 (non-tensor-core) computation (312 TFLOPS vs. 19.5 TFLOPS), we do not perform FP32 evaluation. As a result, TorchSparse++ is 4.6-4.8×, 2.5-2.6× and 1.2-1.3× faster than MinkowskiEngine, TorchSparse and SpConv 2.3.5 on both Ampere and Turing GPUs. TorchSparse++ paves the way for rapid model iteration in real-world ADAS applications.
Comparison against Accelerators. We further compare the performance of TorchSparse++ on RTX 3090 against a scaled-up version of PointAcc [28] using the SemanticKITTI-MinkUNet workload. The systolic array in PointAcc is enlarged from 64×64 to 128×128 to roughly match the number of MACs on RTX 3090 (16384 vs. 20992). The PointAcc memory bandwidth is scaled up accordingly. Since the accelerator adopts IC-OC parallelism, we assume that the scaled PointAcc-L achieves linear speedup if the executed layer has sufficiently large input and output channels. We also scale the measured TorchSparse++ latency by 1.7 (clock frequency difference) × 1.3 (peak MACs difference) = 2.2× for a fair comparison. As a result, TorchSparse++ achieves 56% of the ASIC speed on a general-purpose hardware platform with a similar computation budget. We also attempted a direct comparison with Mesorasi [15], which co-designs the point cloud convolution algorithm with the hardware architecture. However, its delayed aggregation scheme only works for convolution operators with shared weights for all neighbors. The main workload accelerated in this paper, sparse convolution, is more complicated because it has different weights for different neighbors (see Figure 2). Therefore, such a comparison is hard to achieve.
Results on Graph Workloads. We also implement R-GCN [36] with TorchSparse++ and benchmark it on five representative heterogeneous graph datasets against the state-of-the-art graph deep learning systems DGL [44], PyG [16] and Graphiler [46], achieving 2.6-7.6× faster inference.

ANALYSIS
In this section, we present an in-depth analysis of the design choices of our Sparse Autotuner and Sparse Kernel Generator and ablate the sources of the performance gains in Section 5.

Design Space of Sparse Autotuner
As discussed in Section 3, the design space of TorchSparse++ is a superset of SpConv v2's. We have added several new features to this space, including support for unsorted implicit GEMM, implicit GEMM with an arbitrary number of mask splits (>2), and the fetch-on-demand dataflow. The flexibility of TorchSparse++ also allows us to explore different dataflow parameter bindings for forward, dgrad, and wgrad computation. As such, we challenge conventional designs that share the same dataflow parameters across all kernels.

Table 4: Sparse convolution kernel latency: unsorted implicit GEMM kernels can be slower than their mask-split counterparts, which is the exact opposite of the Table 3 results.
In the following two subsections, we will evaluate the effectiveness of all these new design choices in TorchSparse++.
Effectiveness of unsorted implicit GEMM. We first demonstrate the efficacy of the unsorted implicit GEMM dataflow (Figure 5) against the sorted implicit GEMM dataflow in SpConv v2. As shown in Table 3, the unsorted dataflow is consistently faster on both server and edge GPUs. We further present a runtime comparison of all sparse convolution kernels between the unsorted and sorted dataflows in Table 4. Interestingly, if we only consider the runtime of the convolution kernels, the sorted dataflow is indeed faster. However, the latency difference between Table 3 and Table 4 reveals that the sparsity-incurred mapping overhead (e.g., obtaining the bitmask, sorting the bitmask, performing bitmask reduction and reordering the maps) in the sorted dataflow is non-negligible.
Moreover, Figure 17 shows a layerwise comparison of these two versions of TorchSparse++, in which the gain from the reduction in computation is outweighed by the overhead of sorting itself on Waymo object detection. However, sorting does show an advantage for a larger segmentation model (MinkUNet) on the SemanticKITTI benchmark. Our observation challenges the design principle of SpConv v2, which uses the amount of computation as a first-order approximation of end-to-end performance. It also nullifies the assumption that a faster computation kernel is equivalent to better end-to-end performance.

Table 5: We evaluate the performance of the SemanticKITTI-MinkUNet workload on an RTX 3090 and find that expanding the design space of implicit GEMM by increasing the number of splits leads to up to 1.4× improvement compared to the default setting (split = 1) in SpConv v2.

Effectiveness of larger mask split design space. We have shown the effectiveness of unsorted implicit GEMM. Additionally, we find that a larger number of splits is also beneficial for segmentation workloads, as demonstrated in Table 5. The parallelism of an implicit GEMM kernel is increased by $s\times$ with $s$ splits. Because segmentation workloads usually have a smaller number of input points, they are more prone to device under-utilization, so the increased parallelism is beneficial. Similarly, the overhead of the mapping and partial sum reduction kernels is smaller in segmentation workloads. The significantly reduced computation overhead (Figure 11) further supports the preference for a larger number of splits in these scenarios.
Effectiveness of adding fetch-on-demand. We then choose 1-frame MinkUNet on nuScenes running on RTX 2080 Ti and Orin as a benchmark to demonstrate the efficacy of the fetch-on-demand dataflow. As shown in Figure 18, the individually-tuned implicit GEMM and fetch-on-demand dataflows both achieve inferior performance compared with the hybrid-dataflow TorchSparse++. We further present the layerwise latency breakdown of the best-tuned implicit GEMM and fetch-on-demand configurations in Figure 18b, where we amortize the mapping time over all layers within each layer group (defined in Section 4). The end-to-end performance of fetch-on-demand is notably better than implicit GEMM in decoder layers (i.e., layer index > 18) but worse in downsampling layers, where the maps $M$ cannot be reused. This is because implicit GEMM has a lower mapping cost, while fetch-on-demand computation kernels run faster for the given workload.
Figure 20: Naively converting fixed-shape dense tensor programs to flexible-shape sparse convolution kernels incurs 1.5-1.7× runtime overhead due to repetitive pointer calculation. We bridge this large performance gap via loop invariant hoisting and show that constant folding is unnecessary for high-performance sparse kernels.

Effectiveness of tuner design for training. We finally demonstrate that decoupling the dataflow parameters for forward, dgrad and wgrad kernels can improve training performance by up to 10%, as shown in Figure 22. On both A100 and 2080 Ti, binding the parameters for two of the kernels is better than using the same parameters for all three. On A100, binding dgrad and wgrad is better: this strategy minimizes mapping overhead, and there is a drastic performance difference (16×) between the tensor cores (which run computation) and the CUDA cores (which run mapping) on A100. On 2080 Ti, binding forward and dgrad is better, since the two kernels share the same workload pattern. Given the much smaller performance gap between tensor and CUDA cores on 2080 Ti (3×), the additional mapping overhead of decoupled wgrad and dgrad is acceptable.

Sparse Kernel Generator
In this section, we present an analysis of the effectiveness of the design choices outlined in Section 3. Our experiments were conducted on 3090 GPUs, using FP32 precision for offline reordering and FP16 precision for all other experiments. Our results demonstrate that simplifying control flow and addressing is critical for achieving optimal performance in sparse kernels. Additionally, we find that the conventional wisdom of fusing GPU kernels as much as possible may not always apply in the context of sparse computing.
Effectiveness of offline reordering. We present the effectiveness of offline reordering in Figure 19. As described in Section 4, our approach involves reordering computations based on the values of the bitmasks in the implicit GEMM dataflow with mask splitting. While conventional wisdom in GPU kernel design suggests fusing kernels as much as possible (including performing the reordering inside the sparse convolution kernel), our experiments demonstrate that this can lead to a 4-12% reduction in end-to-end performance compared to offline reordering. Specifically, the wgrad kernels must iterate over the $N_{out}$ dimension in the large, innermost $k$ loop. Online reordering introduces an additional level of indirect addressing to the memory access in this innermost loop, which disrupts the contiguous access pattern and results in a significant slowdown for wgrad.
Effectiveness of control flow simplification. We use MinkUNet on SemanticKITTI as an example to illustrate the importance of simplifying addressing and control flow. In Figure 20, we evaluate the benefits of loop invariant hoisting. The results show that a naively converted template can be very inefficient: up to 1.7× slower than the original fixed-shape CUDA kernel. However, with loop invariant hoisting, in which we move all common pointer offsets to the outermost possible loop, we can almost entirely eliminate the pointer arithmetic overhead. After applying this technique, our templated CUDA kernel can even run slightly faster than the original fixed-shape kernels in 5 of 7 sample workloads. Figure 21 shows the benefits of reducing control flow instructions by padding the map in Figure 7. The instructions performing boundary checking can make the kernel up to 1.3× slower; eliminating these control flow instructions through padding solves the problem.
Effectiveness of adaptive tiling. We experiment with two sets of tiling sizes in TorchSparse++, chosen depending on the MACs of the workload. Adaptive tiling provides up to 1.6× speedup to TorchSparse++ compared with a fixed-tiling version (always using either the small or the large tile sizes).

Discussions
Summary of performance gain. In Figure 23, we summarize the performance improvement achieved by our Sparse Kernel Generator and the enlarged design space. Our generator produces high-performance sparse convolution kernels that are 1.1-1.2× faster than SpConv 2.3.5, even when using the same dataflow parameters. Remarkably, our code generator comprises only 5% of the lines of code of SpConv 2.3.5's metaprogrammer, which significantly reduces system complexity and enhances programmer productivity. Within the enlarged design space, more mask splits are most helpful for segmentation workloads and FP32 precision, while unsorted implicit GEMM helps detection workloads and FP16 precision. The efficacy of fetch-on-demand is mainly demonstrated on smaller segmentation workloads (e.g. NS-M). These results reinforce that there is no one-size-fits-all strategy for sparse kernel design, and that first-order approximations of end-to-end performance are unreliable.
Insights for microarchitectural improvements. TorchSparse++ also provides new insights for future microarchitecture design. Our findings indicate that when memory bandwidth is halved on an RTX 3090, system latency increases by 1.2×; in contrast, halving peak computation throughput results in a more substantial 1.4× slowdown. Therefore, scaling computation units rather than off-chip memory bandwidth provides more effective improvements. Moreover, Table 3 and Table 4 show that mapping operations account for up to 50% of total runtime. Leveraging efficient ASIC designs [28] for these operators could significantly enhance GPU performance on sparse computation workloads.
Future applications. The TorchSparse++ platform presents novel opportunities for enhancing machine learning workloads beyond point clouds and graphs. For instance, in image segmentation [26] and video recognition [33], not all pixels hold equal significance; selectively computing on a sparse subset of pixels with TorchSparse++ can therefore significantly enhance efficiency. Furthermore, masked autoencoders (MAEs) [20] exhibit inherent sparsity in their input patterns during training. While existing approaches already exploit this sparsity with sparse convolution [22,42], we posit that TorchSparse++ can unlock even greater speedups for such workloads.

RELATED WORK
Compiler-Based Tensor Program Optimization. Our system benefits from recent advances in tensor program compilation. The pioneering work TVM [4] provides graph-level and operator-level abstractions for deep learning workloads, building on the essence of Halide [35]. On top of TVM, AutoTVM [5] automatically discovers the optimal mapping of a fixed-shape tensor program onto the target hardware. Nimble [37] and DietCode [57] are compilers derived from TVM that can generate tensor programs for dynamic-shape workloads, but they remain tailored to dense workloads (e.g. transformers with variable-length input sequences) and cannot handle the sparsity of point clouds. More recently, TensorIR [13] proposed a new IR for tensor programs that allows easier tensorization of accelerator primitives, and SparseTIR [52] extended TensorIR to support sparse workloads. Bolt [47] combines the advantages of fully automatically generated kernels [4] with hand-written subroutines [24] through graph matching.
Point Cloud Accelerators. Deep learning on point clouds has also generated considerable interest in domain-specific accelerator design. Zhu et al. [59] proposed a sparse-wise dataflow that skips cycles for zero-weight computations and saves energy through gating. Mesorasi [15] co-designed its architecture with a delayed-aggregation algorithm to reduce redundant computation in point cloud NNs. More recently, Point-X [55] exploited spatial locality in point clouds through clustering, mapping point clouds onto distributed computation tiles to maximize parallelism and minimize data movement. PointAcc [28] mapped all mapping operators in point cloud NNs onto a versatile bitonic sorter, making it the first specialized accelerator to support 3D sparse convolution. Crescent [14] tamed irregularities in point clouds through approximate neighbor search and selective bank-conflict elision, while Ying et al. [54] pushed point cloud compression to edge devices through intra- and inter-frame compression.

CONCLUSION
We introduce TorchSparse++, a high-performance GPU sparse computation library designed for point cloud and graph deep learning. TorchSparse++ features a highly optimized Sparse Kernel Generator built at less than one-tenth of the engineering cost of the state-of-the-art system, which in turn enables an input-aware Sparse Autotuner that selects the best configuration for each layer. TorchSparse++ achieves 1.7-3.3× inference speedup and 1.2-3.7× faster training compared with state-of-the-art MinkowskiEngine, SpConv v1/v2, and TorchSparse on seven real-world perception workloads. It also achieves 2.6-7.6× speedup over DGL, PyG and Graphiler when running R-GCNs. We hope that TorchSparse++ will facilitate future system and microarchitectural research on sparse computation for 3D data and graphs.

Figure 3: Waterfall diagram for different dataflows for sparse convolution on GPU: weight-stationary dataflows (a, b) are easier to implement and maintain but do not overlap memory access with computation. Both fetch-on-demand and implicit GEMM dataflows require custom MMA routines but are able to hide the memory access time with pipelining.

Figure 4: Illustration of the gather-GEMM-scatter dataflow for the Figure 2 workload: we first gather input features according to M_in for each weight, then perform GEMM or batched GEMM, and finally scatter the results back to the output locations given in M_out.

Figure 5: Illustration of the unsorted implicit GEMM dataflow for the Figure 2 workload: each gray grid corresponds to a c_in-dimensional input feature and blue grids correspond to redundant computation. The input feature matrix is not stored in DRAM. We assume that each thread block contains 4 threads (4 rows).

Figure 6: SpConv v2 sorts the input bitmasks and reorders the computation accordingly. White grids are skipped zero computation. Consequently, redundant computation is reduced from 34 MACs (Figure 5) to 26 for the Figure 2 example.

Figure 7: We introduce the Sparse Kernel Generator, a code generator that integrates on-chip MMA subroutines from [4] directly at the source code level, unlocking the potential of using dense, fixed-shape tensor compilers to generate programs for sparse, dynamic-shape workloads. Gray: constant code; red: fixed metaprogramming template; blue: generated automatically by an existing tensor compiler for each tile size.

Figure 8: For sparse convolution workloads (MinkUNet on SemanticKITTI), our template can match or even exceed cuBLAS utilization for the equivalent-sized GEMM problem by tuning only the tile size parameters.

Figure 10: We extend the implicit GEMM design space by introducing an arbitrary number of mask splits. Compared with Figure 6b (1 split), splitting the mask into three parts further reduces redundant computation and increases parallelism.

Figure 11: A large design space for the number of splits in implicit GEMM is beneficial: (a) redundant computation in segmentation workloads continues to drop quickly until splits = 5; (b) redundant computation in detection workloads at splits = 0 (unsorted) is acceptable on high-parallelism devices.

Figure 13: Parameter binding in the training tuner: we partially decouple the dataflow parameters for the forward, dgrad and wgrad kernels in training, which leads to up to 10% improvement in end-to-end training time.

Figure 17: Sorting reduces computation time, but its overhead outweighs the benefit on detection workloads.

Figure 18: Fetch-on-demand and implicit GEMM dataflows are complementary on FP32 segmentation workloads. A hybrid dataflow is up to 1.06× faster than the best single dataflow.

Figure 22: Unlike dense kernels, sparse forward, dgrad and wgrad kernels have different preferences for dataflow parameters. Binding hyperparameters across all kernels can hurt training performance by up to 10%.

Figure 23: Summary of performance gains from the different techniques and the enlarged design space in TorchSparse++.

Table 1: The different sparse convolution dataflows in Section 2 can all be mapped onto GPUs as dense GEMM with sparse global-memory iterators.

Table 3: End-to-end latency: unsorted implicit GEMM is up to 1.2× faster despite incurring up to 1.7× redundant computation.