Dynamic N:M Fine-grained Structured Sparse Attention Mechanism

Transformers are becoming the mainstream solution for various tasks like NLP and computer vision. Despite their success, the high complexity of the attention mechanism hinders them from being applied to latency-sensitive tasks. Tremendous efforts have been made to alleviate this problem, and many of them successfully reduce the asymptotic complexity to linear. Nevertheless, most of them fail to achieve practical speedup over the original full attention under moderate sequence lengths and are unfriendly to finetuning. In this paper, we present DFSS, an attention mechanism that dynamically prunes the full attention weight matrix to an N:M fine-grained structured sparse pattern. We provide both theoretical and empirical evidence that DFSS is a good approximation of the full attention mechanism. We propose a dedicated CUDA kernel design that completely eliminates the dynamic pruning overhead and achieves speedup under arbitrary sequence length. We evaluate the 1:2 and 2:4 sparsity under different configurations and achieve 1.27∼1.89× speedup over the full attention mechanism. It takes only a couple of finetuning epochs from the pretrained model to achieve accuracy on par with the full attention mechanism on tasks from various domains under sequence lengths from 384 to 4096.


Introduction
Transformers [27] have achieved competitive performance across various domains like NLP [19] and computer vision [7]. The key feature that sets them apart from traditional neural network architectures is the attention mechanism [27], which allows transformers to gather information from the embeddings of elements in the input sequence in an adaptive and learnable manner.
Nevertheless, the high computation cost and memory footprint brought by the attention mechanism make it difficult to apply transformers to latency-sensitive tasks. Many approaches have been proposed to address these issues, but they exhibit different kinds of limitations. Mishra et al. [17] apply static N:M fine-grained structured sparsity to the linear projections and fully-connected layers in transformers. However, unlike the static weight matrices, the attention weight matrix is dynamically generated during inference. Therefore, it is nontrivial to directly adopt this technique in the attention mechanism, as it incurs dynamic pruning overhead. Tay et al. [25] survey a wide range of efficient attention mechanisms. Even though many of them reduce the asymptotic complexity to linear, there are still some drawbacks. First, tremendous engineering effort is required to deploy and optimize these techniques, as they usually have complex computation graphs and introduce a handful of hyper-parameters to tune on specific tasks. Second, these techniques drastically modify the original full attention mechanism, so they usually need to be trained from scratch instead of exploiting pretrained models like BERT [6]. As importantly, previous methods usually introduce additional operators like top-k and sorting that cause large overheads and make them impractical at moderate sequence lengths.
In this paper, we present Dfss, a simple and effective sparse attention mechanism that addresses the limitations mentioned above.
First of all, unlike traditional static pruning that can only be applied to the weight matrices, Dfss dynamically prunes the full attention score matrix using N:M fine-grained structured sparse patterns. Therefore, it can be jointly applied with static weight pruning to accelerate the whole transformer model. Second, it requires minimal changes to the original full attention and has no hyper-parameters to tune. This makes it a drop-in replacement for the full attention that only requires changing a few lines of code. Third, our method can directly leverage pretrained models and achieve accuracy on par with full attention even without finetuning. This is because it dynamically selects N elements out of M based on their magnitude, so that the most important entries are preserved. Finally, our method achieves wall-clock time speedup and memory footprint reduction over the full attention at arbitrary sequence length. This is attributed to our dedicated CUDA kernel design that dynamically prunes the attention weight matrix on the fly with zero overhead. Besides, our method can be combined with existing linear attention mechanisms like Nystromformer [31] for higher speedup.
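To make the drop-in claim concrete, the following is a minimal NumPy sketch (not the paper's CUDA implementation) contrasting full attention with a DFSS-style dynamically pruned attention. The function names and the group-wise magnitude-based selection are illustrative assumptions; only the pruning lines differ between the two functions.

```python
import numpy as np

def full_attention(Q, K, V):
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # attention scores
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)             # softmax
    return A @ V

def dfss_attention(Q, K, V, n=2, m=4):
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    # Dynamic N:M pruning: in every group of m consecutive scores per row,
    # keep the n largest-magnitude entries and mask the rest to -inf.
    groups = S.reshape(S.shape[0], -1, m)
    kth = np.sort(np.abs(groups), axis=-1)[..., m - n][..., None]
    mask = np.abs(groups) >= kth
    pruned = np.where(mask, groups, -np.inf).reshape(S.shape)
    A = np.exp(pruned - pruned.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)             # softmax over survivors only
    return A @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
out_full = full_attention(Q, K, V)
out_dfss = dfss_attention(Q, K, V)
```

Setting n = m recovers full attention exactly, which is one way to see that the change is confined to the pruning step.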
While the techniques and observations in this paper apply equally to all N and M, following Pool & Yu [21], we focus on 1:2 and 2:4 structured sparsity [17] (particular forms of the more general N:M sparsity) since it is supported by off-the-shelf GPUs [18]. Our main contributions are summarized below:
• We propose Dfss, a dynamic N:M sparse attention mechanism that is a drop-in replacement for the full attention mechanism and orthogonal to existing efficient attention mechanisms. Its effectiveness is justified by both empirical and theoretical evidence. To the best of our knowledge, it is the first dynamic pruning technique under N:M sparsity.
• We present a dedicated CUDA kernel design that completely removes the pruning overhead. The pruning is implemented as an epilogue of the dense matrix multiplication which produces the attention score matrix. It is the first operator in the deep learning software stack that dynamically prunes dense matrices and generates sparse encodings with zero overhead.
• We evaluate Dfss on tasks across various domains and sequence lengths on an NVIDIA A100 GPU. It achieves 1.27∼1.89× speedup over the full attention with no accuracy loss.

Background and Motivation
We first introduce the preliminaries, notations, and background of our paper.

Full Attention Mechanism
Given an input sequence X = (x_1, ..., x_n) ∈ R^{n×d}, the full attention mechanism can be defined as

Attention(Q, K, V) = Softmax(QK^T / √d) V,    (1)

where Q = XW_q, K = XW_k, and V = XW_v are the query, key, and value matrices. QK^T forms a full-quadratic adjacency matrix whose edge weights are the dot-product similarities between all the elements in the sequence. This adjacency matrix is standardized with 1/√d to keep a unit second moment and then normalized with softmax. At last, the row feature vectors in V are aggregated according to the normalized adjacency matrix by multiplying them together. In the rest of this paper, we denote A = Softmax(QK^T / √d). We refer to QK^T as the attention score matrix and A as the attention weight matrix.

Efficient Attention Mechanism
The high computation cost and memory footprint of the full attention mechanism come from A, whose size grows quadratically with the sequence length n. To address this issue, various efficient attention mechanisms have been proposed [25].
Fixed Sparse Patterns. Zaheer et al. [33] and Beltagy et al. [1] apply a set of fixed sparse attention patterns to A, like global attention and sliding-window attention. These patterns are constructed from empirical observations and designed to be GPU-friendly to achieve wall-clock time speedup. However, as these patterns are designed empirically and fixed during inference, there is no guarantee that they can always capture the important entries in A or transfer easily across different tasks.

Dynamic Sparse Patterns. Ham et al. [9] dynamically generate fine-grained sparse attention patterns on A with low-cost binary hashing. However, this technique requires specialized hardware to achieve speedup, so it is not available on general-purpose hardware like GPUs. Tay et al. [24], Roy et al. [22], and Kitaev et al. [13] apply various clustering methods and only compute the attention within each cluster. Although computing full attention in each cluster is more GPU-friendly than fine-grained sparsity, the clustering methods contain several GPU-unfriendly operators like top-k and sorting that offset their benefits under moderate sequence lengths.

Low-Rank and Kernel Methods. Low-rank projection and kernel construction also introduce considerable overhead. This makes these methods effective only under long sequence lengths.

Besides, because the previous studies drastically change the attention mechanism, tens of thousands of pretraining or finetuning steps are required to reach performance comparable with the original full attention mechanism, so they require tremendous engineering effort to deploy.

Static N:M Fine-grained Structured Sparsity
Static pruning under N:M fine-grained structured sparsity has become a popular trend in recent years. On the software side, N:M fine-grained structured sparsity has been applied to the static weight matrices in various neural network models including transformers [37,21,17]. It can effectively accelerate the feed-forward part of the transformer by up to 1.9×. On the hardware side, N:M fine-grained structured sparsity is also adopted in recent deep learning accelerator designs [38,15]. In particular, NVIDIA introduces fine-grained structured sparsity in the A100 GPU. As shown in Figure 1, the dense input matrix is pruned with fine-grained structured pruning. If the data type is float, 1:2 sparsity is used, which selects the larger of every two consecutive entries. If the data type is bfloat16 or float16, 2:4 sparsity is used, which selects the two larger ones among every four consecutive elements. After the pruning, the result is compressed into nonzeros and metadata. The nonzeros contain the values of the retained data, which is 50% smaller than the original matrix. The metadata records the indices of the nonzeros in the original matrix. It takes 4 bits of metadata to record the decision of each 1:2 or 2:4 selection. Therefore, the metadata is only 1/16 of the original dense matrix in terms of bits. This compressed sparse matrix can be multiplied with a dense matrix under the support of the sparse tensor core to achieve significant speedup.
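The compression format described above can be sketched in NumPy as follows. The 4-bit position code used here (two 2-bit indices packed into a nibble) is a simplified stand-in for the actual hardware codebook of Figure 6 (b), which differs in detail; the sizes, however, match the text.

```python
import numpy as np

def compress_2to4(dense):
    rows, cols = dense.shape
    groups = dense.reshape(rows, cols // 4, 4)
    # indices of the two largest-magnitude entries in each group of four
    idx = np.argsort(np.abs(groups), axis=-1)[..., 2:]
    idx.sort(axis=-1)                        # keep original left-to-right order
    nonzeros = np.take_along_axis(groups, idx, axis=-1)
    # pack the two 2-bit positions into one 4-bit code per group
    # (illustrative encoding; the real codebook is different)
    meta = (idx[..., 0] | (idx[..., 1] << 2)).astype(np.uint8)
    return nonzeros, meta

def decompress_2to4(nonzeros, meta, cols):
    rows = nonzeros.shape[0]
    out = np.zeros((rows, cols // 4, 4))
    i0 = (meta & 0b11).astype(np.int64)
    i1 = (meta >> 2).astype(np.int64)
    np.put_along_axis(out, np.stack([i0, i1], axis=-1), nonzeros, axis=-1)
    return out.reshape(rows, cols)

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 16))
nz, meta = compress_2to4(M)
```

For a bfloat16 dense matrix (16 bits per element), one 4-bit code per group of four elements is exactly 1/16 of the dense matrix in bits, as stated above.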
However, to the best of our knowledge, no previous study applies N:M fine-grained structured sparsity to dynamic matrices like the attention weight matrix in transformers. This is because current software libraries designed for pruning under N:M sparsity incur high pruning overhead. During pruning, the whole dense matrix to be pruned is read from memory. Then, after selecting the elements to be retained under the N:M pattern, the library must also generate the metadata encoded in a special format such that the metadata can be used efficiently later. All these overheads would offset the benefit brought by the sparsity if pruning were done on the fly.

Dynamic N:M Fine-grained Structured Sparse Attention Mechanism

In this section, we first give an overview of our Dfss method. Then, we discuss the design considerations of exploring sparsity in attention and the choice of sparse granularity in our method for a GPU-friendly implementation and effectiveness. Finally, we briefly introduce our GPU kernel design to remove the pruning overhead. We also discuss how to combine our method with existing linear attention mechanisms in Appendix A.7.
Our proposed Dfss mechanism is simple and effective. Figure 2 illustrates the case under 1:2 or 2:4 sparsity. Compared with the full-quadratic attention mechanism, our method dynamically prunes attention scores without incurring storage or computation overhead, while maintaining the effectiveness of attention. More importantly, our method can achieve practical speedup of attention on existing GPU hardware with customized CUDA kernels. Figure 3 shows all the modifications to be made to use Dfss.

Design Considerations for Exploiting Attention Sparsity
As illustrated in Figure 4, the attention mechanism can be considered as three stages: QK^T, Softmax, and AV. To design a sparse attention mechanism, the first decision to make is where we should introduce dynamic pruning to sparsify the attention.
If we start from stage 0, all three stages benefit from the sparsity given an effective implementation on GPU: the dense matrix multiplication between Q and K is replaced with a sampled dense-dense matrix multiplication (SDDMM) which only computes the entries identified by the sparse pattern; the Softmax only operates on the nonzero values in each row; and the original dense matrix multiplication between A and V is replaced with a sparse matrix-matrix multiplication (SpMM) which multiplies a sparse matrix with a dense matrix. However, as it is not possible to know exactly which entries in QK^T have higher magnitude before computing QK^T, starting from stage 0 usually requires additional components to predict the location of important entries.
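For concreteness, here is a minimal NumPy sketch of the two sparse primitives named above: SDDMM, which computes only the score entries selected by a mask, and SpMM, which multiplies the resulting sparse matrix with a dense one (emulated densely here; the mask and shapes are illustrative).

```python
import numpy as np

def sddmm(Q, K, mask):
    # Only entries where mask is True are computed; the rest stay zero.
    S = np.zeros(mask.shape)
    rows, cols = np.nonzero(mask)
    S[rows, cols] = np.einsum('id,id->i', Q[rows], K[cols])  # per-entry dot products
    return S

def spmm(sparse, dense):
    # Dense stand-in for sparse x dense multiplication.
    return sparse @ dense

rng = np.random.default_rng(2)
Q, K, V = (rng.standard_normal((6, 8)) for _ in range(3))
mask = rng.random((6, 6)) < 0.5
S = sddmm(Q, K, mask)
```

On GPUs the payoff of SDDMM comes from skipping the masked-out dot products entirely, which is exactly what is hard to make profitable at moderate sparsity.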
Starting from stage 1 requires us to compute a dense matrix multiplication between Q and K. The benefit is that we can explicitly select important entries from QK^T without prediction. As softmax is a monotonically increasing function, starting from stage 2 does not offer any benefit over stage 1 but throws away the opportunity to accelerate the Softmax.
In this paper, we choose to start from stage 1 based on two considerations. First, replacing the dense matrix multiplication with SDDMM at QK^T offers limited speedup even at high sparsity. Chen et al. [3] show that it is difficult for SDDMM to achieve speedup over its dense counterpart under 80% sparsity even with some structured design. Second, starting from stage 1 allows us to keep our design simple, so that it does not introduce additional overhead or hyper-parameters to tune.

Granularity of Sparse Attention Patterns
The second decision to make is what sparse pattern to use, as it tremendously affects the latency of SpMM as well as the overhead of encoding the sparse QK^T. Existing studies exploit various sparse encoding schemes. For instance, compressed sparse row (CSR) is popular for encoding fine-grained sparsity. However, CSR-based SpMM requires over 95% sparsity to be on par with its dense counterpart [3]. Block sparsity is also widely used, as it can bring considerable wall-clock time speedup at moderate sparsity given a large block size. However, it cannot capture some fine-grained attention patterns. Moreover, these patterns require data comparisons within the entire row, which is difficult to execute in parallel and unfriendly to GPUs.
We find the N:M fine-grained structured sparsity mentioned in Section 2.3 is a good choice as long as we address the pruning overhead. First of all, the N:M selection is performed locally, so it is easy to execute in parallel. Second, the size of the compressed nonzeros is N/M of the original dense attention matrix, so the succeeding softmax is also accelerated. In particular, we focus on 1:2 and 2:4 sparsity in this paper as they are supported by off-the-shelf GPUs. Powered by the NVIDIA Sparse Tensor Core, the SpMM between the compressed A and V can also achieve 1.7× speedup. Notably, other N:M ratios are also supported given hardware support for multiplication between an N:M sparse matrix and a dense matrix.
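The claim that the succeeding softmax is also accelerated follows because softmax is row-wise: applying it directly to the N/M compressed nonzeros of a row yields the same values as a masked softmax over the full row restricted to the survivors. A small NumPy check of this equivalence (illustrative, not the CUDA kernel):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
row = rng.standard_normal(8)
# 2:4 pruning of one row: keep the 2 largest-magnitude of every 4 scores
groups = row.reshape(2, 4)
keep = np.sort(np.argsort(np.abs(groups), axis=-1)[:, 2:], axis=-1)
nonzeros = np.take_along_axis(groups, keep, axis=-1).reshape(-1)  # length 4

# softmax over the compressed nonzeros (what the kernel would compute) ...
compact = softmax(nonzeros)
# ... equals the masked softmax over the full row, restricted to survivors
full = np.full(8, -np.inf)
cols = (keep + np.arange(2)[:, None] * 4).reshape(-1)
full[cols] = row[cols]
masked = softmax(full)[cols]
```

Because the compressed representation is half the length, the softmax touches half the data.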

Empirical Results of Dfss Mechanism
Empirically, we find this pattern approximates the full attention mechanism well. We first finetune a BERT-large model on SQuAD v1.1 under full attention. Then, we directly replace the full attention with the 1:2 and 2:4 attention without additional finetuning. The F1-scores are summarized in Table 1 under a 95% confidence interval. The accuracy loss is only around one sigma even without finetuning.

Removing Dynamic Pruning Overhead
As mentioned in Section 2.3, the major challenge that hinders us from using fine-grained structured sparse attention is the pruning overhead. We observe that when computing QK^T, the results are first accumulated in GPU registers and only written to memory when all the computations are done. Therefore, we can implement the pruning as an epilogue of the matrix multiplication: after the accumulation is finished, we compare the data stored in the registers, select the larger ones, and generate the metadata. Then, we only write the retained nonzeros and metadata to memory. This design brings two benefits. First, it completely removes the overhead caused by reading the matrix to be pruned from memory, so the pruning itself has zero overhead. Second, the memory footprint of the attention weight matrix is reduced from n² × 32-bit to (n²/2) × 32-bit + (n²/16) × 32-bit, as the original n × n full attention weight matrix is never written to memory. A more detailed description of the CUDA kernel design, including how to encode the metadata on the fly, is given in Appendix A.1.
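A quick back-of-envelope check of the footprint figures quoted above, for one n × n attention weight matrix stored in 32-bit values:

```python
# Dense footprint vs. the compressed nonzeros + metadata footprint.
n = 4096
dense_bits = n * n * 32
compressed_bits = (n * n // 2) * 32 + (n * n // 16) * 32  # nonzeros + metadata
reduction = dense_bits / compressed_bits                   # = 16/9 ≈ 1.78x
```

So the attention-weight storage shrinks by a factor of 16/9 ≈ 1.78, independent of n.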
To the best of our knowledge, our kernel is the first operator in the deep learning software stack that dynamically prunes a dense matrix and generates its sparse encoding with zero overhead. This is achieved by fusion with the preceding matrix multiplication and our dedicated design of the sparse encoding procedure in the registers under the layout of the Ampere tensor core's output.

Combination with Existing Efficient Transformers
Although our method alone does not address the quadratic complexity of the attention mechanism, many efficient attention mechanisms [31,33,29] reduce the quadratic complexity to linear by computing full attention within sub-sequences. As our method is a good approximation of full attention and brings speedup under arbitrary sequence lengths, it can be combined with existing linear attention mechanisms to achieve higher speedup. Due to the page limit, we add additional discussion in Appendix A.7.

Theoretical Results
In this section, we provide more theoretical and empirical evidence that justifies Dfss as a good replacement for the full attention mechanism. The strategy is to first derive theoretical values for 1) the quality of the approximation with different sparse patterns and 2) the speedup that can be achieved under a certain sparsity. Then, we compare the quality of the different methods under the same speedup.

Attention Lottery Ticket
We borrow the lottery ticket hypothesis [8] and extend it to the attention mechanism. The last step AV in the attention mechanism can be viewed as the aggregation in a graph neural network. Following the Generalized Attention Mechanism [33], we describe it with a weighted directed graph G = (A, X), where A is the adjacency matrix and A_{u,v} > 0 indicates that element x_u attends to x_v. Inspired by Graph Lottery Tickets [2], we propose the Attention Lottery Ticket.
Attention Lottery Ticket (ALT). Given a fully-connected graph G = {A, X} constructed from the full-quadratic attention mechanism [27], the associated sub-graph can be defined as G_s = {m ⊙ A, X}, where m is a binary mask. If a G_s has performance matching or surpassing the original full-quadratic attention mechanism, then we define the sparse attention mechanism with G_s as an attention lottery ticket.

Zaheer et al. [33] have proved the existence of lottery tickets by showing that 1) sparse attention mechanisms are universal approximators of sequence-to-sequence functions when used as encoders, and 2) sparse encoder-decoder transformers are Turing complete. So the remaining problem is how to identify the winning tickets G_s at runtime.

Quality of the Lottery Ticket
A popular strategy that empirically works well is selecting the top-k neighborhood in G based on the magnitude of the edge weights. We refer to this as Top-k Sparsity. Intuitively, this strategy is based on the hypothesis that edges with larger weights are more important. It has been widely adopted in existing studies [8,2,9,28] and has demonstrated its ability to preserve model accuracy at high sparsity ratios. Following this line of work, we define the Quality of an Attention Lottery Ticket as the expectation of the normalized L_p norm over each row of the attention score matrix, where p is a task-dependent factor that indicates how strongly the accuracy depends on the edges with higher magnitude. In this paper, we compare the L_p-Quality of the tickets yielded by three types of sparse patterns: Top-K, fixed, and our dynamic 1:2 and 2:4 patterns. In particular, we have the proposition below. Proposition 4.2. Under the assumption that the entries in QK^T/√d follow i.i.d. N(µ, σ), the qualities of the three patterns admit the closed forms in Equation (3). It follows that Q^p_topk achieves the upper bound of Q^p under density s. Besides, pσ is always positive, which also lower-bounds the quality of the dynamic pattern relative to the fixed one.
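Since the closed-form expressions of Proposition 4.2 are not reproduced here, a Monte Carlo sketch (our own illustrative setup, L1 case) shows the qualitative ordering of ticket qualities at 50% density: global top-k per row attains the best achievable quality, the 2:4 dynamic pattern comes close, and a fixed mask that ignores magnitudes captures only about half the mass.

```python
import numpy as np

rng = np.random.default_rng(4)
S = np.abs(rng.standard_normal((1000, 64)))   # per-row score magnitudes

def quality(selected, scores):
    # fraction of each row's total L1 mass retained, averaged over rows
    return float((selected.sum(-1) / scores.sum(-1)).mean())

# global top-50% per row (the quality upper bound at this density)
q_topk = quality(np.sort(S, axis=-1)[:, 32:], S)
# 2:4 selection: top 2 of every 4 consecutive entries
g = S.reshape(1000, 16, 4)
q_24 = quality(np.sort(g, axis=-1)[..., 2:].reshape(1000, -1), S)
# fixed mask: keep even columns regardless of magnitude
q_fixed = quality(S[:, ::2], S)
```

Per row, the global top half always dominates any other 50% selection, so q_topk ≥ q_24 holds deterministically; the gap between q_24 and q_fixed is what the dynamic selection buys.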

Efficiency of the Lottery Ticket
A lottery ticket with high quality is not necessarily efficient to execute for wall-clock time speedup. In this section, we analyze the efficiency of the three sparse patterns.
Top-K Sparsity. Zhao et al. [35] explicitly select k neighbors in each row of A based on their magnitude. However, as shown in their Table 4, the explicit sparse transformer has lower inference throughput despite k ≪ n. On one hand, the top-k operator is difficult to parallelize and introduces high overhead. On the other hand, even if an oracle top-k sparsity mask m were provided with zero overhead, it would still be difficult for the explicit Top-K sparse attention to beat its dense counterpart. We provide a theoretical upper bound for density s in Proposition 4.3. Proposition 4.3. Given embedding size d and the maximum tiling size T supported by the GPU, the speedup achieved by Top-K Sparsity under density s is bounded above (proof: Appendix A.3). With typical values d = 64 and T = 128, s < 4.5% is a necessary but insufficient condition for Speedup > 1. Notably, this is not a strict upper bound, as we did not take the overhead of identifying the top-k entries into consideration. Therefore, the strict upper bound should be even smaller.
Fixed Sparsity. As fixed sparse patterns are designed or learned before inference, they can be made GPU-friendly and use the same tiling size as the dense matrix multiplication. Therefore, we can derive the upper bound of the speedup under density s with the same strategy as in Proposition 4.3.

Dynamic 1:2 / 2:4 Sparsity. Similarly, we can derive the theoretical speedup with 1:2 and 2:4 sparsity. With the theoretical conclusions above, we compare the quality of the lottery ticket under our dynamic 1:2 and 2:4 sparsity with the other two methods under the same efficiency.
Comparison with Top-K Sparsity. The Top-K sparsity achieves the same efficiency as ours at a density s; with typical values T = 128 and d = 64, we have s < 0.02. Substituting this into Proposition 4.2 gives Q^p_topk < Q^p_{1:2} when pσ < 7. On the other hand, when pσ > 7, although the Top-K sparsity produces tickets with higher quality, Q^p_{1:2}|_{pσ=7} ≈ 0.9999996 is already very close to 1.
Comparison with Fixed Sparsity. The fixed sparsity achieves the same efficiency as ours at a density s; with typical values T = 128 and d = 64, we have s ≈ 0.63. On the other hand, we have the theoretical value σ ≈ 1 and p ≥ 1. The assumption p ≥ 1 is based on the observation that edges with higher magnitude are more influential. Therefore, we have pσ ≥ 1, and the dynamic pattern yields higher quality at this density. To conclude, compared with both top-k sparsity and fixed sparsity, our method always yields lottery tickets with higher quality under the same efficiency. To support this conclusion, we provide further empirical studies in Appendix A.4. Besides, we find both theoretically and empirically that our method is a good complement to kernel-based transformers like Performer [5]. We add more discussion in Appendix A.5.

Evaluation
In this section, we first evaluate the accuracy of our dynamic fine-grained structured sparse attention mechanism on tasks across different domains. Then, we profile our method on an NVIDIA A100 GPU under different sequence lengths from 256 to 4096 to show that we can achieve practical speedup at arbitrary sequence length. We focus on 1:2 and 2:4 sparsity, as their speedup can be directly evaluated on readily available hardware; other N:M ratios are left for future work.

Model Accuracy
To show that our method is effective in comprehensive scenarios, we first evaluate the model accuracy on tasks in different domains and sequence lengths. For models under the "bfloat16" data type, we first finetune them from the pretrained model under the "float" data type, as "float" provides more precise gradients that help convergence. After the finetuning, we directly cast all the parameters in the model to "bfloat16" and evaluate on the test dataset. For the Question Answering and Masked Language Modeling tasks, we report results averaged over 8 runs under different random seeds.
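The "finetune in float, then cast to bfloat16" recipe can be emulated in NumPy, which has no bfloat16 type, by exploiting the fact that bfloat16 is exactly the upper 16 bits of an IEEE float32. This sketch truncates the low mantissa bits (real casts round to nearest even, so the last bit may differ):

```python
import numpy as np

def to_bfloat16(x):
    # bfloat16 = sign + 8 exponent bits + 7 mantissa bits = top half of float32
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)  # truncate low 16 bits

w = np.array([0.1, -3.14159, 1024.5], dtype=np.float32)  # stand-in "parameters"
w_bf16 = to_bfloat16(w)
```

The truncation error is bounded by one bfloat16 ulp, i.e. at most 2^-7 of the value's magnitude, which is why casting after convergence costs little accuracy.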
As shown in Table 2, with finetuning, our 1:2 sparsity has only a 0.1 F1-score loss, which is smaller than the standard deviation. Our 2:4 sparsity even achieves a slight performance improvement over the dense baseline (the question-answering finetuning follows the Huggingface example at https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering). One plausible explanation is that while the 2:4 sparsity keeps most of the important edges, it also occasionally drops a small fraction of important edges, which acts like the attention dropout technique [34]. Besides, directly applying our method to the dense transformer without finetuning also achieves comparable results, which justifies that our method approximates the dense attention mechanism well.
Masked Language Modeling. We also evaluate our models on masked language modeling tasks on Wikitext-2 and Wikitext-103 under sequence length 512. Similar to the question-answering tasks, we choose "roberta-large" as the pretrained model and finetune it under the default configuration in Huggingface. The results are summarized in Table 3. The perplexities achieved by our methods are on par with the dense transformer.
Long Range Arena. For sequence lengths longer than 512, we incorporate four tasks from the Long Range Arena [26], including ListOps, Text Classification, Document Retrieval, and Image Classification under sequence lengths 2048, 2048, 4096, and 1024, respectively. We omit the Pathfinder (1K) task as we cannot replicate its results, which was also reported by Lu et al. [16]. For a fair comparison with other efficient transformers, the model is trained from scratch under the default configurations. The results are summarized in Table 4. Our method achieves comparable accuracy on all four long-sequence benchmarks.

Speedup
In this section, we demonstrate the speedup achieved by our method across different sequence lengths and compare it with existing studies. As our method focuses on the attention mechanism and is orthogonal to techniques (e.g., static pruning, quantization) that accelerate the other parts of transformer models, we only show the speedup achieved on the attention mechanism defined in Equation (1); end-to-end results are reported in Appendix A.6. Our method achieves 1.08∼1.52× end-to-end speedup and 1.41∼1.82× memory footprint reduction.
For models from previous studies, we also apply the PyTorch JIT script when possible, in case their implementations are not efficient. The configuration we use is as follows: each layer contains 4 heads, and the feature dimension per head is 64. The batch size is set large enough to keep the GPU busy. We summarize the profiling results in Figure 5. We normalize the latency to the transformer with the full attention mechanism under each configuration. We also cut the y-axis off at 2 for clarity, because some methods designed for long sequences can be more than 20× slower than the dense transformer at moderate sequence lengths.
First of all, our method achieves 1.27∼1.89× speedup over the transformer with full attention. It is the only method that brings consistent speedup across different sequence lengths, while the other methods from previous papers suffer from high overhead at moderate and short sequence lengths. Second, under the float data type, our method achieves speedup in all three stages with zero overhead. This accords with our arguments in Section 3. Under the bfloat16 data type, the QK^T in our method is slightly slower than the dense baseline. The reason is that selecting the 2 larger ones from 4 elements requires more comparisons, which results in more warp divergence.

Conclusion and Discussion
In this paper, we present Dfss, a dynamic fine-grained structured sparse attention mechanism that prunes QK^T on the fly to N:M structured sparsity. It achieves no accuracy loss compared with the full attention mechanism across tasks in various domains and sequence lengths. Besides, it only requires modifying a few lines of code, which makes it a drop-in replacement for the full attention mechanism. Moreover, powered by our customized CUDA kernels and the new sparse tensor core on Ampere GPUs, we achieve 1.27∼1.89× speedup over full attention at arbitrary sequence length. All of this evidence demonstrates that our method can be a good replacement for the full attention mechanism. Our method is also orthogonal to many existing efficient attention mechanisms and can potentially be applied jointly with them for further speedup.

A.1 Kernel Design Details
In this section, we first demonstrate how a dense matrix is pruned and compressed under the 50% structured sparsity on Ampere GPUs in Section A.1.1. Then, we detail the design and implementation of the SDDMM, Softmax, and SpMM kernels in Sections A.1.2 and A.1.3.

A.1.1 Structured Pruning of Dense Matrix
We first illustrate how to dynamically prune a dense matrix with 50% structured sparsity. Under the float data type, we select the larger one in every two consecutive elements. If the data type is bfloat16, we select the two larger ones in every four consecutive elements. We compress the pruned dense matrix into nonzeros and metadata following CUTLASS [12], as there are two benefits. First, the result can be directly used by high-performance SpMM kernels in CUTLASS. Second, as we will show in Section A.1.2, it can be dynamically generated from the SDDMM kernel with negligible overhead.
As shown in Figure 6, the basic tile size to prune is 32 × 64 bytes, which corresponds to a 32 × 32 block under bfloat16 or a 32 × 16 block under float. There are four major steps: (0) prune 50% of each consecutive 8 bytes of data and generate nonzeros and metadata; (1) interleave the metadata rows by 8; (2) swap the metadata along the sub-diagonal; (3) write metadata and nonzeros to global memory.
In detail, 2 out of every 4 2-byte data are selected based on their magnitude, and a unique 4-bit metadata value is assigned to each combination in step (0). The correspondence between selection patterns and metadata is enumerated in Figure 6 (b). Notably, with the float32 data type, each 32-bit value occupies two consecutive 2-byte slots, so only the patterns 0x4 and 0xe are possible. After generating the 4-bit metadata, every four consecutive values are concatenated into a 2-byte metadata block. Then, the rows of metadata are interleaved by 8 in step (1). In step (2), the metadata blocks at the upper right and lower left of each 2 × 2 grid are swapped. At last, in step (3), the metadata produced by step (2) is written into global memory in an interleaved column-major format with a 4-byte stride. This can be realized by interpreting two consecutive metadata blocks as one int object and writing it to DRAM in column-major order. The nonzeros are simply written to global memory in row-major order.
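The concatenation of four 4-bit metadata values into one 2-byte block can be sketched as follows; the nibble values here are arbitrary illustrations, not the exact CUTLASS codebook, and the row interleaving and sub-diagonal swap are figure-specific permutations omitted from this sketch.

```python
import numpy as np

def pack_nibbles(codes):
    # codes: (..., 4) array of 4-bit values -> one uint16 per group of four,
    # with code k landing in bits [4k, 4k+3] as described in A.1.2
    c = codes.astype(np.uint16)
    return c[..., 0] | (c[..., 1] << 4) | (c[..., 2] << 8) | (c[..., 3] << 12)

codes = np.array([[0x4, 0xE, 0x9, 0x6]], dtype=np.uint8)
packed = pack_nibbles(codes)
```

Unpacking is the reverse shift-and-mask, which is why the layout can be decoded cheaply by the SpMM consumer.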

A.1.2 SDDMM Kernel Design
Our strategy for dynamically pruning the attention score matrix has two steps: first, perform a conventional dense GEMM; second, prune the GEMM output with the procedure described in Section A.1.1. However, if the second step were implemented as a separate GPU kernel, we would need to write the dense attention score matrix to DRAM and read it back. This not only introduces high overhead but also prevents us from reducing the global memory footprint. To address this issue, we implement the pruning step as an epilogue attached to the conventional GEMM kernel: the results of the dense GEMM are stored in registers, and the epilogue processes the results and then writes nonzeros and metadata to global memory.
Dense GEMM.The GEMM step is no different from conventional GEMM kernels, and all the existing optimizations can be used.
The tiling is shown in Figure 7: each thread block processes an Mtile × Ntile output tile, which is further partitioned into several warp tiles. Each warp tile is composed of a grid of 16 × 16 blocks that matches the tensor core output size. In each thread block, all the threads jointly load Mtile × Ktile and Ktile × Ntile input tiles from matrices A and B into shared memory. We use the new asynchronous copy feature of the Ampere architecture to fully utilize the memory bandwidth and reduce register usage. To eliminate shared memory bank conflicts, we use the XOR layout. Once the load is complete, the warps fetch their source operands from shared memory with ldmatrix and perform a (16 × 32B) • (32B × 16) warp matrix multiply-accumulate (wmma) with the tensor core. Notably, float data are converted to TensorFloat-32 before the wmma. To reduce accumulation error, we accumulate the partial sums as float regardless of the source operand data type. Besides, a 2-stage software pipeline with double buffering is used to overlap memory access and computation [12]. Although a deeper software pipeline can be built on Ampere, we find 2 stages are enough, as the inner-product dimension K is usually very small (e.g., 64). A more detailed explanation of the above techniques can be found in the GTC 2020 talk [11].
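The thread-block tiling loop structure described above can be sketched in NumPy (Mtile/Ntile/Ktile as named in the text; software pipelining, shared memory, and tensor cores are omitted, and the tile sizes below are illustrative):

```python
import numpy as np

def tiled_gemm(A, B, Mtile=32, Ntile=32, Ktile=16):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % Mtile == 0 and N % Ntile == 0 and K % Ktile == 0
    C = np.zeros((M, N))
    for m in range(0, M, Mtile):
        for n in range(0, N, Ntile):
            acc = np.zeros((Mtile, Ntile))    # per-tile accumulator ("registers")
            for k in range(0, K, Ktile):      # main loop over the K dimension
                acc += A[m:m+Mtile, k:k+Ktile] @ B[k:k+Ktile, n:n+Ntile]
            C[m:m+Mtile, n:n+Ntile] = acc     # epilogue: write tile to memory
    return C

rng = np.random.default_rng(5)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
```

The point of the structure is that the accumulator tile lives in fast storage until the epilogue, which is exactly where the pruning step of the next paragraph is inserted.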
Pruning the GEMM result. In the pruning step, the warp tile is partitioned into a grid of 32 × 64B blocks that the warp processes one at a time.
Under the float data type, the register layout of the 32 × 16 block is illustrated in Figure 8 (a). It consists of two 16 × 16 wmma blocks, so each thread holds sixteen 32-bit registers for the results. The registers are annotated with "T{thread_id}{register_id}". As each pair of adjacent entries is held by the same thread, we can simply compare them and retain the larger one.
Under the bfloat16 data type, we need to select the 2 larger values from every 4 adjacent entries. However, under the naive mapping shown in Figure 9 (a), these 4 entries are held by two threads. Therefore, we would need additional warp shuffles to first gather these 4 entries into the same thread before comparing them, which introduces additional overhead. To solve this problem, we propose to interleave the columns when loading matrix B into shared memory, by simply manipulating the pointer to global memory at the beginning. The resulting register mapping is shown in Figure 9 (b), which is equivalent to Figure 8 (a) for bfloat16. After the interleaving, each group of four consecutive entries is naturally held by the same thread, and we select the 2 larger ones from them.
To reduce branch divergence, the selection is done by comparing the sums of pairs of entries.
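The reason pair sums suffice: for four values x1 ≥ x2 ≥ x3 ≥ x4, the pair {x1, x2} has the largest sum among all six pairs, so top-2 selection reduces to sum comparisons with no data-dependent branching on individual elements. The exact comparison network used in the kernel is not spelled out above, so the following is one possible scheme consistent with the text:

```python
import itertools

def top2_of4_by_pair_sums(vals):
    """Select the indices of the two largest of four entries by comparing
    the six pairwise sums: the pair with the largest sum is exactly the
    top-2 pair, so no per-element branching is needed."""
    pairs = itertools.combinations(range(4), 2)
    best = max(pairs, key=lambda p: vals[p[0]] + vals[p[1]])
    return sorted(best)
```

On a GPU, the six sums and comparisons can be evaluated unconditionally, avoiding divergent branches within the warp.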
Generate Metadata and Nonzeros. For both float and bfloat16 data types, each comparison produces a 4-bit metadata. Next, following the procedure described in Section A.1.1, we need to concatenate four consecutive metadata into a 16-bit metadata block. This is done in two steps. First, put the 4-bit metadata into the correct position of an int16 register with a bit shift. Second, share these int16 registers across threads with warp shuffle and concatenate them with bitwise OR. As four consecutive metadata are held by threads 4t to 4t+3, we put the 4-bit metadata of thread 4t+k into bits [4k : 4k+3] of the int16 object in the first step. The detailed layout is shown in Figure 9 (b), where we denote each 4-bit metadata as "T{thread_id}{register_id}[bit_id]". The result of the second step is shown in Figure 8 (c). Figure 8 (d) and (e) illustrate the results after steps 1 and 2 in Figure 6. Notably, these two steps only change the logical mapping of the metadata; the register allocation is not affected, so no instructions are required for them. At last, we need to write the metadata and nonzeros to global memory following step 3 in Figure 6. As shown in Figure 8 (e), each row is held by two consecutive int16 registers of the same thread, so we can simply reinterpret them as an int32 object and write the metadata to global memory in column-major order. For the nonzeros, we simply coalesce them in shared memory and then write them to global memory in row-major order.
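The shift-and-OR packing can be modeled in a few lines; the warp shuffle only moves the int16 values between lanes, so the arithmetic reduces to:

```python
def pack_metadata(nibbles):
    """Concatenate the 4-bit metadata of threads 4t..4t+3 into one 16-bit
    block: the nibble from thread 4t+k lands in bits [4k : 4k+3] via a bit
    shift, and the shifted values are combined with bitwise OR."""
    assert len(nibbles) == 4 and all(0 <= v < 16 for v in nibbles)
    block = 0
    for k, v in enumerate(nibbles):
        block |= v << (4 * k)  # thread 4t+k occupies bits [4k : 4k+3]
    return block
```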
Batched Kernel. The self-attention layer in transformers usually has multiple independent attention heads. Instead of launching one CUDA kernel per attention head, a batched kernel that processes all the heads better utilizes GPU resources and reduces kernel launch overhead. We support batched computation by using blockIdx.z to index the heads in the batch and updating the input and output pointers based on that index.
Blocked-ELL Sparsity. Under long sequence lengths, higher sparsity is desired to reduce computation cost and memory footprint.
Our kernel supports hybrid blocked-ELL sparsity [33] and 50% structured sparsity. To support this feature, we set the block size of blocked-ELL to the thread block tile size of the GEMM, so we can simply skip the pruned blocks during execution.
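The tile-skipping logic can be sketched as follows; the `ell_cols` layout here is a hypothetical simplification of the blocked-ELL format [33], kept only to show which thread block tiles survive:

```python
def blocked_ell_tiles(ell_cols, block):
    """For each block row, blocked-ELL stores only the column indices of the
    kept blocks; tiles outside this list are skipped outright, and the 50%
    structured sparsity is applied inside the surviving tiles. Returns the
    (row, col) origins of the tiles that are actually computed."""
    work = []
    for brow, cols in enumerate(ell_cols):
        for bcol in cols:  # pruned blocks never appear in the list
            work.append((brow * block, bcol * block))
    return work
```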

A.1.3 Softmax and SpMM Kernel
In this section, we detail the implementation of the softmax and SpMM kernels.

Softmax Kernel. To improve numerical stability, the softmax on GPU is computed as softmax(x)_i = e^{x_i − max(x)} / Σ_j e^{x_j − max(x)}. Therefore, each element of x has to be loaded three times: 1) compute c = max(x); 2) compute s = Σ_j e^{x_j − c}; 3) compute e^{x_i − c}/s. Instead of loading x_i from global memory each time, we cache it in registers when the whole row fits within the register file capacity. Besides, the ordinary softmax kernel in libraries like PyTorch can also be used.
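The three passes can be written down directly; this NumPy sketch mirrors the kernel schedule, with the row cached (here implicitly) between passes:

```python
import numpy as np

def softmax_row(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over one row in the three passes described
    above; subtracting the row max keeps every exponent non-positive."""
    c = x.max()        # pass 1: c = max(x)
    e = np.exp(x - c)  # shifted exponentials
    s = e.sum()        # pass 2: s = sum_j e^{x_j - c}
    return e / s       # pass 3: e^{x_i - c} / s
```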
SpMM Kernel. As we encode the nonzeros and metadata following CUTLASS [12], we directly construct the SpMM kernels from the CUTLASS APIs. To support the hybrid blocked-ELL and 50% structured sparsity, we modify the PredicatedTileAccessIterator class in CUTLASS to skip the tiles masked out by the blocked-ELL sparsity.
A.2 Proof of Proposition 4.2

Proof. Under the assumption that the entries of QK^T/√d follow i.i.d. N(µ, σ), we denote x_{i,j} = e^{µ+σz_{i,j}}, where z ∼ i.i.d. N(0, 1). Substituting this into the definition of the softmax and then into the definition of L_p-Quality, and letting n → ∞, the denominator can be approximated in closed form.

Top-K Sparsity. When the sequence is long enough that n → ∞, the numerator can be approximated in the same way, which yields the L_p-Quality of Top-K sparsity.

Fixed Sparsity. Without any assumption on the distribution of important edges in A, applying a fixed pattern is equivalent to uniformly sampling with probability s. This yields the L_p-Quality of fixed sparsity.

1:2 Sparsity. This pattern selects the larger of every two adjacent elements, which we denote by the independent variables Z_1 and Z_2. From the distribution of max(Z_1, Z_2) and the conclusion above, the L_p-Quality of 1:2 sparsity follows.

2:4 Sparsity. This pattern selects the largest two of every four consecutive elements, denoted by the independent variables Z_1, ..., Z_4. While it is more challenging to find an explicit expression for Q^p_{2:4}, a trivial lower bound can be derived from their independence.

A.3 Proof of Proposition 4.3

Proof. First of all, thanks to the Tensor Cores in the latest GPUs, the latency of matrix multiplication operations, both sparse and dense, is bounded by memory access. Therefore, instead of counting the number of MACs (multiply-accumulate operations), the amount of memory access is a better metric for estimating latency. Tiling is a basic optimization for matrix-matrix multiplication on GPUs. As shown in Figure 10 (A), the original n × n output is partitioned into independent blocks of size T × T. When computing each block, operands of size T × r and r × T are loaded from A and V^T into fast memory, respectively. Then,
these two operands are multiplied and accumulated into the partial sums stored in registers. After applying top-k, as shown in Figure 10 (B), the k elements in each row of A correspond to different rows of V. Therefore, we can only partition the output into independent vectors of size 1 × T. During the computation, operands of size 1 × r and r × T are loaded from A and V^T into fast memory, respectively. Then the loaded operands are multiplied and accumulated into the partial sums stored in registers.

With the tiling strategies mentioned above, we summarize the amount of memory access of the different attentions in Table 5.
For QK^T, as we need to compute all of it before getting the top-k elements, it is a dense matrix-matrix multiplication for both full and explicit top-k attention. The softmax reads in the n × n matrix QK^T, normalizes it, and writes the result A back. As the intermediate values can be stored in registers, we only need to count reading QK^T in and writing A out. Therefore, its memory access is 2n² for full attention and 2n²s for explicit top-k attention. For AV in full attention, the output size is nd. As each output element is generated from the inner product of two vectors of length n, the total data read equals nd × 2n. However, with the tiling in Figure 10 (A), each operand is reused T times. Therefore, the total memory access for AV in full attention is nd(2n/T + 1). For AV in explicit top-k attention, as shown in Figure 10 (B), each left-hand-side element is reused T times while each right-hand-side element is used only once. Therefore, its memory access equals nd(sn/T + sn + 1). The theoretical speedup can be computed from these counts.

A.4 Quality of the Lottery Tickets under the same Efficiency

In this section, we provide more empirical evidence to support our conclusions. We first compare the theoretical speedup of the different sparsities predicted in Equations (4), (5), and (6) with the actual speedup measured on an A100 GPU in Figure 11.
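These counts can be evaluated numerically. The helper below covers only the softmax and AV terms derived above (the dense QK^T traffic is common to both variants and omitted); it is a model for intuition, not a performance predictor:

```python
def av_softmax_memory(n, d, T, s=1.0, topk=False):
    """Memory-access model from the analysis above: softmax costs 2*n*n
    (scaled by density s for explicit top-k); AV costs n*d*(2n/T + 1) when
    dense, and n*d*(s*n/T + s*n + 1) for explicit top-k, where only the
    left-hand operand keeps its T-fold reuse."""
    softmax = 2.0 * n * n * (s if topk else 1.0)
    if topk:
        av = n * d * (s * n / T + s * n + 1)
    else:
        av = n * d * (2.0 * n / T + 1)
    return softmax + av
```

For n = 1024, d = 64, T = 128, explicit top-k at s = 0.05 already moves more data in AV than dense attention does, because the right-hand operand loses its T-fold reuse.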
First of all, the Top-K sparsity is well bounded by its theoretical value, and our method achieves better speedup than Top-K sparsity when the density s > 0.02. This is because gathering the top-k elements in each row of the attention weight matrix and sorting them into compressed row format introduces huge overhead.
Second, the speedup achieved by fixed sparsity is well predicted by our theoretical value. The speedup it achieves is lower than ours when the density s ≥ 0.63, which accords with our theoretical conclusion. Notably, the fixed sparsity we use here simply truncates the number of columns of the attention weight matrix based on the density. The actual speedup will be even lower when a more fine-grained pattern is involved.
Our method delivers speedup slightly higher than the theoretical value. This is because the softmax kernel has different implementations under different sequence lengths. When the sequence length is moderate, as mentioned in Appendix A.3, the data loaded from the attention score matrix can be explicitly cached in fast memory like registers or shared memory for reuse. When the sequence is too long for the fast memory to cache, the data have to be implicitly reused through lower-level caches or even global memory. The second implementation is slower than the first, as lower-level caches have longer access latency and lower throughput. As our method halves the effective sequence length, it can use the implementation for moderate sequence lengths while full attention is handled by the long-sequence version.
In Figure 12, we compute the theoretical value (solid line) and empirical value (box plot) of Q_p over the attention matrix A in BERT-large on SQuAD v1.1. As p is a task-dependent value that is hard to obtain, we instead sweep through several typical values. Compared with top-k sparsity, when p < 7, our 1:2 and 2:4 sparsity always achieve better quality than top-k sparsity when s < 0.05. Besides, even when p = 7, Q^p_{1:2} and Q^p_{2:4} are very close to 1. These observations accord with our conclusion that our 1:2 and 2:4 sparsity obtain tickets with better quality than Top-K sparsity at the same efficiency.
Compared with fixed sparsity, our Q^p_{1:2} and Q^p_{2:4} are also similar to or better than Q^p_{fix} across different values of p. This supports our conclusion that our method achieves better performance than fixed sparsity patterns at the same efficiency.
To show that our Q_p is a good metric for comparing the performance of different sparse patterns, we plot Q_p against the F1 score on BERT-large SQuAD v1.1 in Figure 13. As mentioned before, p is a task-specific value used to model tasks with different degrees of dependency on the largest few elements. To identify the p for our target task, we tune the value of p until the data points from Top-K sparsity and Fixed sparsity form a monotonically increasing line. We found that p = 6.5 is a good choice. This large p accords with our observation that Top-K sparsity works well even under 5.4% density. After anchoring p, we put the data points from 1:2 and 2:4 sparsity into the plot and verify whether the line is still monotonically increasing. Figure 13 shows that the data points from our 1:2/2:4 sparsity fit perfectly on the monotonically increasing line. In contrast, the traditional F-norm based metric cannot explain why 1:2 sparsity has a better F1 score than some Fixed Sparsity patterns even though it has a lower score under that metric. This demonstrates that our Q_p is a better metric than existing metrics.
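An empirical sanity check of this behavior can be scripted. Definition 4.1 is not restated in full in this appendix, so the sketch below assumes the natural reading of an L_p-quality, the ratio of p-norms of the kept entries versus all entries; the function names are illustrative:

```python
import numpy as np

def lp_quality(a_row: np.ndarray, mask: np.ndarray, p: float) -> float:
    """Illustrative empirical L_p-quality of a sparsity mask on one
    attention row, assumed here to be the ratio of p-norms of kept
    entries to all entries; larger p emphasizes the largest edges."""
    kept = np.abs(a_row[mask]) ** p
    total = np.abs(a_row) ** p
    return (kept.sum() / total.sum()) ** (1.0 / p)

def mask_2to4(a_row: np.ndarray) -> np.ndarray:
    """2:4 mask: keep the two largest of every four consecutive entries
    (assumes distinct values; ties would need a deterministic rule)."""
    g = a_row.reshape(-1, 4)
    thresh = np.sort(g, axis=1)[:, 1:2]  # 2nd smallest per group
    return (g > thresh).reshape(-1)
```

Because 2:4 always retains each group's largest entries, the quality climbs toward 1 as p grows, matching the trend discussed above.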

A.5 Comparison with Performer
In this section, we add more discussion on how our method compares with kernel-based transformers, i.e., the Performer (Choromanski et al. [5]). As our Definition 4.1 is designed to characterize how well a sparse pattern preserves the important edges in A, it is not suitable for kernel-based attention mechanisms that do not involve sparsity. For example, an approximation of A with a high positive approximation error can have Q_p ≥ 1 under Definition 4.1. Therefore, we instead compare the mean squared error (MSE) following Choromanski et al. [5]. Given a query and two adjacent key vectors q, k, and k′ ∼ N(0, I_d), we denote the softmax kernel between them as SM(q, k) = exp(q^T k/√d), and the softmax approximated by our dynamic 1:2 sparsity as SM_{1:2}(q, k). We can then compute its MSE. Because q^T k = Σ_{i=1}^d q_i k_i is a weighted sum of i.i.d. variables following N(0, 1), we have x = q^T k ∼ N(0, ‖q‖²₂), which we substitute into Equation (29). With Lemma 2 and Theorem 2 in Choromanski et al. [5], the MSE of their positive softmax kernel with orthogonal random features has an upper bound. First, when SM(q, k) → 0, both MSE(SM_{1:2}(q, k)) and MSE(SM^{ort+}_m(q, k)) converge to 0. However, for large SM(q, k), which is potentially critical for model accuracy, the exp(√(2 ln SM(q, k))) term in our method reduces the MSE. To conclude, while both the positive softmax kernel and ours have low MSE when approximating small edge weights, our method better approximates the edges with high magnitude.
From the empirical perspective, as shown in Tables 2 and 3, our method achieves good accuracy even without finetuning, whereas the Performer still requires tens of thousands of finetuning steps (e.g., Figure 5 in Choromanski et al. [5]). Table 4 also reveals that the Performer has poor accuracy on certain tasks like byte-level document retrieval, while ours consistently achieves accuracy on par with the dense transformer. All these observations suggest that our method better approximates the full attention mechanism than the Performer.
In terms of wall-clock speedup, Figure 5 illustrates that the Performer only achieves good speedup at long sequence lengths. A similar phenomenon is also observed in multiple online forums. Certainly, the PyTorch JIT script does not yield the optimal implementation of the computation graph, but it reveals that tremendous engineering effort is required for the Performer to achieve good speedup under moderate sequence lengths. Following Section 4.3, we also compare the theoretical speedup achieved by our method and the Performer.

The computation steps of the Performer are

T^(9) (m × d) = φ(K)^T (m × n) × V (n × d),   T^(10) (n × d) = φ(Q) (n × m) × T^(9) (m × d), normalized row-wise by T^(8) (n × 1),   (32)

where each equation denotes a sub computation graph that can potentially be fused. Notably, this is more complex than the original mathematical expression in order to handle the numerical instability of exp. The total memory access can be computed from these steps (Equation (33)). Following Theorem 4 in Choromanski et al. [5], we have m = d ln(d). Substituting m = 266, d = 64, and T = 128 into Equation (33), the Performer achieves Speedup > 1 only when n > 672, and it reaches the same speedup as ours only when n > 1002.
To conclude, our method is a good complement to the Performer. With a delicately optimized computation graph, the Performer can achieve good speedup and relatively good accuracy in long-sequence scenarios. In contrast, our method has better speedup and accuracy under moderate and short sequence lengths. Besides, our method delivers lower approximation error on important edges, so it is more friendly to finetuning.

A.6 End-to-End Speedup and Memory Footprint Reduction
In this section, we present the end-to-end speedup achieved by our method under different configurations. We use the 4-layer dense transformer model of the Text Classification task in the Long Range Arena [26]. The dimension of each head is 64. We explore different combinations of the number of heads (4, 8), sequence length (512, 1024, 2048, 4096), and hidden dimension of the feed-forward layer (256, 512, 1024). The end-to-end speedups over the dense transformer under different configurations are plotted in Figure 14.
Our method achieves 1.11∼1.52× and 1.08∼1.47× end-to-end speedups over the dense transformer, and it is the only method that delivers end-to-end speedup under all configurations. Under sequence length ≤ 2048, our method achieves higher speedup than most of the baselines. Although the Sinkhorn transformer [24] has higher speedup than ours at sequence length 2048, as shown in Table 4, its accuracy is less satisfying. This result justifies that our method delivers good speedup under short and moderate sequence lengths. Notably, this speedup is almost a free lunch. On one hand, Section 5 demonstrates that our method achieves comparable accuracy across different tasks and sequence lengths, so model accuracy is not sacrificed. On the other hand, unlike previous efficient transformers, our method has no hyper-parameters and only requires a lightweight finetuning process.
To study how our method contributes to the end-to-end speedup, we further break down the end-to-end inference time into the attention mechanism and other components under bfloat16. The results are illustrated in Figure 15. Under moderate and short sequence lengths like 1024 and 512, "Others" contributes over 70% of the total latency. This is because the sizes of the matrix multiplications in the feed-forward network and the query/key/value projections are comparable with the attention mechanism.
However, unlike the attention mechanism, which has a limited time budget for compression, the feed-forward network and query/key/value projections use static weight matrices during inference, so they can be compressed offline. The MobileBERT [23], on the other hand, redesigns the network architecture and reduces the hidden dimension of the feed-forward network in BERT from 4096 to 512. In terms of quantization, previous work [32] has shown that the linear layers can be quantized to 8-bit integers.
Besides the linear layers, there are also techniques to accelerate the other components in transformers. For instance, MobileBERT [23] replaces layer normalization with a simple element-wise linear transformation. The input embedding table is also compressed with a smaller embedding dimension along with a 1D convolution.
With all these techniques in the literature, it should not be hard to achieve 2× speedup in the non-attention part of transformer models.Then our method could deliver 1.13 ∼ 1.41× speedup under sequence length ≤ 1024.
We also measure the peak memory allocation of different models and configurations; the results are summarized in Figure 16. Our method achieves 1.41∼1.82× memory reduction, which is comparable with or better than most existing efficient transformers when the sequence length ≤ 1024.

A.7 Combination with the Existing Efficient Transformers
Existing efficient transformers usually sparsify the full attention mechanism into densely connected clusters [24,22,13,33] or approximate it with low-rank projections [29]. As our method is a good approximation of the full attention mechanism and brings wall-time speedup at arbitrary sequence lengths, it can potentially be combined with existing efficient transformers. We observe that the computation circled in Figure 17 is identical to the standard attention mechanism, so it can be further accelerated with our method. More importantly, the two matrix multiplications involved are two of the three largest m × n matrices, so reducing their complexity is very beneficial.
We report the accuracy on Image (1K) of LRA [26] in Table 6. The matrix multiplication complexity of the standard Nystromformer is O(nm² + mnd_v + m³ + nmd_v). After applying our method, it can be reduced to O(nm²/2 + nmd_v/2 + m³ + nmd_v). The memory footprint can be reduced from O(md_q + nm + m² + nm + nd_v) to O(md_q + nm + m² + nd_v). Given n ≫ m > d_v ≈ d_p, this could be a significant improvement that allows us to use more landmarks m to better approximate the full attention mechanism. Besides the Nystromformer, we also illustrate two possible combinations with BigBird [33] and Linformer [29] that can be explored in future work.
As shown in Figure 18 (A), Zaheer et al. [33] use block sparsity with block size 64 and compute full attention within each block. We can apply 1:2 or 2:4 sparsity within each block for further speedup.
Figure 18 (B) gives another example of how to combine our method with Linformer [29]. Linformer applies a low-rank approximation to the attention mechanism, where E, F ∈ R^{n×k} are linear projection matrices and k ≪ n. We can first prune E and F along with the other weight matrices to 1:2 or 2:4 sparsity offline, following Mishra et al. [17]. Then we compute EK and FV with sparse matrix-matrix multiplication. Next, we multiply Q and (EK)^T and prune the result to 50% structured fine-grained sparsity on the fly. After applying softmax to the nonzeros, we multiply the result with FV.
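A dense NumPy emulation of this pipeline is sketched below, assuming k is a multiple of 4. E and F stay dense here (their offline 2:4 pruning and the sparse GEMMs are emulated with ordinary matmuls); only the on-the-fly 50% pruning of Q(EK)^T is modeled explicitly:

```python
import numpy as np

def linformer_dfss(Q, K, V, E, F):
    """Illustrative emulation of the Linformer + Dfss pipeline described
    above: compute EK and FV, prune the (n, k) score matrix Q(EK)^T to 2:4
    sparsity on the fly, apply softmax over the kept entries, multiply by
    FV. Shapes: Q (n, d), K/V (n, d), E/F (k, n) with k % 4 == 0."""
    n, d = Q.shape
    EK, FV = E @ K, F @ V                  # (k, d) and (k, d_v)
    S = Q @ EK.T / np.sqrt(d)              # (n, k) attention scores
    g = S.reshape(n, -1, 4)                # 2:4 pruning per group of 4
    keep = np.argsort(g, axis=-1)[..., 2:]
    mask = np.zeros_like(g, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    S = np.where(mask.reshape(n, -1), S, -np.inf)  # pruned -> 0 after softmax
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ FV
```

In the actual proposal, EK and FV would be 2:4 SpMMs and the masked softmax would operate only on the compressed nonzeros.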

A.8 Visualize Attention Distribution
To illustrate that our Dfss can well capture the fine-grained sparsity in attention, we visualize the attention weight matrices of BERT-large on SQuAD v1.1 in Figure 19. In detail, we run inference on the same input sample with BERT-large models trained under dense, 1:2, and 2:4 settings, then collect the attention weight matrices in the first layer. The patterns in the dense transformer and our Dfss are clearly similar. The magnitudes of the nonzero values in Dfss are slightly higher than in dense attention. This is because the softmax normalizes each row by the exponential sum of its entries; after removing the 50% smaller entries, the magnitudes of the remaining entries become relatively higher. Nevertheless, we find that this does not influence model accuracy, as the subsequent normalization layers take care of it.

Figure 3: Example of using Dfss. "dspattn" is the package we developed.

Definition 4.1 (L_p-Quality of Attention Lottery Ticket). The quality of the attention lottery ticket G_s = {m ∘ A, X} under density s

Figure 5: Latency breakdown of different attention mechanisms. For each configuration, we normalize the latency to the Transformer with full attention and cut off the axis at 2 for clarity.

Figure 8: Mapping between the registers and the data and metadata.

Figure 9: Interleaving the columns of matrix B to reduce cross-lane data sharing during pruning for bfloat16.

Figure 11: Theoretical and actual speedup achieved by different sparse patterns on an A100 GPU.

Figure 12: Q_p under different density s and sparsity strategies. Box plot: empirical results from BERT-large on SQuAD v1.1; solid line: theoretical results from Proposition 4.2.

Figure 13: Q_p and F1 score of different sparse patterns on BERT-large SQuAD v1.1.
… (q, k) term in the positive softmax kernel in the Performer could greatly increase the MSE. Oppositely, the 1 − erf(√d ‖q‖₂) …

Figure 16: Peak memory allocation normalized to the dense transformer under different configurations.

Table 3: Perplexity on RoBERTa-large (CI = 95%).

Table 4: Accuracy of different transformer models on the LRA benchmark. We follow the training instructions from Tay et al. [26] and reuse the results from that paper.

Table 5: Amount of memory access of different operations in attention. s = k/n: density of the sparse attention; T: tiling size.
Many methods have been in the literature to do that, even before transformers were proposed. For instance, Mishra et al. [17] and Zhou et al. [36] show that pruning the weights to 2:4 sparsity can deliver 1.3∼1.6× speedup and 2× fewer parameters in the feed-forward and projection layers without accuracy loss on BERT-large. Lagunas et al. [14] apply structured pruning and achieve 2.4× speedup on SQuAD v1.1 with a 1% drop in F1.

Figure 14: End-to-end inference speedup of different efficient transformers over the dense transformer.

Figure 15: End-to-end inference latency breakdown under bfloat16.

We first demonstrate the combination of our method with Xiong et al. [31], who propose a Nystrom-based self-attention mechanism that approximates standard self-attention with O(n) complexity. The Nystromformer is illustrated in Figure 17.

Table 6: Accuracy on Image (1K) of LRA [26] under the combination of Dfss and Nystromformer [31]. We first pretrain a standard Nystromformer from scratch for 35,000 iterations following Xiong et al. [31]. Then, we finetune it for 3,500 iterations (1/10 of the training process) under standard Nystromformer, Nystromformer + Dfss 1:2, and Nystromformer + Dfss 2:4. By combining Dfss and Nystromformer, we achieve higher accuracy on LRA with lightweight finetuning. We then provide a complexity analysis of the combination following Xiong et al. [31]: the landmark selection with segment-means takes O(n), and the iterative approximation of the pseudoinverse takes O(m³).