Inference Optimization of Foundation Models on AI Accelerators

Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI across various industries. Industry and the research community have witnessed a large number of new applications based on those foundation models, including question answering, customer service, image and video generation, and code completion, among others. However, as the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios. As a result, the demand for cost-effective and fast inference using AI accelerators is higher than ever. To this end, our tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators. Beginning with an overview of basic Transformer architectures and deep learning system frameworks, we deep dive into system optimization techniques for fast and memory-efficient attention computations and discuss how they can be implemented efficiently on AI accelerators. Next, we describe architectural elements that are key for fast Transformer inference. Finally, we examine various model compression and fast decoding strategies in the same context.


OVERVIEW
The substantial size of modern large language models (LLMs), such as Llama-2/3 70B [109], Claude 3 Opus 137B [8], and Grok-1 314B [118], presents significant challenges in both the training and inference phases. Training LLMs, in particular, demands considerable resources and has been the subject of extensive research. In contrast, inference consumes fewer computational resources but occurs much more frequently once the model has been trained. This phase is crucial as it encompasses various applications where the value of LLMs is realized, including text translation, sentiment detection, code generation, text summarization, and question answering.
Customers naturally demand faster and more cost-effective inference. To meet user demands, it is essential to reduce latency, the time required to complete a generation, and to increase throughput, the number of requests processed per unit of time. The latency and throughput of LLMs depend on multiple factors, such as the hardware utilized, the capability of software frameworks to optimally leverage the available hardware, and the model architecture itself. Therefore, efforts to improve speed and costs benefit from optimizations across all these dimensions. To this end, this section provides an overview of the characteristics of LLM inference, along with the corresponding systems and hardware requirements.

LLM Inference
Transformer models have revolutionized the landscape of LLMs by introducing a highly effective architecture for natural language processing tasks, as shown in Figure 1 [112]. These models, characterized by their attention mechanisms, have significantly enhanced the capacity of models to understand and generate human-like text. Their versatility and scalability in training have established them as the backbone of many state-of-the-art LLMs today. Transformer models can include an encoding component only (e.g., BERT [27]), a decoding component only (e.g., GPT [13], Llama-2 [109], Claude 3 [8], Grok-1 [118], Mistral 7B [49]), or both (e.g., BART [65]). Currently, modern LLMs predominantly employ a decoder-only architecture, generating output sequences by predicting one token at a time, conditioned on the input sequence and previously generated tokens, a process known as auto-regression. Consequently, our discussion primarily focuses on decoder-only Transformer models.

Figure 1: The Transformer architecture [112], comprising an encoder (left) and a decoder (right). Tokens are initially encoded into an embedding space, and a positional encoding is used to encode information about the token positions. Modern LLM architectures are decoder-only, with a backbone built of repeated layers containing masked attention and a feed-forward neural network (FFN). The masked attention first applies linear transformations on a sequence of embeddings to obtain query (Q), key (K), and value (V) matrices and computes Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, thus relating the tokens to each other (the mask enforces that tokens can only attend to their predecessors). The FFN is applied to each token independently. Both attention and FFN add their outputs onto the embedding, which is passed through the skip connections.
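The masked attention computation described above can be sketched as follows. This is a minimal NumPy illustration of softmax(QK^T / sqrt(d_k)) V with a causal mask for a single head, not an optimized implementation; the function names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Causal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (s, s) token-to-token scores
    mask = np.triu(np.ones_like(scores, dtype=bool), 1) # True above the diagonal
    scores = np.where(mask, -np.inf, scores)            # tokens attend only to predecessors
    return softmax(scores) @ V

s, d = 4, 8                                             # sequence length, head size
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((s, d)) for _ in range(3))
out = masked_attention(Q, K, V)
```

Because of the causal mask, the first token can only attend to itself, so its output equals its own value vector.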
As the number of model parameters increases, the decoding phase of LLM inference is inherently memory-bound due to its low arithmetic intensity, meaning that loading and moving the model weights into on-chip memory takes significantly more time than the actual computations. This challenge becomes particularly acute with small batch sizes. LLMs have a large memory footprint, primarily due to the pre-trained model weights and the intermediate states required for next-token generation, such as the key-value cache.

Computational and Memory Requirements
Modern computer chips employ specialized tensor units to efficiently perform tensor computations, such as matrix multiplication, which are fundamental in large foundation model workloads. Examples of these units include Nvidia TensorCore [86], AMD MatrixCore [4], and the systolic arrays found in Google TPU [50, 52] and AWS Trainium [14]. These tensor units are designed to process high-performance tensor computations such as matrix multiplication to meet the extensive demands of LLM workloads, especially during the training phase.
Inference tasks, however, present a distinct challenge, as powerful tensor units alone are insufficient for optimal performance. To address the memory-bound nature of the decoding process, modern chips incorporate fast on-chip memory in the form of Static Random Access Memory (SRAM). SRAM offers low latency and high throughput, suitable for the substantial memory-access requirements of inference workloads. However, the high cost of SRAM limits its capacity, requiring careful data manipulation to optimize its usage.
High-performance kernels. Inference-purposed kernels, such as DeepSpeed-Inference [6], FasterTransformer [84], and transformers-neuronx [11], adhere to these guidelines to efficiently process the workloads. They can be designed by experienced performance-tuning experts or generated by machine learning compilers. In either case, a deep understanding of both chip architecture and inference workloads is essential for efficiently mapping and scheduling computations onto the hardware. By leveraging this knowledge, these kernels can fully optimize the utilization of high-bandwidth memory and tensor units, ultimately enhancing the efficiency of inference workloads on modern computer chips.
Hardware Accelerators. While the majority of LLM workloads are now run on GPUs following the SIMT (single instruction, multiple threads) paradigm, LLM inference can also be accelerated with systolic-array and High Bandwidth Memory (HBM) based systems (e.g., Google TPUs [50, 52], AWS Trainium/Inferentia [14], and Intel Gaudi [41]) with correspondingly lower power consumption and cost. Systolic-array based systems can accelerate matrix multiplication with instruction-level parallelism [51]. To accelerate memory access for large amounts of data, HBM is used as a replacement for Double Data Rate (DDR) memory, and careful memory planning is required as the capacity of HBM is limited compared to the model size [135]. There are also systems that utilize FPGAs [66] for compute acceleration, and systems that utilize inter-node connectivity [137] for large-scale Transformer inference.
Techniques to Mitigate the Memory Bound. In addition, to mitigate the memory-bound issues in LLM inference, practitioners employ various techniques that can be broadly categorized into two main approaches. First, semantic-preserving methods aim to reduce memory usage while maintaining the original prediction via system optimization (Section 2). Examples include KV caches [90], FlashAttention [24], and FlashDecoding [91]. Conversely, architectural/algorithmic optimizations usually trade off some prediction accuracy for improved memory efficiency and inference speed (Section 3 and Section 4). These include grouped-query attention (GQA) [2] and Mixture of Experts (MoE) [104] architectures, as well as the compression methods of quantization, pruning, and distillation, and speculative decoding [18].

Distributed Solution Frameworks
The memory-bound nature of LLM inference and the limited capacity of HBM on individual accelerators present significant challenges in meeting the growing demands of LLM workloads. LLMs with hundreds of billions of parameters typically do not fit on a single node for inference, let alone a single accelerator. Consequently, a distributed solution becomes necessary. However, implementing such a solution for LLM inference introduces challenges like efficient model partitioning, communication, and load balancing. Addressing these challenges is crucial for enabling scalable processing of large-scale LLM inference workloads. Typically, we can employ a combination of multiple parallel strategies to achieve state-of-the-art performance for LLM inference, each with its own advantages and disadvantages.
Tensor parallelism is designed to distribute large chunks of tensor computation workloads across multiple accelerators and aggregate the final results via collective communication. This approach can help reduce end-to-end latency when collective communication is efficient (e.g., NVIDIA NVLink [30], AWS Neuron Collective Communication [10]). However, if the tensor computation workload is small, the extra overhead of collective communication can diminish overall performance. Since the cost of inter-node communication is typically higher than that of intra-node communication, tensor parallelism is most effectively utilized within a single node.
Pipeline parallelism is employed to distribute model layers across accelerators. As both the model weights and the KV cache for each layer can be distributed to different accelerators, and only the inputs/outputs of the layers need to be transferred across devices, pipeline parallelism is relatively independent of the collective communication bandwidth. This strategy allows for the distribution of models that are too large for a single node. To increase hardware utilization, overlapping different pipeline stages is typically necessary. Pipeline parallelism is preferable over tensor parallelism when the entire model does not fit on a single node for inference.
Sequence parallelism [68] is a critical technique for supporting long context. The core concept of sequence parallelism involves distributing sequences along the sequence dimension, enabling the parallel decoding of small batches of long sequences. This technique is implemented by solutions such as FlashDecoding [91] and PagedAttention V2 [58].
Expert parallelism (EP) facilitates the distribution of Mixture of Experts (MoE) models [104] across multiple accelerators. The MoE model architecture is designed to skip inactive expert computation, while still maintaining the capability to achieve high accuracy compared to dense models. Since expert weights are typically large, distributing and dynamically loading these weights can be costly. To reduce collective communication and avoid the dynamic loading of expert weights, EP keeps each expert within a small group of accelerators [92]. As the input/output data is considerably smaller than the expert weights, all-to-all collective communication can be efficiently used to distribute tokens to the activated experts.

SYSTEM OPTIMIZATION
This section explores semantic-preserving optimizations for LLM inference from a systems perspective. By strategically organizing computations, significant improvements in inference speed and memory efficiency can be achieved without compromising the semantic integrity of the model. In particular, we discuss reducing redundant computations through the use of key-value caches (Section 2.1), optimizing the attention implementation to minimize memory access (Section 2.2), enhancing throughput by handling batches of requests (Section 2.3), and reducing unused memory fragmentation by distributing sequences (Section 2.4). These optimizations were mainly developed for GPUs, but the main concepts are largely applicable to other AI accelerators with some implementation-specific tweaks. The following subsections delve into each of these approaches in detail, examining their theoretical foundations, practical implementations, and the challenges therein.

Fast Attention Computation via Caching
Generating tokens in an autoregressive fashion, as in GPT [13] and Llama [109], is a widely adopted approach, yet it can pose computational challenges. During auto-regressive generation, each decoding step that produces the next token must attend to all previous tokens, which requires computing their hidden key and value representations in the attention mechanism; without caching, these computations are repeated throughout the sequence of token generation. The KV cache [90] stores and reuses these past key-value pairs, eliminating the need for recalculation at every new token. This technique significantly improves the efficiency of inference by reducing the per-step attention computation from quadratic to linear in the sequence length.
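The caching idea can be illustrated with a toy decode loop: each step appends the new token's key/value to the cache and attends over everything cached so far, so per-step cost is linear in the tokens generated. This is a schematic sketch with names of our own choosing, not a production kernel.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def decode_step(q, k_new, v_new, kv_cache):
    """One auto-regressive step: append this token's K/V, attend over the cache."""
    kv_cache["k"].append(k_new)           # reuse past keys/values instead of recomputing
    kv_cache["v"].append(v_new)
    K = np.stack(kv_cache["k"])           # (t, d): keys of all tokens seen so far
    V = np.stack(kv_cache["v"])
    scores = K @ q / np.sqrt(len(q))      # attention over cached positions: O(t) per step
    return softmax(scores) @ V

rng = np.random.default_rng(0)
cache = {"k": [], "v": []}
d = 8
for t in range(5):                        # five decoding steps share one growing cache
    q, k, v = (rng.standard_normal(d) for _ in range(3))
    out = decode_step(q, k, v, cache)
```

In a real system the cache holds keys/values per layer and per head; here a single head suffices to show the reuse.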
However, the memory footprint of the KV cache grows linearly with the sequence length and can be substantial, as additional memory is needed to store the cached keys and values. To address this, several techniques have been introduced to reduce the memory space required for the KV cache. Low-bit precision data types have been utilized in KVQuant [45], which brings million-scale context length support on a single A100-80G GPU. StreamingLLM [123] introduced the concept of the attention sink, which preserves decent accuracy by leveraging initial tokens without exhausting the long context window. Generalized block-sparse attention patterns (e.g., BigBird [128]) allow training with long context support without degrading accuracy at the inference stage. Heavy-Hitter Oracle [134] is a cache eviction policy that retains Heavy Hitter tokens, i.e., tokens contributing most of the value in attention scores, based on local statistics at each decoding step. However, all of these techniques can lead to a potential degradation of accuracy.
The aforementioned KV cache strategies can be implemented differently depending on hardware. To be specific, the KV cache memory size can be formulated as 2·b·s·h·d·L·e bytes, where b is the batch size, s is the sequence length, h is the number of KV heads, d is the size of each attention head, L is the number of layers, and e is the size of each data element in bytes (the factor 2 accounts for both keys and values). The sizes of b and s are determined at runtime for batch inference; d and L are fixed by the model configuration. This leaves the optimization space for reducing KV cache memory limited to e, s, and h. KV cache quantization helps reduce e. Block-sparse attention techniques help minimize s and h. With all of this considered, the distributed strategy for KV cache memory can differ between GPUs and systolic-array-based accelerators (e.g., TPU, Trainium) due to different memory constraints and numbers of devices per node, especially for handling GQA models (Section 3.1).
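Plugging numbers into the formula above shows why the cache dominates memory at scale. The configuration below (8 KV heads, head size 128, 80 layers) is an approximate GQA setup in the spirit of Llama-2 70B, used here only for illustration.

```python
def kv_cache_bytes(b, s, h, d, L, e):
    """KV cache size = 2*b*s*h*d*L*e bytes; the factor 2 covers keys and values."""
    return 2 * b * s * h * d * L * e

# Approximate 70B-class GQA config: 8 KV heads, head size 128, 80 layers, FP16 (e=2)
gib = kv_cache_bytes(b=16, s=4096, h=8, d=128, L=80, e=2) / 2**30
print(f"{gib:.1f} GiB")  # → 20.0 GiB
```

At batch size 16 and 4K context, the cache alone takes 20 GiB of accelerator memory; halving e via FP8, or h via fewer KV heads, halves this footprint.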
PagedAttention [59] can be considered a KV cache optimization. It stores the KV cache in non-contiguous memory, allocated in fixed-size blocks; each sequence can occupy a variable number of KV cache blocks. SGLang [136] further relaxes the fixed block size to variable lengths, with RadixAttention enabling automatic KV cache reuse. Both PagedAttention and RadixAttention make it possible to cache a prefix shared among multiple sequences without duplicating it.

Efficient Attention Computation
Figure 2: FlashAttention by Dao et al. [24]. The outer loop iterates over K and V blocks and loads them into fast SRAM. For each block, the inner loop iterates over Q blocks, loading them into SRAM and writing the attention output back to HBM.
Modern LLMs have extended context length support within a few years from less than 1K tokens (e.g., GPT-2 [13]) to 200K+ (e.g., Claude 3 [8]). The main challenge of expanding the context window lies in the extensive computational requirements and memory consumption of the attention computation. As the model considers more tokens simultaneously, the compute/time complexity and memory demands increase significantly, scaling quadratically with the size of the context window. FlashAttention [23, 24] was introduced to address these challenges; it reformulates the attention computation as a sequence of matrix multiplications and applies a blockwise decomposition. By processing attention in smaller blocks, FlashAttention reduces the memory footprint of the attention computation, avoiding the need to materialize the entire attention matrix in memory at once. The key advantage of FlashAttention is its ability to minimize data movement between different memory hierarchies. By carefully selecting the block size based on the memory hierarchy and capacity of the device, FlashAttention ensures that the data can be efficiently processed without requiring multiple transfers between memory levels. For example, on GPUs the block size is typically small so that blocks fit within the L2 cache, minimizing expensive memory accesses. In contrast, devices like AWS Trainium or Google TPU, which have large scratchpad memories in the tens of megabytes (MBs), can leverage larger block sizes to maximize computational efficiency by processing more data in parallel.
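The numerical core that makes blockwise processing possible is the online softmax: a running maximum and normalizer let each K/V block be consumed once, rescaling earlier partial results, without ever storing the full score row. The sketch below shows this for a single query row; it is a simplified illustration of the idea, not the tiled kernel itself, and the function name is ours.

```python
import numpy as np

def flash_attention_row(q, K, V, block=2):
    """Process K/V in blocks with a running max/normalizer (online softmax),
    so the full attention-score row is never materialized."""
    m, l = -np.inf, 0.0                      # running max and softmax normalizer
    acc = np.zeros(V.shape[-1])
    d_k = len(q)
    for i in range(0, len(K), block):
        s = K[i:i + block] @ q / np.sqrt(d_k)  # scores for this block only
        m_new = max(m, s.max())
        p = np.exp(s - m_new)
        scale = np.exp(m - m_new)            # rescale previous partial results
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
K, V = rng.standard_normal((6, 4)), rng.standard_normal((6, 4))
q = rng.standard_normal(4)

# reference: naive softmax attention over the full row
s = K @ q / np.sqrt(4)
p = np.exp(s - s.max())
ref = (p / p.sum()) @ V
```

The blockwise result matches the naive computation exactly (up to floating-point rounding), which is what makes the optimization semantic-preserving.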
For large contexts, the Blockwise Parallel Transformer (BPT) [72] further minimizes memory consumption in the feedforward network by computing it in a block-wise manner. Enhancing BPT, Ring Attention [73] utilizes blockwise computation for the self-attention and feedforward processes to distribute extended sequences across multiple devices by dividing the input into smaller, more manageable blocks. These blocks are processed on separate devices organized in a ring-like configuration, enabling parallel processing.
Inference, compared with training, tends to use relatively smaller batch sizes, which can lead to different bottlenecks. Flash-Decoding [91], based on FlashAttention, introduces a new parallelization dimension: the key/value sequence length. It stores minimal extra data in global memory while fully utilizing the accelerator, even with small batch sizes, provided the context length is sufficiently large. For the smaller chunks of split keys/values, it computes the attention of the query with each chunk in parallel using FlashAttention, and then reduces across all chunks to calculate the final output.
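The chunk-then-reduce step can be sketched as follows: each chunk produces a partial attention output together with its log-sum-exp, and the partials are merged with softmax weights over those log-sum-exps. This is our own simplified sequential illustration (real implementations run the chunks in parallel on the accelerator); all names are ours.

```python
import numpy as np

def chunk_attn(q, K, V):
    """Softmax attention restricted to one K/V chunk, plus its log-sum-exp."""
    s = K @ q / np.sqrt(len(q))
    m = s.max()
    p = np.exp(s - m)
    return (p / p.sum()) @ V, m + np.log(p.sum())

def flash_decoding(q, K, V, n_chunks=3):
    """Split keys/values along the sequence, attend per chunk, then merge:
    exp(lse_c - logsumexp(lse)) recovers each chunk's share of the global softmax."""
    outs, lses = zip(*(chunk_attn(q, Kc, Vc)
                       for Kc, Vc in zip(np.array_split(K, n_chunks),
                                         np.array_split(V, n_chunks))))
    lses = np.array(lses)
    w = np.exp(lses - lses.max())
    w = w / w.sum()                          # softmax over the chunk log-sum-exps
    return sum(wi * oi for wi, oi in zip(w, outs))

rng = np.random.default_rng(0)
K, V = rng.standard_normal((9, 4)), rng.standard_normal((9, 4))
q = rng.standard_normal(4)
out = flash_decoding(q, K, V)
```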

Continuous Batching
LLM inference is inherently memory-bound if only one sequence is processed. To increase the throughput for a large number of input prompts, the most straightforward approach was to allocate a fixed time window for decoding a fixed number of sequences. This is commonly known as static batching, which has been implemented in FasterTransformer [84] and many other systems [11, 90]. The advantage of static batching comes from the minimized latency of decoding with small batch sizes. As the batch size grows to achieve higher throughput, a mechanism for improving the effective utilization of batched decoding is needed.
Static batching results in resource waste, as some sequences in a batch reach their end earlier than others. Orca [31] proposed a dynamic sequence eviction strategy: sequences that have generated the EOS token are removed, and new prompts are inserted into the decoding batch. This approach is commonly referred to as continuous batching. In addition to the mechanism for handling continuous batching, Orca also introduced the idea of flattening multiple input prompts and concatenating them in the prefill kernel, in order to reduce padding and kernel launch overhead. A block-diagonal causal attention mask is commonly used to achieve a throughput gain with FlashAttention.
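The evict-and-refill loop at the heart of continuous batching can be sketched as a toy scheduler: finished sequences free their batch slots every step, and waiting prompts immediately fill them, so slots are never idle. This is a schematic sketch with a stand-in `step_fn` for the model; all names are ours.

```python
from collections import deque

EOS = "<eos>"

def continuous_batching(prompts, step_fn, max_batch=4):
    """Toy continuous-batching scheduler: evict finished sequences each step
    and insert waiting prompts into the freed slots."""
    waiting = deque(prompts)
    running, finished = [], []
    while waiting or running:
        # refill freed slots with new prompts (the "continuous" part)
        while waiting and len(running) < max_batch:
            running.append(list(waiting.popleft()))
        for seq in running:
            seq.append(step_fn(seq))               # one decode step per running sequence
        finished.extend(s for s in running if s[-1] == EOS)
        running = [s for s in running if s[-1] != EOS]
    return finished

# stand-in model: emit EOS once a sequence reaches three tokens
step = lambda seq: EOS if len(seq) >= 3 else "tok"
out = continuous_batching([["a"], ["b"], ["c"], ["d"], ["e"]], step, max_batch=2)
```

With static batching, the five prompts would need three fixed rounds of size two; here the third slot-holder starts as soon as the first sequence finishes.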

PagedAttention and its Derived Applications
Since the length of the output tokens is unpredictable, the most straightforward approach was to reserve memory for the maximal sequence length for each decoding request. As most of the reserved memory is never actually used, this introduces a large amount of internal memory fragmentation. As illustrated in Figure 3, internal memory fragmentation refers to memory space that is allocated but not effectively utilized for sequence decoding, while external memory fragmentation refers to device memory that is free but cannot be allocated for use. To reduce both internal and external memory fragmentation, PagedAttention [59] introduced a paging-inspired scheme that stores the KV cache in fixed-size blocks of non-contiguous memory. Several derived applications build on this design:
• FP8 (E5M2/E4M3) data type [82] for KV cache storage. Using an FP8 storage data type for the KV cache helps increase compute intensity in the decoding stage and mitigates the memory-bound decoding problem. It can also help increase the batch size while maintaining the same KV cache payload, compared to FP16/BF16 KV cache data types. The throughput benefit comes from the increased decoding batch size. The initial support for FP8 KV cache quantization [125] in vLLM reported a 1.49x throughput improvement on A100, trading off up to 2.4% accuracy degradation on HumanEval-Python evaluation tasks.
• Structured KV cache storage for shared prefix processing. Recent advancements in context-aware generation have demonstrated strong reasoning capability in multiple frameworks [17, 40, 53]. To reduce unnecessary computation while maintaining strong reasoning capability, Zheng et al. [136] proposed RadixAttention, which utilizes a radix tree and maintains the tree elements as sequences of varying lengths. It also introduces a compiler optimization framework to achieve longer shareable prefixes for caching.
• Reducing the interruption of input prompt encoding. To reduce high tail latency in the decoding phase due to long-context inputs, Agrawal et al. [1] proposed distributing long-context inputs into separate chunks of processing steps. It utilizes the chunked prefill kernel, which was initially proposed to reduce pipeline bubbles in multi-GPU serving. It increases the stability of decoding latency via stall-free decoding and improved end-to-end throughput by up to 1.33x.
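The block-level allocation that avoids reserving max-length regions per request can be sketched with a toy block table: each sequence maps logical block indices to physical blocks drawn from a shared free pool, and finished sequences return their blocks. This is our own schematic illustration of the idea, not vLLM's actual allocator.

```python
class PagedKVCache:
    """Toy paged KV cache: memory is reserved in small fixed-size blocks from a
    shared pool instead of one max-length region per sequence."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))      # physical block free list
        self.block_table = {}                    # seq_id -> list of physical block ids
        self.lengths = {}                        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_table.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:             # current block full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Sequence finished: its blocks return to the shared pool."""
        self.free.extend(self.block_table.pop(seq_id))
        del self.lengths[seq_id]

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token("seq0")                   # 6 tokens -> occupies 2 blocks of size 4
```

Internal fragmentation is bounded by one partially filled block per sequence, rather than by the gap to the maximal sequence length.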

STRUCTURED TRANSFORMER ARCHITECTURES
Beyond optimizing the serving of a given model, the model architectures themselves have evolved toward designs that enable faster and more efficient inference while remaining similarly powerful. In the following, we discuss changes to the attention mechanism that reduce its number of key and value heads (Section 3.1), mixture-of-experts approaches, which effectively execute only part of the network for each token (Section 3.2), and other architectural choices (Section 3.3).

Multi-/Grouped Query Attention
Falcon [3] and Llama 2 70B [109] employ techniques known as multi-query attention (MQA) [103] and grouped-query attention (GQA) [2], respectively. At inference time, memory and computational challenges arise from the repeated loading of decoder weights and attention keys/values across decoding steps. In multi-head attention, the number of key and value heads grows linearly with the number of query heads, increasing the memory bound and prohibiting potential latency improvements. MQA instead employs multiple query heads alongside a single key/value head, thereby accelerating decoder inference. GQA, an advancement over MQA, strikes a balance by utilizing an intermediate number of key-value heads (more than one but fewer than the query heads). The GQA model partitions the queries into heads as in the original multi-head attention mechanism, while dividing the keys and values into a handful of groups. For example, Llama-3 70B [109] uses 64 query heads which are grouped onto 8 key-value heads. This arrangement allows several query heads to share, and interact with, the same key-value head. By leveraging repeated key-value pairs, the GQA approach enhances inference performance while preserving quality.
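The head grouping can be sketched directly: with n_q query heads and n_kv key/value heads, each group of n_q/n_kv query heads attends against the same K/V head, so only n_kv key/value tensors need to be loaded and cached. A minimal single-token-batch NumPy sketch (names ours):

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def gqa(Q, K, V):
    """Grouped-query attention: n_q query heads share n_kv < n_q key/value heads.
    Q: (n_q, s, d); K, V: (n_kv, s, d)."""
    n_q, n_kv = Q.shape[0], K.shape[0]
    group = n_q // n_kv                              # query heads per KV head
    outs = []
    for h in range(n_q):
        Kh, Vh = K[h // group], V[h // group]        # shared key/value head for this group
        scores = Q[h] @ Kh.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
        outs.append(softmax(scores) @ Vh)
    return np.stack(outs)

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 5, 16))                  # 8 query heads
K = rng.standard_normal((2, 5, 16))                  # only 2 KV heads (group size 4)
V = rng.standard_normal((2, 5, 16))
out = gqa(Q, K, V)
```

With n_kv = n_q this reduces to multi-head attention, and with n_kv = 1 to MQA; the KV cache shrinks by the factor n_q/n_kv (4x here).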
For the MQA/GQA inference strategy in a distributed setting, there are a number of approaches. If possible, the common practice is to evenly distribute KV heads across multiple accelerators, which assumes that the number of KV heads is divisible by the number of accelerators. For the case where there are more accelerators than KV heads, Pope et al. [90] introduced an approach that distributes sequences over different accelerators, leveraging the all-to-all operator to transform the layout of the hidden states. It can effectively increase static-batching inference throughput when the batch size is large and the number of KV heads is small. Regarding support for PagedAttention [59], the best practice is yet to be explored, since KV cache block placement is determined at runtime; the approach is ineffective if the number of KV cache blocks is imbalanced among accelerators. Existing solutions either shard along the sequence dimension (e.g., PagedAttention V2) or replicate the KV heads on each accelerator.

Mixture of Experts for Transformer
The Mixture of Experts (MoE) [104] architecture, shown in Figure 5, is designed to activate only part of the expert computation by skipping inactive experts, while maintaining the capability to achieve high accuracy. Compared to a dense model, this allows pretrained models to use significantly less computation, and thus to increase the model's size or the dataset it handles within the same computational budget, in both training and inference. The MoE component has become a popular design choice for fast inference among the Transformer class [29, 49, 62]. Among the many variants of MoE [36, 55, 64, 133, 138, 141], the architecture typically comprises two primary components. First, sparse MoE layers replace conventional dense feed-forward network (FFN) layers. These MoE layers are comprised of a set number of "experts" (e.g., 8 in Mistral [49]), where each expert functions as an individual neural network. While these experts are typically FFNs in practice, they can also encompass more intricate networks or even form a hierarchical MoE structure [34]. Second, a gate network or router determines the allocation of tokens to specific experts. Notably, tokens can be directed to multiple experts. This decision is governed by the routing mechanism, a critical design choice for efficient inference and training. The router, comprising learned parameters, is pretrained concurrently with the remainder of the network and plays a pivotal role in token allocation within MoEs.
Routers for sparse MoEs can be categorized into two main variants: Token Choice, which assigns experts to individual tokens, and Expert Choice, which assigns tokens to individual experts. Token Choice can be optimal for latency-constrained applications, since the number of activated experts is small. Expert Choice is used for throughput optimization, especially when the total number of experts is small and the tokens can be balanced among all experts. In such applications, expert parallelism (EP) keeps each expert within a small group of accelerators, leading to fast inference by alleviating collective communication and dynamic weight loading.
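Token-choice routing with a top-k gate can be sketched as follows: the router scores all experts per token, and only the top-k experts actually run, with their outputs mixed by the renormalized router weights. This is a minimal dense-loop illustration (real systems dispatch tokens to experts in parallel); the experts here are stand-in tanh layers and all names are ours.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def moe_layer(x, W_gate, experts, top_k=2):
    """Token-choice routing: each token runs only its top-k experts, whose
    outputs are combined with renormalized router probabilities."""
    probs = softmax(x @ W_gate)                      # (n_tokens, n_experts) router scores
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]          # indices of the k highest-scoring experts
        w = probs[t][top] / probs[t][top].sum()      # renormalize over the chosen experts
        for e, we in zip(top, w):
            out[t] += we * experts[e](x[t])          # inactive experts are skipped entirely
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4
# stand-in experts: one small tanh layer each
experts = [(lambda W: (lambda v: np.tanh(v @ W)))(rng.standard_normal((d, d)))
           for _ in range(n_experts)]
x = rng.standard_normal((3, d))
W_gate = rng.standard_normal((d, n_experts))
y = moe_layer(x, W_gate, experts)
```

With top_k=2 out of 4 experts, each token touches only half of the expert parameters, which is the source of the inference savings.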

Other Architectures
The Sliding Window Transformer (SWT) [12] is a variant of the self-attention mechanism designed to handle long sequences more efficiently by dividing the input sequence into smaller, overlapping chunks or "windows." For each token, the attention score is computed only over a window of the w preceding tokens rather than the entire (previous) sequence. This attention mechanism sequentially slides across the input sequence to compute all localized attention scores. As the layers of the SWT get deeper, the localized attention mechanism extends the receptive field w.r.t. the input tokens, preserving a comprehensive understanding of the entire sequence, similar to a CNN. Each SWT layer requires only linear complexity O(n·w), mitigating the quadratic complexity O(n^2) of standard self-attention.
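The sliding-window constraint amounts to a banded causal attention mask: token i may attend to token j only if j is causal (j ≤ i) and within the last w positions. A minimal sketch of building such a mask:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean (n, n) mask: True where token i may attend to token j,
    i.e., j <= i (causal) and j > i - w (within the window of size w)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

mask = sliding_window_mask(6, 3)
# row 5 attends to positions 3, 4, 5 only
```

Each row has at most w True entries, so masked attention costs O(n·w) rather than O(n^2).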
Mixture-of-Depths [93] allows some tokens to dynamically take paths across layers, skipping certain layers based on specific criteria (e.g., CALM [99] with exit criteria during the forward pass), instead of all tokens passing through every layer of the Transformer. This approach enables the model to allocate computational resources more efficiently, focusing more layers on complex parts of the input while using fewer layers for simpler parts. Mixture of depth can help reduce computational costs and improve forward/backward speed without significantly compromising model performance.

MODEL COMPRESSION
Model compression techniques [22] compress a model or its input, thereby reducing the memory footprint and latency of LLMs. These methods come with challenges, as they typically introduce trade-offs between inference improvement and accuracy. Quantization of model weights (Section 4.1) has essentially become a standard nowadays. Pruning parts of models has posed more challenges but has also seen much progress targeted specifically at LLMs (Section 4.2). Lastly, entirely compressed models can be trained through distillation from a large teacher model (Section 4.3).

Quantization
Quantization [37] is a model-compression technique that represents the weights or activations of the network with low-precision data types (e.g., 8-bit integer) instead of high-precision data types (e.g., FP32), thereby reducing storage when loading the model/activations in hardware (see Figure 6). Reduced data precision poses a trade-off between latency, throughput, and accuracy. It also requires support from the target hardware to realize the maximum speedup [111]. Quantization is applied either during or after training.
The upper bound for both the latency and throughput improvement from weight-only quantization is the ratio of the source precision data type's width to the target's. For example, the upper bound on latency/throughput improvement with INT8 quantization down from 32-bit floating point (FP32) is 4x. As INT8 parameters require 4x fewer bits than FP32, we can increase the batch size as well as perform more computations on the same data size in one go. However, the memory saving does not directly translate into improved throughput/latency due to several factors like memory bandwidth, hardware limitations, and quantization/de-quantization overhead.
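The 4x storage factor can be seen in a minimal symmetric per-tensor INT8 round-trip, one of the simplest quantization schemes (shown here for illustration; production methods typically use per-channel or group-wise scales):

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor quantization: W ≈ scale * W_int8."""
    scale = np.abs(W).max() / 127.0
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
W_q, scale = quantize_int8(W)
err = np.abs(W - dequantize(W_q, scale)).max()  # rounding error is at most scale / 2
```

`W_q` occupies a quarter of the bytes of `W`, matching the 4x upper bound from the text, while the worst-case element error is half the quantization step.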

Pruning
Pruning is a compression technique to remove redundant parameters from the model.The goal is to maintain prediction quality of the model while shrinking its size, and thereby increasing its efficiency.Pruning requires strategies to identify which parts to remove and, potentially, how to adapt the remaining parts in order to compensate for quality degradation.
Unstructured Pruning removes individual weights of the network. Clearly, weights that are 0 can be ignored without any loss in accuracy, but very small weights can also be set to zero. Pruning weights that are not small enough eventually degrades the model, which sets the limit for the speedup. Given a desired sparsity ratio and a matrix W, the simplest strategy is to prune the weights with the smallest magnitude, which corresponds to minimizing the Frobenius norm between the dense matrix W and its sparse approximation W_hat, i.e., ∥W − W_hat∥_F². This approach, referred to as magnitude pruning, quickly leads to drastic accuracy degradation [32, 106]. Wanda [106] and RIA [132] improve over simple magnitude pruning by reweighing the matrix weights with the norm of the corresponding input activations. Another popular Transformer pruning method is SparseGPT [32], which jointly optimizes the pruning mask as well as the remaining weights in order to minimize ∥(W − W_hat)X∥_F², where X represents a sample of inputs to the linear layer. Since finding the optimal pruning mask is a combinatorial problem, SparseGPT employs heuristics to make it computationally feasible. While most methods apply the sparsity uniformly across layers, OWL [127], BESA [124], and ISC [101] derive criteria to prune layers to different levels.
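The magnitude-pruning baseline can be sketched in a few lines: for a target sparsity, zero out the smallest-magnitude entries, which is exactly the minimizer of ∥W − W_hat∥_F² over masks of that sparsity. Function names are ours.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude weights to reach the target sparsity;
    this minimizes ||W - W_hat||_F^2 among masks of that sparsity."""
    k = int(W.size * sparsity)
    threshold = np.sort(np.abs(W).ravel())[k]   # the k smallest magnitudes get pruned
    mask = np.abs(W) >= threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))
W_sparse, mask = magnitude_prune(W, sparsity=0.5)
```

Methods like Wanda replace the plain |W| criterion here with |W| scaled by the input-activation norm, keeping the same thresholding structure.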
Unstructured sparsity is mainly of academic interest since, so far, it does not lead to speedups on hardware accelerators (Flash-LLM [119] recently provided some steps in this direction). However, most methods can also be applied to achieve N:M structured sparsity, where only N out of M consecutive elements are allowed to be nonzero. Some hardware accelerators support these patterns and allow for memory savings and speedups [89].
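The N:M pattern can be illustrated with a small sketch, here for the common 2:4 case: within every group of four consecutive weights, only the two largest-magnitude entries are kept. The function name is illustrative.

```python
# Minimal sketch of N:M structured sparsity (default 2:4): in every group
# of M consecutive weights, keep only the N largest-magnitude entries.

def nm_prune(weights, n=2, m=4):
    out = []
    for g in range(0, len(weights), m):
        group = weights[g:g + m]
        keep = set(sorted(range(len(group)), key=lambda i: -abs(group[i]))[:n])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out

w = [0.9, -0.1, 0.4, 0.05, -0.2, 0.7, 0.01, -0.3]
w_24 = nm_prune(w)  # at most 2 nonzeros in each group of 4
# -> [0.9, 0.0, 0.4, 0.0, 0.0, 0.7, 0.0, -0.3]
```

Because every group has the same number of nonzeros, the surviving values plus small per-group indices can be stored compactly and consumed by sparsity-aware matrix units.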
While pruning can in principle be done during pretraining [61, 131, 140], most recent work focuses on the post-training setting. Nonetheless, in order to recover from the accuracy loss due to pruning, many works apply a short training stage after pruning. This is done either via a standard pretraining loss [78, 106] or with variants of distillation losses [56, 60, 97, 108, 120]. To increase efficiency, some works do not update all remaining parameters but employ parameter-efficient techniques like LoRA [46]. Generally, such strategies help recover the accuracy loss, but they are also prone to overfitting to the specific dataset used [78] and can compromise the generality of the model.

Distillation
Knowledge distillation (KD) [15, 44, 76, 117] is a model compression technique in which we train a small model (called the student) to closely match the performance of a larger model or an ensemble of models (called the teacher). To this end, KD connects the student model with the teacher model by a distillation loss, which penalizes differences in the outputs of the two models at certain layers (see Figure 7). The standard KD approach, also called the last-layer-only approach, trains the student to match the performance of the teacher at the last layer (e.g., [44, 96]). Another approach, the layer-wise approach, trains the student to match the hidden representation of the teacher at each layer (e.g., [107]). Layer-wise distillation approaches report improved results on downstream tasks compared to last-layer-distillation approaches [71], but they stipulate the same number of layers in the student as in the teacher. In general, KD approaches are flexible with regard to the exact structure of the student model, which allows optimizing the student for various target hardware. Another advantage is that the distillation process runs entirely after training the large teacher model.
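A minimal sketch of the Hinton-style last-layer distillation loss, assuming temperature-softened softmax outputs and a KL-divergence penalty (the function names and toy logits are illustrative; real training combines this with the task loss and backpropagates through the student only):

```python
import math

# Minimal sketch of a last-layer knowledge-distillation loss:
# KL divergence between temperature-softened teacher and student
# softmax distributions for one token position.

def softmax(logits, temperature=1.0):
    z = [l / temperature for l in logits]
    m = max(z)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)   # teacher is the target
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

t = [4.0, 1.0, 0.2]   # teacher logits for one token (toy values)
s = [3.0, 1.5, 0.1]   # student logits for the same token
loss = distillation_loss(s, t)
# The loss is zero iff the softened distributions match exactly.
assert distillation_loss(t, t) < 1e-12
```

The temperature flattens both distributions so that the student also learns from the teacher's ranking of unlikely tokens, not just its top prediction.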
Distillation does not affect the training of the teacher model, but the distillation effort itself can amount to a major training effort for the following reasons. First, the number of steps can be similar to pretraining a small model. Second, the distillation loss is usually a combination of the pure student/teacher loss together with an original loss, for which the original pretraining data is typically recommended [44]. To compute the distillation loss, we also need to make a forward pass of the teacher model to obtain its logits. Moreover, there is a range of possibilities in selecting the transfer set on which to train the smaller distilled model [87]. For example, symbolic distillation [75, 117] approaches synthesize data from the teacher model to this end. Distillation also comes with a trade-off between size and quality, which determines the improvement in throughput/latency.

FAST DECODING
As discussed, vanilla auto-regressive decoding is memory bound. Speculative decoding (SD) [18, 63] exploits the fact that multiple draft tokens can be verified in a single forward pass of the target model. The draft tokens are then accepted based on a rejection sampling scheme [18, 63] or deterministic approaches [54]. Processing the draft tokens requires additional computations in the target model, but the main bottleneck remains the loading of the weights. Hence, the verification of the additional draft tokens comes at negligible additional latency. Once draft tokens are accepted, multiple tokens are decoded with a single call to the target model, resulting in an overall latency reduction. Notably, as opposed to the compression techniques in Section 4, the output distribution provably remains the same [18, Theorem 1].
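The acceptance rule behind this distribution-preserving guarantee can be sketched compactly. Assuming a draft token x with draft probability q(x) and target probability p(x), it is accepted with probability min(1, p(x)/q(x)); on rejection, a token is resampled from the normalized residual max(p − q, 0). Probabilities here are toy dictionaries, and the function name is illustrative.

```python
import random

# Minimal sketch of the speculative-decoding acceptance rule.
# A draft token x ~ q is accepted with probability min(1, p(x)/q(x));
# on rejection, resample from the residual distribution max(p - q, 0),
# normalized. The resulting token is distributed exactly according to p.

def accept_or_resample(x, p, q, rng=random):
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = {t: max(p[t] - q[t], 0.0) for t in p}
    z = sum(residual.values())
    r, acc = rng.random() * z, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t, False
    return t, False  # fallback for floating-point edge cases
```

Running this over many drafted tokens reproduces the target distribution p regardless of how crude the draft distribution q is; q only affects the acceptance rate, i.e., the speedup.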
Beyond verification, the draft token generation also adds to the latency. We classify SD methods broadly into two categories, based on whether or not they use a separate model for drafting. The seminal work [63] uses a smaller model from the target model's family as the draft model, e.g., T5-small as the draft model for T5-XXL, whereas Chen et al. [19] train a separate draft model from scratch.
Choosing an appropriate draft model for a target model can be tricky. In light of this, some SD methods take advantage of the target model itself. For example, self-speculative decoding [129] drafts tokens using the target model but skips some of its intermediate layers. Medusa [16] trains multiple feed-forward heads on top of the last Transformer layer, where the k-th head is responsible for predicting the token k positions into the future. EAGLE [70] improves the heads by introducing auto-regression on features at the last Transformer layer. PaSS [83] appends special "look-ahead" tokens to the prompt as input and generates the corresponding draft tokens in parallel using the target model itself. Lookahead Decoding [35] applies the Jacobi method [98, 105], which drafts multiple tokens in parallel. In some applications (e.g., question answering), one can draft tokens by matching their prefix in a document [5] or a database [43].
There are two orthogonal paths to further speed up speculative decoding. One is to draft multiple sequences for verification. The other is to improve the acceptance rate. We elaborate on them next.
Multiple Drafted Sequences. In the vanilla case of a single drafted sequence, all drafted tokens after the first rejection position are wasted. In contrast, drafting multiple sequences increases the chance of obtaining a longer accepted sub-sequence. The multiple sequences are often organized in a tree structure so as to share prefixes. Correspondingly, verification is made more efficient by introducing tree attention, with a specialized attention mask that reflects the token dependencies in the tree. This approach was first proposed in SpecInfer [81], adopted in several aforementioned papers (e.g., Medusa [16], EAGLE [70], Lookahead Decoding [35]), and further developed by Chen et al. [21]. Depending on the model architecture, reported speedups are often in the 2–3× range.
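The tree attention mask described above can be sketched as follows: each drafted token is a node, and token i may attend to token j only if j is an ancestor of i (or i itself), so the two branches never see each other. The parent-array encoding and function name are illustrative.

```python
# Minimal sketch of a tree attention mask for verifying multiple drafted
# sequences in one forward pass. parent[i] is the parent index of drafted
# token i, or -1 for a root. mask[i][j] is True iff token i may attend to
# token j, i.e., j lies on i's path to the root (including i itself).

def tree_attention_mask(parent):
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i][j] = True
            j = parent[j]
    return mask

# Two drafted branches sharing the first token:
#   0 -> 1 -> 2   and   0 -> 3
parent = [-1, 0, 1, 0]
m = tree_attention_mask(parent)
# Token 2 attends to {0, 1, 2}; token 3 attends to {0, 3} only.
```

In practice this boolean mask is added (as 0 / −inf) to the attention scores, so one batched forward pass scores every branch of the draft tree simultaneously.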
Aligning the Draft to the Target Model. In [63], the rejection rate is shown to be equal to the total variation divergence (TV-div) between the target and draft models' token probabilities. This neat theoretical result motivated DistillSpec [139] to distill knowledge from the target model into the draft model. With the better-aligned draft model, 10–45% further speedups are reported. The objective function for distillation can be either the conventional Kullback-Leibler divergence (KL-div) or the more directly relevant TV-div. Note that KL-div can be considered a surrogate for TV-div due to Pinsker's inequality. Interestingly, [139] does not observe an obvious advantage of TV-div over KL-div.
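The surrogate relationship can be checked numerically. This toy sketch (with hypothetical three-token distributions) computes both divergences and verifies Pinsker's inequality, TV(p, q) ≤ sqrt(KL(p‖q)/2), which is why driving KL-div down also bounds the rejection rate:

```python
import math

# Toy check of the KL-div / TV-div relation used for draft-target
# alignment: Pinsker's inequality gives TV(p, q) <= sqrt(KL(p || q) / 2),
# and the rejection rate of speculative decoding equals TV(p, q).

def kl_div(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv_dist(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.6, 0.3, 0.1]    # target model token probabilities (toy)
q = [0.5, 0.3, 0.2]    # draft model token probabilities (toy)
assert tv_dist(p, q) <= math.sqrt(kl_div(p, q) / 2)   # Pinsker holds
```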

CONCLUSION
This paper provides a comprehensive overview of efficient inference methods for LLMs, covering system optimization, structured Transformer architectures, model compression, and algorithmically faster decoding, especially in the context of AI accelerators. These techniques aim to make computation efficient by accounting for input-output (IO) communication during attention score calculation, reducing extensive and repetitive self-attention computation, minimizing memory idleness, and compressing the models themselves.
Inference optimization is crucial not only for Transformer-based LLMs, but also for other foundation models such as Stable Diffusion [94] or the Transformer alternative of State Space Models (SSMs) [38, 39]. Several of the techniques presented in this paper have been successfully applied to these models as well; e.g., in Stable Diffusion with FlashAttention [20], quantization [42, 69, 100, 114], sparsity [67], and distillation [80, 95], or in SSMs with Mixture of Experts [7, 88].
Nevertheless, many of the challenges remain largely unresolved, particularly when dealing with extremely long context lengths and sequences, necessitating tailored efforts depending on the types of devices used.We are confident that researchers and developers will continue to strive towards narrowing these gaps, thereby enhancing the accessibility of Generative AI systems.

Figure 1: Original Transformer architecture, adopted from [112, Figure 1], comprising an encoder (left) and a decoder (right). Tokens are initially encoded into an embedding space, and a positional encoding is used to encode information about the token positions. Modern LLM architectures are decoder-only, with a backbone built of repeated layers containing masked attention and a feed-forward neural network (FFN). The masked attention first applies linear transformations on a sequence of embeddings to obtain query (Q), key (K), and value (V) matrices and computes Attention(Q, K, V) = softmax(QK^⊤/√d_k)V.

Figure 3: Types of memory fragmentation, by Kwon et al. [59]. The figure depicts the memory space for decoding two sequences. Allocated KV cache blocks that are not occupied by the sequences constitute internal memory fragmentation. The free memory space that is not allocated constitutes external memory fragmentation.

Figure 5: Instead of the dense feed-forward network layer in the traditional Transformer (left, blue), Fedus et al. [29] introduce a sparse Switch FFN layer (right, blue). This layer operates independently on the sequence's tokens.

Figure 7: Canonical knowledge distillation process by Hinton et al. [44], where a small student model is trained to mimic a large teacher model by minimizing a distillation loss on a transfer dataset; this loss is then backpropagated to the student model.