Grape: Practical and Efficient Graph-based Executions for Dynamic Deep Neural Networks on GPUs

Achieving high performance in machine learning workloads is a crucial yet difficult task. To achieve high runtime performance on hardware platforms such as GPUs, graph-based executions such as CUDA graphs are often used to eliminate CPU runtime overheads by submitting jobs in the granularity of multiple kernels. However, many machine learning workloads, especially dynamic deep neural networks (DNNs) with varying-sized inputs or data-dependent control flows, face challenges when directly using CUDA graphs to achieve optimal performance. We observe that the use of graph-based executions poses three key challenges in terms of efficiency and even practicability: (1) Extra data movements when copying input values to graphs’ placeholders. (2) High GPU memory consumption due to the numerous CUDA graphs created to efficiently support dynamic-shape workloads. (3) Inability to handle data-dependent control flows. To address those challenges, we propose Grape, a new graph compiler that enables practical and efficient graph-based executions for dynamic DNNs on GPUs. Grape comprises three key components: (1) an alias predictor that automatically removes extra data movements by leveraging code positions at the Python frontend, (2) a metadata compressor that efficiently utilizes the data redundancy in CUDA graphs’ memory regions by compressing them, and (3) a predication rewriter that safely replaces control flows with predication contexts while preserving programs’ semantics. The three components improve the efficiency and broaden the optimization scope of graph-based executions while allowing machine learning practitioners to program dynamic DNNs at the Python level with minimal source code changes. We evaluate Grape on state-of-the-art text generation (GPT-2, GPT-J) and speech recognition (Wav2Vec2) workloads, which include both training and inference, using real systems with modern GPUs.
Our evaluation shows that Grape achieves up to 36.43× less GPU memory consumption and up to 1.26× better performance than prior works on graph-based executions that directly use CUDA graphs. Furthermore, Grape can optimize workloads that are impractical for prior works due to the three key challenges, achieving 1.78× and 1.82× better performance on GPT-J and Wav2Vec2 respectively than the original implementations that do not use graph-based executions.

CCS CONCEPTS
• Computing methodologies → Parallel computing methodologies; Machine learning; Artificial intelligence.


INTRODUCTION
DNNs form an important class of machine learning algorithms and have made significant advancements in numerous domains such as image classification [41,43,107], text generation [11,78,87,106,114], and speech recognition [5,8,36]. In those applications, the DNNs are first trained and then deployed for inference, both for many iterations. Due to the high cost of the two operations in terms of time and hardware resources [81,100], it is crucial to achieve efficient execution of DNNs on specialized hardware platforms such as GPUs.
Graph-based executions such as CUDA graphs [35] on NVIDIA GPUs are an effective way of improving the performance of DNNs by submitting jobs to GPUs in the granularity of multiple kernels instead of a single kernel. Specifically, given a DNN model, CUDA graphs capture the model's computations on GPUs the first time they are launched and replay those computations in subsequent iterations with a single graph launch call. To keep the correctness and the consistency between the captured operations and the replayed ones, CUDA graphs request that all workloads be deterministic (i.e., no control flows) and have their parameters fixed when capturing, which include CUDA kernel launch configurations and function arguments. In order for CUDA graphs to operate on different input values, synthetic inputs are used at capture time as placeholders, which are populated with real input values (i.e., input data from machine learning workloads) at replay time.
Although CUDA graphs are powerful in eliminating CPU overheads for general-purpose GPU applications, it is difficult to directly apply them to many DNNs (especially dynamic DNNs that have varying-sized inputs or data-dependent control flows, which are ubiquitous in modern machine learning applications [5,8,11,36,78,87,106,114]). This is because such dynamic DNNs pose three key challenges in terms of efficiency and even practicability: (1) Extra data movements from real inputs to the graphs' placeholders: The design of CUDA graphs introduces extra copies from real input values to the graphs' placeholders, accounting for up to 34% of the models' execution time.
(2) Prohibitively large GPU memory consumption to efficiently support dynamic-shape workloads: To efficiently support workloads with dynamic input shapes, we have to create a CUDA graph for each shape, and every created graph consumes a certain amount of GPU memory. Although each amount is small by itself and hence not a challenge for static-shape DNNs, these allocations sum to a huge value (20-100 GB) that exceeds the GPU memory capacity (e.g., 24 GB on an NVIDIA RTX 3090 GPU [71], 40/80 GB on an A100 GPU [70], or 80 GB on an H100 GPU [74]) when the number of possible shapes is 1024 (as used in state-of-the-art language modeling applications [87,114]).
(3) Inability to handle programs with data-dependent control flows: CUDA graphs can only capture deterministic computation and thus fail to capture modules with data-dependent control flows. This restriction prevents us from using them to optimize important modules that take a large portion of the total runtime but nonetheless have control flows within them. For example, our runtime measurements show that the beam search module [112] of state-of-the-art text generation applications [87] constitutes 30% of the total execution time, but it is out of the optimization scope of CUDA graphs for having data-dependent control flows.
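The kind of data-dependent branching that defeats capture can be seen in a stripped-down beam-search step. This is a hypothetical sketch: names such as `EOS_TOKEN` and `scoreboard` are illustrative, not taken from any specific implementation.

```python
# Toy sketch of the data-dependent control flow inside a beam search step.
# Whether the branch is taken depends on the *value* of next_tokens, which
# is only known after the GPU computes it -- exactly what CUDA graphs,
# which replay a fixed kernel sequence, cannot capture.

EOS_TOKEN = 2  # hypothetical end-of-sequence token id

def beam_search_step(beams, next_tokens, scores, scoreboard):
    """Advance each beam by one token; finished beams go to the scoreboard."""
    surviving = []
    for beam, tok, score in zip(beams, next_tokens, scores):
        if tok == EOS_TOKEN:                    # data-dependent branch
            scoreboard.append((beam, score))    # sentence is complete
        else:
            surviving.append((beam + [tok], score))
    return surviving
```

Every iteration of this loop may follow a different path through the branch, so no single recorded kernel sequence can represent it.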
These three challenges hurt the efficiency and the practical use of CUDA graphs in many real-world scenarios. To adequately address those challenges, we propose Grape, a graph compiler that makes graph-based executions practical and efficient for dynamic DNNs on GPUs. Grape is made up of three key components: (1) An alias predictor that automatically and accurately predicts if a tensor will be copied to a graph's placeholder. This enables direct forwarding of the placeholder's memory region to the tensor (which we denote as that placeholder's alias), eliminating the need for future data movements. We observe that the main reason the data movements are needed in the existing graph-based executor [80] is that the memory allocators of machine learning frameworks [2,80] are unable to tell whether an arbitrary tensor created in Python's dynamic programming environment will be moved to a placeholder in the future. Therefore, they can only offer a general memory region to the tensor first and then move it later to the regions that are reserved for placeholders. To the best of our knowledge, this is a common problem for all frameworks adopting a Python frontend and a C++ runtime (e.g., PyTorch [80] and TensorFlow [2]).
We, however, notice that such data movements can be avoided if the Python frontend is considered when allocating memory. Specifically, we observe that DNN executions are highly regular and a Python code position that yields a placeholder's alias in one iteration is also likely to yield another alias for the same placeholder in the next iteration. Based on this insight, we design a frontend-aware alias predictor that dynamically records the connection between the Python frontend and the placeholders. The predictor requires zero changes to frontend applications' source code and can accurately forward the reserved memory regions for placeholders to their respective aliases.
(2) A metadata compressor that significantly reduces the GPU memory consumption of CUDA graphs by compressing their memory regions at capture time and efficiently decompressing them at replay time. Although the precise technical details regarding the CUDA graphs' memory regions are proprietary, after carefully studying their dumped values we speculate that they are used for caching the graphs' metadata on GPUs. The metadata describes what is captured in the graph and contains information such as the CUDA kernels that reside in the graph along with their function arguments.
We further observe from the dumped values that the CUDA graphs' memory regions for many important models are very sparse and repetitive. This is because most of the CUDA kernels in state-of-the-art DNNs [8,87,114] do not fully leverage the function argument space provided to them, and they are usually invoked using similar pointer values. Therefore, those CUDA kernels leave abundant sparsity and value redundancy in their memory regions, allowing us to compress them using simple compression algorithms such as run-length encoding [31]. With efficient compression, we are able to fit the extremely large GPU memory footprint of CUDA graphs for dynamic-shape DNNs (up to 100 GB) into the limited memory of modern GPUs.
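A run-length encoder of the kind the text alludes to can be sketched in a few lines. This is a generic illustration of the compression scheme, not Grape's actual codec.

```python
# Minimal run-length encoding over bytes: a list of (value, run_length)
# pairs. Sparse, repetitive buffers -- like the mostly-zero CUDA graph
# metadata regions described above -- compress extremely well this way.

def rle_compress(data: bytes) -> list:
    runs = []
    for b in data:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((b, 1))              # start a new run
    return runs

def rle_decompress(runs: list) -> bytes:
    return b"".join(bytes([b]) * n for b, n in runs)
```

A 100-byte buffer that is more than 90% zeros collapses into just a handful of runs, which is why even this simple scheme recovers most of the redundant space.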
(3) A predication rewriter that replaces data-dependent control flows with predication contexts, a new Python context that controls whether GPU operations within it can go through or not using a predicate. Although the idea of predication has been proposed in prior works [39,42,45,63] at the architecture level to handle short if-else statements, we significantly broaden its scope of applicability by allowing machine learning practitioners to implement common control logic (e.g., if, break, and continue) at the Python level. Predication contexts stitch fragmented basic blocks together into a monolithic block that does not have control flows while preserving the programs' semantics, which hence enables conversion to efficient graph-based executions.
Our contributions can be summarized as follows:
• We identify the key challenges in making graph-based executions practical and efficient for dynamic DNNs, an important class of DNNs that involve varying-sized inputs and/or data-dependent control flows.
• We build Grape, a new graph compiler that is integrated into PyTorch [80], a state-of-the-art machine learning framework. Grape adequately addresses those challenges and allows machine learning practitioners to program their models at the Python level with minimal source code changes. Its design is based on three key insights: (1) regularity in mappings from the Python frontend to graphs' placeholders, (2) high data redundancy in graphs' memory regions, and (3) predication contexts in place of data-dependent control flows.
• We evaluate Grape using state-of-the-art DNN training and inference workloads in text generation (GPT-2 [87], GPT-J [114]) and speech recognition (Wav2Vec2 [8]) on NVIDIA RTX 3090 [71] and A100 [70] GPUs. Our evaluation shows that Grape achieves up to 1.26× better performance on GPT-2 [87] than prior works on graph-based executions [80]. Furthermore, as Grape reduces the GPU memory consumption of CUDA graphs by up to 36.43×, it is able to practically optimize dynamic-shape workloads that are challenging for prior works [80], achieving up to 1.78× and 1.82× better performance than the original implementations that do not use graph-based executions on GPT-J [114] and Wav2Vec2 [8], respectively.

BACKGROUND AND MOTIVATION
In this section, we present an overview of (1) what CUDA graphs are and why they are important (Section 2.1), and (2) the challenges of using CUDA graphs to efficiently execute machine learning workloads (Section 2.2), especially those that involve dynamic shapes and/or dynamic control flows. After we present those challenges, we will also demonstrate why prior works [69,86,105,118] are inadequate in addressing them (Section 2.3). This motivates us to propose a new graph compiler that allows for practical and efficient graph-based executions on GPUs.

CUDA Graph
Modern machine learning frameworks such as PyTorch [80] and TensorFlow [2] commonly follow a hierarchical design: To achieve high efficiency and programmability, they have a Python programming frontend that communicates with hardware backends such as NVIDIA GPUs via a C++ runtime. CUDA graphs are a GPU programming model that can significantly boost the performance of machine learning workloads on NVIDIA GPUs by eliminating the overheads on the CPU side, which exist across the frontend (e.g., Python invokes C APIs), the runtime (e.g., operator implementations verify input data values and shapes), and the backend (e.g., CUDA [72] launches GPU kernels using the cudaLaunch API). In addition to being easily programmable by machine learning practitioners at the Python level, CUDA graphs are orthogonal to other optimizations such as reduced precision [66,67,116] and quantization [38,44,52,57,103,108,121], and their ideas can be generically applied to other hardware platforms (e.g., HIP graphs [4] on AMD GPUs are similar in spirit to CUDA graphs).
CUDA graphs aim at eliminating CPU overheads by capturing and replaying only the effective computations on the GPU side (Figure 1): To construct a CUDA graph, the workload is first captured (❷ in Figure 1) to record all the GPU operations that happen within a given capture context. In subsequent runs, the exact same operations in the context can be replayed (❹) by launching the graph object instead. Due to the consistency between the operations that are captured and the ones that are replayed, CUDA graphs request that all GPU kernels be deterministic (i.e., no control flow when capturing) and have their parameters fixed. These parameters include launch configurations (e.g., block and grid dimensions) and function arguments (e.g., pointer values that refer to the input and output data). To have CUDA graphs operate on different data values, synthetic inputs are used at capture time as placeholders (❶), and their contents are populated at replay time with real input values from machine learning workloads (❸).
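The capture/replay flow above can be emulated in plain Python to make the semantics concrete. This is a toy model only: a real implementation would go through `torch.cuda.CUDAGraph` or the CUDA driver API, and the "graph" below is just a recorded list of operations bound to a fixed placeholder buffer.

```python
# Toy emulation of CUDA graph capture/replay semantics in plain Python.
# The recorded ops and their argument (the placeholder buffer) are frozen
# at capture time; only the placeholder's *contents* change at replay time.

class ToyGraph:
    def __init__(self):
        self.ops = []           # recorded (fn, buffer) pairs, parameters fixed
        self.placeholder = [0]  # synthetic input buffer used at capture time

    def capture(self, fn):
        # Record the op against the placeholder; its arguments are now fixed.
        self.ops.append((fn, self.placeholder))

    def replay(self, real_input):
        # Populate the placeholder with the real input value (the extra copy
        # of Challenge #1), then replay the recorded ops with one call.
        self.placeholder[0] = real_input
        out = None
        for fn, buf in self.ops:
            out = fn(buf[0])
        return out

g = ToyGraph()
g.capture(lambda x: x + 10)  # captured once with a fixed placeholder
result = g.replay(5)         # populate the placeholder, replay the graph
```

Because the recorded ops never change, the replay path skips all the per-kernel CPU work; the price is the input copy into the placeholder on every replay.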

Challenges of Using CUDA Graphs
Although CUDA graphs are effective in removing CPU overheads, it is challenging to apply them out-of-the-box to many machine learning workloads for runtime efficiency. We use the GPT-2 [87] model as an example.¹ Note that as we introduce the model in the next paragraph, we highlight key points that hinder the practicability or the efficiency of CUDA graphs in bold text.
GPT-2 [87] is an important language model in the text generation domain. It has structural similarity with numerous large-scale language models [11,78,106,114] and many state-of-the-art DNNs in other domains such as speech recognition [8,36]. Figure 2 illustrates the text generation pipeline of the GPT-2 [87] model. In each decoding iteration (e.g., ① in Figure 2), the GPT-2 model [87] first decodes the current input tokens for the probabilities of the next token on all the possible vocabularies. The input tokens have dynamic sequence lengths whose value can only be known at runtime. The model then applies a searching algorithm (e.g., the beam search algorithm [112]) that cherry-picks tokens to compose sentences that have the highest overall probabilities (②). This process continues until the stopping criteria are met (e.g., when the maximum generation length is reached or when the end-of-sequence token <EOS> is generated). To avoid re-evaluating the intermediate values of past input tokens, those values are usually cached and forwarded to the next iteration as inputs [84] (e.g., ① passes its intermediate values evaluated for "I am" to ③ via ④).
Although the input tokens that need to be decoded in iteration ③ have a sequence length of 1, they need the intermediate values from the previous iteration, whose sequence length is again dynamic, to generate the next token. Figure 3 illustrates the runtime breakdown of GPT-2's [87] text generation pipeline on a modern NVIDIA RTX 3090 GPU [71], using the PyTorch [80] framework without and with the CUDA graphs applied (see Section 5.1 for the detailed methodology). We categorize the runtime into three portions: (1) InputPrep: the time spent on preparing input values to the model (e.g., filling non-provided arguments with their default values), (2) Model: the time spent on the model's forward pass (i.e., "GPT-2" in Figure 2), and (3) BeamSearch: the time spent on the beam search (i.e., "BeamSearch" in Figure 2). We make the following key observations from Figure 3:

• The InputPrep portion becomes 4.14× longer with the CUDA graphs applied (23% of the time on Model). This is because the data copying from runtime input values into the graphs' placeholders (❹ in Figure 1) incurs runtime overheads (Challenge #1), but such copying is an essential step for CUDA graphs to operate on distinct input values (see Section 2.1). More fundamentally, this is because by the time of tensor A's creation in Figure 1(b), the memory allocator of the machine learning framework is unable to foresee the future and tell that A will be copied to the placeholder. Therefore, it can only allocate a general memory region to A and copy its contents to the placeholder later, resulting in extra data movements. Although one could manually force A to directly take the memory space of ph_A when A is created by modifying the machine learning application's source code, this is a tedious backtracking process that has to be done for all the placeholders. For example, there are 27 placeholders in the GPT-2 [87] model. Most of them are the intermediate values coming from the previous decoding iterations [84] (④ in Figure 2). Those values are created at different source code locations that span multiple files, making them hard to trace.

Model         GPT-2 [87]   GPT-J [114]   Wav2Vec2 [8]
Memory (MB)   20           102           98
Operators     446          1666          1476

Table 1: GPU memory consumption increase per CUDA graph creation and number of executed operators of the three models on an RTX 3090 GPU
• Although CUDA graphs can speed up the Model portion by 3.31×, they nevertheless come with a significant increase in GPU memory consumption due to the extra metadata: Each time a CUDA graph is created, a certain amount of GPU memory is allocated. The precise technical details regarding how the memory is used have not been revealed by NVIDIA, but by looking at the actual GPU memory content, we speculate that it is used for the CUDA graph's metadata. This can be empirically observed in the results shown in Table 1, where we present the GPU memory consumption increase per CUDA graph creation for three state-of-the-art machine learning models: (1) GPT-2 [87] (introduced above), (2) GPT-J [114], whose architecture is similar to GPT-2 [87] but possesses 48.3× more parameters, and (3) Wav2Vec2 [8], the state-of-the-art speech recognition model. We observe from the table that the GPU memory consumption differs across models and scales proportionally with the number of operators executed in the model. We present more detailed evidence that the memory holds the CUDA graph's metadata in Section 3.2.
Each memory allocation made by CUDA graphs is not particularly large in isolation; however, their cumulative impact can become significant when attempting to optimize dynamic-shape workloads whose input shapes vary upon each model invocation (Challenge #2). For example, to support the maximum sequence length of 1024, which is the context size when training the GPT-2 model [87], 20 GB of GPU memory has to be allocated just for the CUDA graphs, which is on the same order as the GPU memory capacity of a modern RTX 3090 GPU [71] (24 GB). This does not even consider several complications. For example, when serving models like GPT-2 [87] and GPT-J [114], their inputs could be batched dynamically at runtime, yielding a variable batch size per iteration [125]. As another example, when using CUDA graphs to train a model, the backward pass [97] of the model also requires graph constructions distinct from those of its forward pass. All these complications further increase the number of CUDA graphs required and exacerbate the GPU memory consumption problem, making CUDA graphs less practical for dynamic-shape workloads.
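The arithmetic behind this paragraph is easy to check. The per-graph cost comes from Table 1; the batching and backward-pass multipliers below are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope memory cost of one CUDA graph per input shape,
# using GPT-2's per-graph metadata cost from Table 1.
per_graph_mb = 20      # MB consumed per CUDA graph creation (Table 1)
num_shapes = 1024      # one graph per possible sequence length

total_gb = per_graph_mb * num_shapes / 1024
print(total_gb)        # comparable to an RTX 3090's 24 GB capacity

# Dynamic batching and a separate backward-pass graph multiply the count
# of graphs needed (both figures below are hypothetical):
num_batch_sizes = 8    # number of batch sizes served
fwd_and_bwd = 2        # separate forward and backward graphs for training
print(total_gb * num_batch_sizes * fwd_and_bwd)  # quickly exceeds any GPU
```

Even before the multipliers, the metadata alone rivals the memory the model itself needs.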
• CUDA graphs are unable to optimize the BeamSearch portion because it involves complex data-dependent control flows (Challenge #3). Specifically, it needs to check whether the next token generated is the <EOS> token and decide whether the current sentence should be inserted into its internal scoreboard, which maintains top-ranked sentences based on sentence length and overall probability. These program flows cannot be handled by CUDA graphs due to the requirement that captured computation be deterministic (as in Section 2.1). However, as Figure 3 shows, BeamSearch grows to 30% of the total execution time after the CUDA graphs are applied, and we cannot squeeze the last drop of performance from GPUs while it remains unoptimized.

Why A New Compiler?
Now that we have presented the key challenges to the practicality and efficiency of CUDA graphs, a natural question to ask is why a new compiler is needed to address them. Below, we answer this question by showing why prior works [69,86,105,118] and some obvious solutions fall short in resolving these challenges.
Eagerly release CUDA graphs. One possible solution to the GPU memory challenge posed by CUDA graphs is to construct them just-in-time before invoking the model and destroy them upon the model's completion. This solution is not feasible in practice, as the CUDA API that constructs the CUDA graph (namely cudaGraphInstantiate [72]) is prohibitively expensive runtime-wise (3-5 ms for one sequence length of the GPT-2 model [87] on an RTX 3090 GPU [71], which is more than 2× longer than running the model itself). Despite the high cost, the API call is necessary for the CUDA graphs to function correctly, as our experiments find that different CUDA graphs populate their respective memory regions with distinct values, even if they come from the same DNN but with different shapes. This is expected because machine learning frameworks such as PyTorch [80] are likely to launch the same CUDA kernel with distinct grid dimensions or even invoke distinct kernels when the input shapes vary [76,130].
Bucketing. Another possible solution to the GPU memory challenge is to reduce the number of CUDA graphs needed using bucketing [69,118]. Bucketing [69,118] implements dynamic-shape workloads by dividing the dynamic ranges into buckets, each of which denotes a static shape. At runtime, the workload is dispatched to the bucket that best fits its input shapes by padding its inputs to the bucket's shape. Despite being straightforward, bucketing [69,118] suffers from two key weaknesses: (1) it causes performance degradation due to padding [40,130], and (2) it requires careful insertion of many additional masking operations into the models to prevent the padded values from affecting the output results [69], which entails nontrivial engineering effort. These weaknesses limit the practical use of bucketing [69,118] in dynamic-shape workloads.
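Bucketing's dispatch-and-pad step can be sketched as follows. The bucket boundaries here are arbitrary choices for illustration, not values from the cited systems.

```python
import bisect

# Sketch of bucketed dispatch: pick the smallest bucket that fits the
# input's sequence length, then pad the input up to that static shape,
# so that one pre-built CUDA graph per bucket suffices.
BUCKETS = [128, 256, 512, 1024]  # illustrative static shapes

def dispatch(tokens, pad_value=0):
    """Return (bucket_size, padded_tokens) for a variable-length input."""
    bucket = BUCKETS[bisect.bisect_left(BUCKETS, len(tokens))]
    padded = tokens + [pad_value] * (bucket - len(tokens))
    return bucket, padded
```

Every padded position must later be masked out of attention and loss computations, which is exactly the extra engineering burden described above; the wasted compute on padding is the performance cost.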
One CUDA graph for every basic block/possible execution path. We now study possible solutions to the data-dependent control flow challenge. S. Stevenson et al. [105] propose converting every basic block (defined as contiguous statements without control flows) into its corresponding CUDA graph. This solution, however, is not practical for fragmented program flows such as the beam search algorithm [112] (where the average length of a basic block is only about 2 statements by our calculation), as launching multiple small CUDA graphs requires frequent CPU-GPU communications just like normal CUDA kernels, defeating the purpose of using them. TorchDynamo [86] uses graph breaks at control flows and instantiates one graph for every possible execution path.
Hence, the number of paths it needs to handle grows exponentially with the number of control flows (i.e., 2^N, where N is the number of if-else statements in the program), resulting in huge GPU memory consumption (as every CUDA graph consumes GPU memory). This does not even consider the complications brought by break and continue statements in loops, hindering its practical use in graph-based executions. We now conclude that those prior works [69,86,105,118] and obvious solutions are insufficient to resolve the challenges faced in graph-based executions for dynamic DNNs, and hence a solid new solution is required.

GRAPE: KEY IDEAS
To adequately address the challenges described in Section 2.2, we propose Grape, a new graph compiler that enables practical and efficient graph-based executions for dynamic DNNs on GPUs. Grape resolves the aforementioned Challenges #1-3 using three key components: (1) an alias predictor that accurately foretells if a tensor will be forwarded to the placeholders, eliminating extra data movements, (2) a metadata compressor that efficiently compresses CUDA graphs' memory regions, making them practical for dynamic-shape workloads, and (3) a predication rewriter that safely replaces data-dependent control flows with Python contexts, stitching the fragmented basic blocks together into a monolithic block that enables conversion to efficient graph-based executions. We now elaborate on those components.

Frontend-Aware Alias Prediction
Our first key idea, frontend-aware alias prediction, is based on the observation that smarter memory allocations can be made if the frontend Python code positions are considered. Current state-of-the-art machine learning frameworks [2,80] allocate memory independently of the Python frontend. This works in general use cases, but is inefficient when graph-based executions are applied, as data movements are always required from general-purpose memory regions to specially reserved placeholders' regions. We, however, notice that such data movements can be avoided by leveraging the regular nature of DNN executions and the rich lexical information from the Python frontend. Specifically, if a code position in Python yields a tensor that is copied into a placeholder in one iteration, it is also likely to yield another tensor that will be copied into the same placeholder in subsequent iterations. Such tensors are what we refer to as that placeholder's aliases.
Based on this insight, we devise an alias predictor that accurately predicts the placeholders' aliases directly from the Python frontend. Figure 4 shows an example that illustrates its workflow: In the first iteration, when a tensor is created, the alias predictor fetches a memory region from the memory allocator for the tensor and records the tensor's code position in the Python script (file app.py, line 42, ❶). By the time the tensor is copied into a placeholder, the predictor links its code position with the placeholder (❷), so that in the next iteration, when the same position is encountered, it proactively forwards the placeholder's memory region to the tensor (❸). This eliminates the data movements (❹) and improves the efficiency of graph-based executions, as our evaluation in Section 5.3 will show. Furthermore, the alias predictor is a transparent optimizer that requires zero changes to machine learning applications' source code.
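The record-then-predict loop can be emulated in pure Python. This is a sketch under simplifying assumptions: the real predictor hooks PyTorch's C++ caching allocator and tracks thread states, whereas here a code position is a (filename, line) pair looked up in a dictionary, and "storage" is just a string label.

```python
import sys

# Toy alias predictor: map a Python code position (filename, line number)
# to the placeholder whose reserved memory that position should receive
# in subsequent iterations.

class Tensor:
    def __init__(self, position):
        self.position = position  # code position where this tensor was made
        self.storage = "general"  # general-purpose region by default

class AliasPredictor:
    def __init__(self):
        self.position_to_placeholder = {}

    def allocate(self):
        f = sys._getframe(1)  # the application frame requesting memory
        pos = (f.f_code.co_filename, f.f_lineno)
        t = Tensor(pos)
        ph = self.position_to_placeholder.get(pos)
        if ph is not None:
            t.storage = ph    # forward the placeholder's region directly (❸)
        return t

    def copy_to_placeholder(self, tensor, placeholder):
        # Record: this code position yields aliases of this placeholder (❷),
        # so the copy is only ever paid once.
        self.position_to_placeholder[tensor.position] = placeholder
```

On the first iteration the tensor gets a general region and the copy happens; from the second iteration on, the same code position is recognized and the placeholder's memory is handed out up front.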

Metadata Compression
Our second key idea, metadata compression, is based on the observation that the memory regions created by the CUDA graphs exhibit a high level of data redundancy (e.g., usually more than 90% of all values are zeros), and can therefore be easily compressed with simple techniques such as run-length encoding [31]. Although the exact details regarding the CUDA graphs' memory regions are proprietary, by carefully studying their dumped values we infer that they are used for caching the CUDA graphs' metadata on the GPUs. The metadata of a CUDA graph describes what the graph captures and contains information such as the CUDA kernels that reside in the graph.
To prove our hypothesis that the memory regions are indeed used to store the metadata, and also to locate the source of the high sparsity of the CUDA graphs' memory regions, we consider the microbenchmark in Figure 5(a), where we use CUDA graphs to capture a single CUDA kernel Sample. The CUDA kernel accepts an object of type Argument as its function argument, whose member is an integer array of size ArraySize. In Figure 5(b), we examine the sparsity of the memory region of the CUDA graph that captures Sample while increasing the value of ArraySize and filling every byte with the value 0xFF. We observe from Figure 5(b) that the sparsity of the memory region decreases as we populate the data structure Argument with more non-zero values. This verifies our claim that the CUDA graphs' memory regions are used to cache, on the GPU, the metadata that describes the CUDA kernels the graphs capture, and that one piece of this metadata is the function arguments of the CUDA kernels. Furthermore, since most of the CUDA kernels in state-of-the-art machine learning workloads (e.g., GPT-2 [87], GPT-J [114], and Wav2Vec2 [8] in Table 1) underutilize the argument space provided to them and use pointer values as function arguments, they leave abundant sparsity and value redundancy (usually more than 90%) in the CUDA graphs' memory regions, making the latter highly compressible. Our evaluation in Section 5.4 shows that, by leveraging efficient data compression, we are able to use CUDA graphs to execute state-of-the-art dynamic DNNs [8,114] that would normally consume up to 100 GB of GPU memory on modern GPUs whose memory capacity is only 24 GB [71].
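The sparsity analysis described above boils down to counting zero bytes in a dumped region. The sketch below runs on a synthetic buffer, since the real metadata layout is proprietary; the buffer mimics an underused argument space with a couple of pointer-like values scattered through mostly-zero memory.

```python
# Measure sparsity (fraction of zero bytes) of a memory-region dump.
def sparsity(region: bytes) -> float:
    return region.count(0) / len(region)

# Synthetic stand-in for a dumped CUDA graph metadata region: a 4 KB
# buffer that is almost entirely zeros, with two 8-byte pointer-like
# values written in (hypothetical addresses, for illustration only).
synthetic = bytearray(4096)
synthetic[0:8] = (0x7F_0000_1000).to_bytes(8, "little")
synthetic[64:72] = (0x7F_0000_2000).to_bytes(8, "little")
```

Even the pointer values themselves are mostly zero bytes, so a buffer like this comes out well above 99% sparse, in line with the "more than 90%" redundancy the text reports for real dumps.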

Predication Contexts
Our third key idea, predication contexts, is based on the observation that many data-dependent control flows can in fact be converted into equivalent forms without control flows using predication contexts, a new Python context that is illustrated in Figure 6(a). We define a predication context as follows: If the predicate input in Figure 6(a) evaluates to true, then all GPU operations within the context go through and execute normally. Otherwise, all GPU operations within the context are nullified. Predication contexts realize the nullification by adding an extra argument predicate to every CUDA kernel in the machine learning framework's operator pool (see Figure 6(b)). At runtime, the input to a predication context is passed to all the CUDA kernels within it as a predicate, and the predicate's value controls whether each kernel's body is executed or not.
Although the idea of predication has been proposed in prior works [39,42,45,63] at the architecture level to handle short if-else statements, we significantly broaden its scope of applicability by enabling it to work on many common control flow patterns at the Python level. We show this in Figure 6(c), where we demonstrate how common control flows such as if, break, and continue can be replaced by predication contexts, yielding equivalent programs with those control flows removed. Despite the limitation of not being able to handle break statements in loops whose upper bound is unknown (e.g., while True loops), we notice that predication contexts are generic enough to cover important modules like the beam search module of GPT-like models [11,78,87,106,114] (shown in Figure 2). We further observe from the design of predication contexts that they do not change the launch configurations (e.g., block and grid dimensions) of the CUDA kernels within them. This complies with the deterministic constraint imposed by graph-based executions (see Section 2.1). Therefore, even if a program module has control flows and fragmented basic blocks, we can rewrite it with predication contexts into a form that has no control flows and can be captured as a monolithic CUDA graph. The graph behaves as if we stitched the fragmented basic blocks together and is more efficient than the original implementation with control flows, as it no longer needs to frequently synchronize with the CPU for control flow operations (see our evaluation in Section 5.3). In contrast to TorchDynamo [86], which needs to instantiate a number of graphs that grows exponentially with the number of if-else statements and cannot practically handle break and continue statements in loops, predication contexts create a single monolithic graph that can incorporate any number of common control flows (e.g., if, break, and continue) and therefore do not lead to huge GPU memory consumption (24 MB for the beam search module in Section 2.2).
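To make the semantics concrete, the following minimal pure-Python sketch mimics a predication context and shows how a loop with a data-dependent break can be rewritten into an equivalent, control-flow-free form with a fixed trip count. All names here are illustrative; in Grape the predicate is passed to CUDA kernels rather than checked in Python.

```python
class Predicate:
    """Minimal sketch of a predication context: operations inside
    the context are nullified when the predicate is False."""
    current = True  # stands in for Grape's gCurrentPredicate

    def __init__(self, cond):
        self.cond = bool(cond)

    def __enter__(self):
        self.saved = Predicate.current
        # Nesting ANDs predicates: an inner context can only narrow.
        Predicate.current = Predicate.current and self.cond

    def __exit__(self, *exc):
        Predicate.current = self.saved


def add_inplace(buf, i, v):
    """Stand-in for a predicated CUDA kernel: its body only runs
    when the predicate currently in effect is True."""
    if Predicate.current:
        buf[i] += v


def with_break(xs):
    """A loop with a data-dependent break (the original form)."""
    out = [0] * len(xs)
    for i, x in enumerate(xs):
        if x < 0:
            break
        out[i] = x * 2
    return out


def with_predication(xs):
    """Control-flow-free equivalent: the break becomes a predicate
    that, once False, nullifies all remaining iterations, so the
    loop always runs the same fixed number of iterations."""
    out = [0] * len(xs)
    alive = True
    for i, x in enumerate(xs):
        alive = alive and x >= 0
        with Predicate(alive):
            add_inplace(out, i, x * 2)
    return out
```

Because `with_predication` launches the same operations in the same order regardless of the data, it satisfies the deterministic constraint that CUDA graph capture imposes.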

IMPLEMENTATION DETAILS
We implement Grape in PyTorch [80], a state-of-the-art machine learning framework. To enable straightforward conversions from PyTorch [80] neural network modules into their equivalent compiled forms, we develop a top-level interface (shown in Figure 8(a)) that is easy for machine learning practitioners to use. The interface accepts two key arguments: (1) module: the module that is to be compiled and optimized, and (2) module_args_generator: a tensor generator that yields a list of tensor arguments to the module (to be used as placeholders). Figure 8(b) demonstrates an example of using the interface, where we compile the GPT-2 model [87] from the HuggingFace Transformers repository [120]. Table 2 provides an overview of the three key components of Grape, including whether they operate at compile time or runtime and the changes that they require to machine learning applications' source code. In the following subsections, we highlight some of the important details of the design of those key components.
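The shape of this interface can be sketched as follows. This is a pure-Python mock for illustration only: `grape_compile`, `GrapeModule`, and the toy module are hypothetical stand-ins, and only the two argument names match Figure 8(a).

```python
class GrapeModule:
    """Toy stand-in for Grape's compiled module: it "captures" one
    graph per input shape drawn from the generator and dispatches on
    the runtime input's shape, mirroring how CUDA graphs are
    captured and replayed per shape."""

    def __init__(self, module, module_args_generator):
        self.graphs = {}
        for args in module_args_generator():
            shape = tuple(len(a) for a in args)
            # Real capture would record the module's CUDA kernels
            # into a graph here; the mock just keeps the module.
            self.graphs[shape] = module

    def __call__(self, *args):
        shape = tuple(len(a) for a in args)
        return self.graphs[shape](*args)


def grape_compile(module, module_args_generator):
    # Hypothetical top-level entry point; only the argument names
    # (module, module_args_generator) come from the paper.
    return GrapeModule(module, module_args_generator)


# Example: "compile" a toy module over sequence lengths 1..4
double = lambda xs: [2 * x for x in xs]
gen = lambda: ([[0.0] * n] for n in range(1, 5))
compiled = grape_compile(double, gen)
```

The compiled object is then used as a drop-in replacement for the original module, with the generator's sample arguments serving as the placeholders for each captured shape.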

Alias Predictor
We implement the alias predictor by monitoring the traffic between PyTorch's [80] Python frontend and its GPU memory allocator (i.e., CUDACachingAllocator). Every time a GPU memory allocation is requested by the frontend, we attach the current Python thread state to the data pointer object that is returned by the caching allocator. The thread state is queried from the Python C APIs [85] and contains information such as the current code object and the last instruction executed. When a copy is made from a tensor to a CUDA graph's placeholder (marked by specially reserved memory regions in PyTorch [80]), the alias predictor is invoked to record the connection between the tensor's thread state at creation time and the corresponding placeholder, marking the thread state as generating the placeholder's alias. The next time a thread state with the same code object and last executed instruction as the recorded one is encountered, the alias predictor predicts it to be an alias of the recorded placeholder and proactively forwards the placeholder's memory region to it, eliminating extra data movements in the future.
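The mechanism can be sketched in pure Python, with `bytearray` buffers standing in for GPU allocations. The class and method names are illustrative; only the idea of keying allocations by code position (code object plus last bytecode index, as in the Python thread state) comes from the paper.

```python
import sys


class AliasPredictor:
    """Sketch of the alias predictor: each allocation is keyed by
    its Python code position, and a position once observed copying
    into a placeholder is predicted to produce that placeholder's
    alias on its next allocation."""

    def __init__(self):
        self.site_of = {}   # id(buffer) -> allocation site
        self.alias_of = {}  # allocation site -> placeholder buffer

    def allocate(self, size):
        frame = sys._getframe(1)              # the allocating frame
        site = (frame.f_code, frame.f_lasti)  # code position
        if site in self.alias_of:
            # Predicted alias: hand out the placeholder's memory so
            # the later copy into it becomes a no-op.  A misprediction
            # would be detected later by comparing pointer addresses.
            return self.alias_of[site]
        buf = bytearray(size)
        self.site_of[id(buf)] = site
        return buf

    def record_copy(self, src, placeholder):
        """Called when src is copied into a graph placeholder: link
        the source's allocation site to the placeholder."""
        site = self.site_of.get(id(src))
        if site is not None:
            self.alias_of[site] = placeholder


predictor = AliasPredictor()
placeholder = bytearray(8)        # a graph's reserved input region


def make_input():
    return predictor.allocate(8)  # one fixed code position


first = make_input()
predictor.record_copy(first, placeholder)  # copy observed once
second = make_input()  # same code position -> predicted alias
```

Because `make_input` allocates from the same code position both times, the second allocation is served directly from the placeholder's memory, so the copy into the placeholder is eliminated.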
A misprediction happens when a tensor is predicted to be a placeholder's alias but is later discovered not to be one. We detect mispredictions by checking at runtime whether the real inputs have their pointer addresses aligned with the placeholders (e.g., we check if A's pointer address aligns with ph_A at ❹ in Figure 4, and mark ❷ as mispredicted if A is reassigned and no longer takes on the address 0x7000). In the case of a misprediction, the mispredicted tensor has to move its memory contents out of the placeholder's memory region. This incurs runtime overheads but still preserves program correctness, as it is possible to use a placeholder's memory region for general purposes but not vice versa. Furthermore, we notice from our evaluation in Section 5 that mispredictions rarely happen thanks to the regular nature of DNN executions.
The alias predictor is enabled by default during graph-based executions and allows for automatic and transparent removal of data movements into placeholders (i.e., no changes to machine learning applications' source code). Our evaluation in Section 5.3 shows that the alias predictor is able to speed up the input preparation time of graph-based executions by up to 2.07×.

Metadata Compressor
As a prerequisite to implementing the metadata compressor, we first need a way to access (i.e., white-box) the CUDA graphs' memory regions, as CUDA [72] is NVIDIA's proprietary software and does not give access to the memory that it allocates for CUDA graphs. We accomplish this by using a customized GPU kernel module that is modified from the open-source release [75]. The kernel module is able to record and replay GPU memory allocations depending on the command that we send to it via the /proc filesystem. It grants us access to private GPU memory allocations that are made by CUDA [72] but not exposed to its users.
Figure 7 illustrates the white-boxing workflow: before a CUDA graph is created using the cudaGraphInstantiate function call, we issue a record command to the kernel module (❶ in Figure 7) to have it record the memory resource parameter assigned to that CUDA graph (including information such as its size, address, and unique handle, ❷). Immediately after the instantiation call resolves, we issue a replay command to the kernel module (❸) while invoking a cudaMalloc function call from user space that requests a memory region of the same size as the CUDA graph. In replay mode, the kernel module bypasses physical memory allocation and directly tapes out the previously recorded memory resource parameter to serve the request (❹). The data pointer returned by cudaMalloc hence points to the memory region that was previously allocated to the CUDA graph (❺), allowing us to inspect and manipulate it (i.e., compress and decompress it) in user space.
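The record/replay trick can be modeled in a few lines. This is a toy model only: the real mechanism lives in a GPU kernel module driven by /proc commands and operates on cudaMalloc and the graph's memory-resource parameters, while here a Python class and `bytearray` regions stand in for both.

```python
class RecordReplayAllocator:
    """Toy model of the modified GPU kernel module.  In record mode
    it logs the parameters of the next allocation (the CUDA graph's
    private region); in replay mode it bypasses physical allocation
    and serves a same-sized request with the recorded region, so a
    user-space "malloc" yields a pointer aliasing the graph's
    memory."""

    def __init__(self):
        self.mode = None      # set via the /proc-style command
        self.recorded = None  # recorded memory-resource parameters

    def command(self, mode):
        self.mode = mode      # stands in for a write to /proc

    def malloc(self, size):
        if (self.mode == "replay" and self.recorded is not None
                and len(self.recorded) == size):
            return self.recorded      # tape out the record (step 4)
        region = bytearray(size)      # "physical" allocation
        if self.mode == "record":
            self.recorded = region    # record its parameters (step 2)
        return region


driver = RecordReplayAllocator()
driver.command("record")              # step 1: before instantiation
graph_mem = driver.malloc(16)         # CUDA-graph-private region
driver.command("replay")              # step 3: after instantiation
view = driver.malloc(16)              # user-space cudaMalloc
# `view` aliases graph_mem (step 5): writes through it are visible
# to the graph, which is what enables user-space compression.
```

The key property is that the second allocation returns the very same region rather than a copy, which is what makes inspection and in-place manipulation possible.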
As we examine the dumped values of CUDA graphs' memory regions, we notice that they are full of zeros and repeated values (see Section 3.2 for the explanation). Based on this observation, we use the run-length encoding algorithm [31] to compress them. To make the algorithm more friendly to GPUs, we develop a page-based variant that allows each GPU thread to decompress one page in parallel. Specifically, at compile time, we fetch CUDA graphs' memory regions from the GPU to the CPU and have the CPU split the regions into pages, compress those pages sequentially, and send the compressed pages to the GPU upon completion. Each time a page is compressed, we record the current size of the compressed data in a separate array, so that when decompression happens, each thread knows exactly which part of the compressed data corresponds to its own page by reading the array. Our evaluation in Section 5.4 shows that this page-based run-length encoding is both effective and efficient, as it achieves up to a 36.43× compression ratio with negligible performance overhead (less than 1% of the time spent on the graph-based executions of DNNs).
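A minimal version of this page-based run-length encoding might look like the following Python sketch. The page size is an assumption for illustration; the sequential CPU-side compression and the per-page offsets array match the description above, while real decompression runs one GPU thread per page.

```python
PAGE_SIZE = 64  # bytes per page; Grape's actual page size may differ


def compress(data, page_size=PAGE_SIZE):
    """CPU side: split the region into pages and run-length encode
    each page sequentially.  offsets[p] records where page p's runs
    begin, so each decompressing thread can find its own slice."""
    runs, offsets = [], []
    for start in range(0, len(data), page_size):
        offsets.append(len(runs))
        page, i = data[start:start + page_size], 0
        while i < len(page):
            j = i
            while j < len(page) and page[j] == page[i]:
                j += 1
            runs.append((page[i], j - i))  # (value, run length)
            i = j
    offsets.append(len(runs))
    return runs, offsets


def decompress_page(runs, offsets, p):
    """What one GPU thread does for its page p, independently of
    every other page (no scan over the whole compressed stream)."""
    out = bytearray()
    for value, length in runs[offsets[p]:offsets[p + 1]]:
        out.extend([value] * length)
    return bytes(out)


# A region full of zeros and repeated values, as observed in Section 3.2
region = bytes(200) + b"\xff" * 8 + bytes(48)
runs, offsets = compress(region)
restored = b"".join(decompress_page(runs, offsets, p)
                    for p in range(len(offsets) - 1))
```

Because each page's runs occupy a contiguous, independently addressable slice of the compressed stream, decompression parallelizes trivially across pages, which is why the runtime overhead stays negligible.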

Predication Rewriter
We present two important implementation details of the predication contexts: (1) the source code transformations that we perform to make them functional in PyTorch [80], and (2) the UnlikelyPredicate context variant that we support, which provides machine learning practitioners with the flexibility of marking contexts as unlikely for further performance benefits.

Predication-Enabled Framework.
To enable the predication contexts, we need to enhance PyTorch's [80] C++ runtime. We first look up all the CUDA __global__ functions in PyTorch's [80] operator pool. For each function we find, we make a copy of its implementation and append an extra argument predicate to the copy. The predicate governs the copy's entire body (as in Figure 6(b)) and controls whether the kernel is nullified or not. In addition to transforming GPU kernels, we also look for CUDA APIs [72] that implicitly invoke GPU operations under the hood (e.g., cudaMemcpyAsync) and develop their equivalent CUDA kernel implementations (also with a predicate boolean variable appended at the end of their arguments and governing their kernel bodies). Furthermore, we maintain a global variable named gCurrentPredicate in the C++ runtime. The variable points to the current predicate value in effect (evaluated from the inputs to predication contexts) and can be accessed by all GPU kernels. Whenever a kernel in a predication context is invoked, we pass this predicate variable as the last argument. If the predicate evaluates to true, the kernel behaves as the original. Otherwise, the kernel is nullified. When executing a DNN, we switch between the CUDA kernel implementations that have a predicate and those that do not by checking whether a predication context is in effect. This ensures that our enhancements affect neither the performance nor the correctness of non-graph-based executions.
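The kernel-side transformation and dispatch can be sketched as follows, with Python standing in for the C++/CUDA runtime. `gCurrentPredicate` mirrors the global variable described above; every other name is illustrative.

```python
gCurrentPredicate = [True]  # slot for the predicate value in effect


def make_predicated(kernel):
    """Make a copy of a kernel with a trailing `predicate` argument
    that governs its entire body, as in Figure 6(b)."""
    def predicated(*args, predicate):
        if not predicate:
            return None          # kernel body is nullified
        return kernel(*args)
    return predicated


def scale_kernel(buf, factor):
    """A stand-in __global__ function from the operator pool."""
    for i in range(len(buf)):
        buf[i] *= factor


# The operator pool keeps both variants; launches inside a
# predication context use the predicated copy, everything else uses
# the original, leaving non-graph executions untouched.
pool = {"scale": (scale_kernel, make_predicated(scale_kernel))}


def launch(name, *args, in_predication_ctx=False):
    original, predicated = pool[name]
    if in_predication_ctx:
        return predicated(*args, predicate=gCurrentPredicate[0])
    return original(*args)
```

Keeping two variants per kernel, rather than always passing a predicate, is what guarantees that code paths outside predication contexts pay no extra argument or branch cost.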

Unlikely Predicates.
Although the predicate in Figure 6(b) is able to shorten most of the execution time of a CUDA kernel when it evaluates to false, executing an empty CUDA kernel still takes time (on the order of several microseconds on a modern RTX 3090 GPU [71]). This small amount of time can nevertheless add up to a noticeable runtime overhead when there are many GPU operations within a predication context. To derive further performance benefits, we allow machine learning practitioners to specify a predication context that is unlikely to be executed using an UnlikelyPredicate context, similar to the unlikely attribute in the C++ programming language [19].
Figure 9 compares the compilation workflow of a Predicate context with that of an UnlikelyPredicate context. Unlike a Predicate context, which can be compiled in a single run (Figure 9(a)), an UnlikelyPredicate context requires two compilation runs (Figure 9(b)). In the first run, the capture for the parent graph context graph_ctx is turned off and the GPU operations within the child Unlikely context are first captured as a standalone CUDA subgraph (① in Figure 9(b)). The subgraph is configured to be launchable from the GPU side and wrapped in a CUDA kernel (②). In the second run, the capture for the parent graph context is turned on and the wrapper kernel is used in place of the body statements of Unlikely (③).
Compared with Predicate contexts, UnlikelyPredicate contexts have the strength of constant-time latencies when they are not taken (regardless of how many statements they contain). However, they have the weakness of not being nestable (i.e., we cannot nest one UnlikelyPredicate inside another due to the lack of programming language support in CUDA [72]). They also require domain-specific knowledge from frontend machine learning practitioners to indicate which predicates are unlikely. These shortcomings make them less generic than Predicate contexts.
As an example application of the UnlikelyPredicate contexts, we consider the beam search module [112] of GPT-2 [87]. In the current state-of-the-art implementation of beam search [120], the decoding output from the GPT-2 model [87] (which represents the probabilities of the next token over all the possible vocabulary entries) is first cherry-picked to filter out the top 2×nbeams tokens (where nbeams is the beam width of the beam search algorithm [112]) before being given to the beam search module. The reason for this 2× over-provisioning is that a filtered token may be the end-of-sequence token <EOS>, which is not selected by the beam search module for generating future tokens. The beam search module iterates through those filtered tokens until it has nbeams tokens that are not <EOS> before giving them to the GPT-2 model [87] for the next decoding iteration. The program that shows the high-level control flow and its semantically equivalent form using Predicate contexts are illustrated in Figure 10(a) and Figure 10(b), respectively. We observe, however, that since the probability of generating an <EOS> token is low (roughly once per decoding iteration in our evaluation), the first nbeams tokens are usually enough to fulfill the nbeams requirement of the beam search module. Based on this observation, we can split the loop [7,79] into two (before and after nbeams) and mark the second loop as unlikely. This is illustrated in Figure 10(c).
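The loop split can be illustrated with the following pure-Python sketch. `EOS` and the function names are hypothetical, and the body of the second loop is what would be placed under an UnlikelyPredicate context.

```python
EOS = -1  # stand-in for the <EOS> token id


def pick_beams(tokens, nbeams):
    """Reference behaviour: scan the 2*nbeams filtered tokens until
    nbeams non-<EOS> tokens have been collected (the data-dependent
    loop of Figure 10(a))."""
    picked = []
    for t in tokens:
        if t != EOS:
            picked.append(t)
        if len(picked) == nbeams:
            break
    return picked


def pick_beams_split(tokens, nbeams):
    """Split form (Figure 10(c)): the first nbeams tokens usually
    suffice; the remainder only matters when an <EOS> was among
    them, so the second loop can be marked unlikely."""
    picked = [t for t in tokens[:nbeams] if t != EOS]
    need_more = len(picked) < nbeams   # rarely True
    if need_more:                      # body of UnlikelyPredicate
        for t in tokens[nbeams:]:
            if t != EOS and len(picked) < nbeams:
                picked.append(t)
    return picked
```

In the common case `need_more` is false, so the entire second loop collapses to a single not-taken GPU-launched subgraph, which is where the constant-time latency of UnlikelyPredicate contexts pays off.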
We measure the runtime of the three implementations (the original program, the one using Predicate contexts, and the one using UnlikelyPredicate contexts) on an RTX 3090 GPU [71] (with the latter two compiled using Grape). Figure 10(d) shows the relative speedup over the original program (higher is better). We observe from the figure that while Predicate contexts achieve a speedup of 1.27× over the original program, UnlikelyPredicate contexts further increase this speedup to 1.79× (hence 1.41× better than pure Predicate contexts), demonstrating their value when domain-specific knowledge can be provided.
To prove that Grape's key ideas apply generically to other hardware platforms, we also evaluate on a GCP a2-highgpu-1g [34] instance that is equipped with 12 Intel Xeon 8273CL virtual CPUs [20] (with 85 GB RAM) and an NVIDIA A100 GPU [70] (Ampere architecture [70] with 40 GB HBM2 [50]), using the same software stack as the main compute platform.
Applications. We evaluate our new compiler, Grape, on three state-of-the-art DNN models from the HuggingFace Transformers repository ver. 4.18 [120]: GPT-2 [87], GPT-J [114] (with 6 billion parameters), and Wav2Vec2 [8]. We use a maximum sequence length of 1024, which corresponds to the context size that the GPT-2 model [87] is trained on, a batch size of 1, and a beam width of 5 [112] when generating texts with both the GPT-2 [87] and the GPT-J [114] models. Additionally, we use a training batch size of 8 to fine-tune the Wav2Vec2 model [8] on the TIMIT dataset [30] for one epoch [113] (with 391 different input shapes).
Baselines. We compare Grape with two main baselines: (1) the original implementation from PyTorch [80] without CUDA graphs, denoted as Baseline, and (2) the graph-based executions from PyTorch [80] using CUDA graphs that handle dynamic-shape workloads by capturing one graph per input shape, denoted as PtGraph. All three systems use cuDNN [16] and invoke the exact same set of CUDA kernels under the hood.
Metrics. We show results on (1) the end-to-end latency of the entire model, measured in seconds (lower is better), (2) the runtime of each submodule (InputPrep, Model, and BeamSearch, as in Figure 3) in one decoding iteration, measured in seconds (lower is better), and (3) the GPU memory consumption per CUDA graph created, measured in MB (lower is better).

End-to-End Results
Figure 11 shows the end-to-end latency comparison between Baseline, PtGraph, and Grape on the three machine learning workloads and the two hardware platforms. We are unable to get PtGraph functional on the GPT-J [114] and Wav2Vec2 [8] models due to the GPU memory consumption challenge (Section 2.2, Challenge #2). We make the following key observations from Figure 11(a): (1) Grape achieves noticeable speedups (up to 2.97×) over Baseline on all three models evaluated [8,87,114], showing the importance of using graph-based executions to boost the performance of machine learning workloads.
(2) On the GPT-2 [87] model, where both PtGraph and Grape can be applied, Grape outperforms PtGraph by 1.26×. This is because Grape's alias predictor saves the extra data movements of copying runtime input values to the graph's placeholders, and its predication rewriter allows it to optimize the beam search module, which is not optimizable by PtGraph due to its data-dependent control flows (see Section 5.3 for the breakdown between the two components).
(3) Grape can be applied to workloads such as the GPT-J [114] and Wav2Vec2 [8] models, where PtGraph is not practical due to the extremely large GPU memory consumption, achieving speedups of 1.41× and 1.38×, respectively. This is due to its capability of reducing the GPU memory consumption of CUDA graphs by leveraging the high sparsity and value redundancy of their memory regions (see Section 5.4).

We further observe from Figure 11(b) that Grape works equally well on the A100 [70] hardware platform, achieving even greater speedups (up to 2.99×) compared with Baseline. This not only proves Grape's generality across different hardware platforms, but stresses its importance further as GPUs with stronger compute power are used [10].

Breakdown of Performance Speedup
To better understand the performance speedup of Grape over PtGraph, we analyze again the runtime breakdown of GPT-2's [87] text generation pipeline on the RTX 3090 GPU [71], which is similar to Figure 3 but compares PtGraph against Grape. Figure 12 shows the runtime breakdown comparison. We observe from the figure that Grape is able to speed up the InputPrep portion by 2.07× and the BeamSearch portion by 1.79×. While the former is contributed by the alias predictor (34% of the 1.26× speedup of Grape over PtGraph in Figure 11(a)), the latter is contributed by the predication rewriter (66% of the speedup). This indicates that both components are necessary to maximize the performance of graph-based executions.

Metadata Compression
To better understand how the GPU memory consumption challenge is resolved by Grape, we show the GPU memory consumption per CUDA graph creation in Figure 13 on the RTX 3090 GPU [71]. We observe from the figure that Grape achieves significant compression ratios over PtGraph across all three models evaluated [8,87,114] (up to a 36.43× GPU memory reduction). We also study the compilation and runtime overheads caused respectively by compressing and decompressing the CUDA graphs' memory regions. We notice that while the compression overheads are large (constituting 19.8%, 58.5%, and 30.3% of the total compilation time for GPT-2 [87], GPT-J [114], and Wav2Vec2 [8], respectively), they are incurred ahead of time and hence do not affect the model's runtime performance (which is why we implement the compression on the CPU and do not parallelize it, for simplicity). We also observe that the decompression runtime overheads are negligible compared with the models' execution time (i.e., less than 1%). This is because the decompression of the page-based run-length encoding that we adopt is highly parallelizable and friendly to GPUs.
We further repeat the experiments on the A100 [70] and observe similar numbers. This is expected since the GPU memory allocations are made by the CUDA software stack [72] and hence are largely independent of the underlying GPU, indicating that Grape brings practical and efficient graph-based executions for dynamic-shape workloads to different GPUs.

RELATED WORKS
Grape addresses key challenges in making graph-based executions both practical and efficient for dynamic DNNs. As Grape aims at eliminating CPU overheads, it is orthogonal to many system optimization techniques such as reduced precision [66,67,116], quantization [38,44,52,57,103,108,121], and operator fusion [14,21,59,62,98,99,117,128,136]. In fact, those optimizations can be used jointly with Grape to derive even greater performance benefits.
DNN Benchmarking and Profiling Analyses. Grape focuses on the efficiency of state-of-the-art dynamic DNNs and uses models that are prevalent in the areas of text generation and speech recognition for its evaluation. TBD [137], MLPerf Training [64], MLPerf Inference [91], and DawnBench [18] are benchmark suites that bring together state-of-the-art DNNs from various domains (such as image classification [41], object detection [61], and recommendation [68]). To help machine learning practitioners understand the runtime and GPU memory allocation behaviors of those important DNNs, there are profiling tools that accurately diagnose DNN executions with visualizations, such as DeepView [12], Skyline [126], RL-Scope [32], and Hotline [104]. Those useful tools greatly inspired part of the ideas in Grape.
Machine Learning Compilers. Grape is a compiler for machine learning workloads; hence it is related to numerous research works in the area of machine learning compilers, including those that rewrite DNNs at the graph level [23,51,115,123,133], those that allow easy development of low-level tensor programs [9,14,21,26,37,60,101,110,119,122,129,132,134,135,138], those that target efficient code generation for irregular [25,109], sparse [124], dynamic-shape [102,130,139], or recursive [24] workloads, those that propose expressive graph-level representations [77,86,92], and those that cover multi-level optimizations [58,98]. These machine learning compilers are orthogonal to Grape. In fact, we notice that some compiler frameworks such as Hidet [21] and TVM [14] already adopt graph-based executions in their codebases. It is hence possible for them to benefit from Grape.
Data-dependent Control Flows. It is challenging to handle data-dependent control flows on a SIMT hardware platform like GPUs [1]. To address this challenge, prior works propose solutions such as warp compaction [27-29], multi-path execution [22,65,94], MIMD [53,55,56], and predication [39,42,45,63]. Although the idea of predication [39,42,45,63] is similar in spirit to that of Grape's predication contexts, the former is proposed at the architecture level to handle short if-else statements, whereas the latter works at the Python level and can handle common control flow patterns (e.g., if, break, continue) in state-of-the-art DNN workloads.

CONCLUSION
We build Grape, a new graph compiler that enables practical and efficient graph-based executions for dynamic DNNs. On the three state-of-the-art DNNs evaluated, Grape improves performance by up to 1.26× over prior works on graph-based executions. Moreover, Grape significantly broadens the optimization scope of graph-based executions by compressing their GPU memory consumption by up to 36.43×, enabling it to optimize workloads that are impractical for prior methods and resulting in up to 1.82× better performance. System researchers and machine learning practitioners can all benefit from Grape, as it speeds up training and inference workloads using graph-based executions while allowing easy development at the Python level. We hope that Grape will become a platform for further research on efficient system design for key machine learning applications.

A.6 Evaluation and expected results
The three scripts in the experiment workflow automatically run the experiments that correspond respectively to Figures 11, 12, and 13 in Section 5. After each experiment completes, a CSV file is dumped into the experiments folder. Plotting those CSV files reproduces Figures 11 to 13.

A.7 Experiment customization
Not applicable.

Figure 4 :
Figure 4: An example that illustrates how the alias predictor eliminates data movements by predicting the placeholders' aliases using information from the Python frontend. The pointer values (the 0x addresses) are used to demonstrate the aliasing.

Figure 5 :
Figure 5: (a) The microbenchmark that is used to study the CUDA graphs' memory regions. It captures a CUDA kernel Sample that takes ArraySize integers as an argument. (b) Sparsity of the captured CUDA graph's memory region versus the value of ArraySize (with every byte of arr being 0xFF).

Figure 6 :
Figure 6: (a) An example predication context. (b) The predication contexts control whether the GPU operations within them are nullified or not by using a predicate. (c) A showcase of how common control flows such as if, break, and continue (from left to right) can be replaced by the predication contexts.

Figure 7 :
Figure 7: The workflow of white-boxing CUDA graphs' memory regions

Figure 9 :
Figure 9: Comparison between the compilation workflows of (a) Predicate and (b) UnlikelyPredicate contexts

Figure 10 :
Figure 10: (a) The original Python program and its semantically equivalent forms that use (b) Predicate contexts and (c) UnlikelyPredicate contexts. (d) Performance comparison between (a-c) on an RTX 3090 GPU.

Figure 13 :
Figure 13: Comparison between PtGraph and Grape on the GPU memory consumption per CUDA graph creation

Table 2 :
Table 2: Overview of Grape's Key Components

Figure 8 :
Figure 8: … with dynamic sequence lengths (∈ [1, 1024)). The compiled GrapeModule can work as a drop-in replacement of the original module and supports both training and inference.