Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and Replay

HPC is a heterogeneous world in which host and device code are interleaved throughout the application. Given the significant performance advantage of accelerators, device code execution time is becoming the new bottleneck. Tuning the accelerated parts is consequently highly desirable but often impractical due to the large overall application runtime which includes unrelated host parts. We propose a Record-Replay (RR) mechanism to facilitate auto-tuning of large (OpenMP) offload applications. RR dissects the application, effectively isolating GPU kernels into independent executables. These comparatively small codelets are amenable to various forms of post-processing, including elaborate auto-tuning. By eliminating the resource requirements and application dependencies, massively parallel and distributed auto-tuning becomes feasible. Utilizing RR, we run scalable Bayesian Optimization to determine optimal kernel launch parameters. LULESH showcases an end-to-end speedup of up to $1.53 \times$, while RR enables $102\times$ faster tuning compared to existing approaches using the entire application.


INTRODUCTION
With the advent of heterogeneous high-performance computing (HPC), both systems and applications have moved the bulk of computation onto GPU accelerators. While scalable tuning of large software systems has always been challenging, the heterogeneous nature of applications, and the consequently complicated compilation toolchain, have made it even harder.
Especially large applications need scalable solutions for any non-trivial tuning effort. Given that a full application tuned for a realistic workload exposes various tunable code regions, e.g., GPU kernel invocations, which themselves expose multiple co-dependent tunable parameters, the search space is often enormous. If a single evaluation of an explored configuration is as expensive as a full run of the application, the number of tuning steps is severely limited. Besides significant time commitment, other resources also restrict when and how tuning can be done. Real-world applications often come with complicated build systems that specify many explicit dependencies and require high-end systems to be executable in the first place. Further, partial tuning efforts, e.g., after a new GPU kernel was added or an existing one was substantially modified, could require the same amount of effort as full application tuning, which is wasteful and often impractical.
In this work we present a practical solution to the scalability problem for large application tuning. At the core of our work is a record-and-replay mechanism for (OpenMP-originating 1) GPU kernels. After the application has been executed once in record mode, we can replay each GPU kernel in isolation. Since the replay has no ties to the original application, original input data files, or build dependencies, we can migrate it to a different machine with a compatible GPU architecture 2. Further, tuning can be done selectively for important kernels only, and it can be parallelized easily as the replay mechanism is completely self-contained.

Figure 1: Simplified illustration of a single iteration in a tuning loop. On the left, the full application is tuned as a whole; each iteration therefore executes all kernels as well as the host code. On the right, the extracted GPU kernels are tuned standalone. The kernels can be tuned in parallel, and selectively, e.g., to ignore small or non-repeating kernels. Verification of the standalone code against the original recorded ground truth is possible if desired.
Figure 1 depicts the primary concept of the proposed approach. Both sides show a single tuning step. Traditionally [30,37], as shown on the left-hand side, the entire application is executed in every optimization step regardless of which kernels will be optimized. With our work we can transition to the right-hand side, where selected kernels are independent and thus can be executed in parallel to drastically reduce tuning time and effort.

Contributions
In this paper we make the following contributions:
(1) A novel record-and-replay capability for GPU kernels invoked via the LLVM/OpenMP offloading runtimes. It offers fine-grained control over the recording and allows executions to be replayed after the kernel launch parameters or even the kernel code itself have been modified, e.g., for tuning or debugging.
(2) An implementation of the record-and-replay mechanism that is in large parts already available in community LLVM (starting with version 16), with other parts under review.
(3) A case study showing the utility of record-and-replay to facilitate scalable Bayesian tuning of GPU kernels. Our scripts are public to simplify utilization of our work and reproducibility.
(4) An OpenMP language extension implemented in LLVM/Clang to allow fine-grained tuning of GPU kernel parameters otherwise only accessible via kernel languages such as CUDA.
(5) A thorough evaluation of kernel launch parameter influence on OpenMC [29,36], a full-scale scientific application, and two HPC proxy applications. Our results show the need for tuning to optimally utilize AMD and NVIDIA GPUs, while they also clearly highlight the need for scalable solutions.

Limitations
As a research prototype, our work has the following limitations: (1) We inherit the inability to target Intel GPUs from LLVM's OpenMP implementation, though we expect our approach to work for any target supported by LLVM's offloading runtimes. (2) We use a simple memory allocation scheme, described in Section 3, which is sufficient for common GPU programs, like our benchmarks, but over-provisions the data that has to be saved in a checkpoint file, resulting in larger files. (3) Our verification scheme applied after a kernel replay is based on bit-wise equivalent memory output. This is not a conceptual limitation, but we currently do not allow user-provided verification functions. Consequently, non-deterministic kernels and kernels run in a multi-threaded environment might fail to pass verification even though the result is acceptable.
The remainder of this paper is organized as follows. We provide necessary background information in Section 2. The record-and-replay functionality is described in Section 3, followed by the scalable Bayesian optimization use case in Section 4. To evaluate our implementation we provide a detailed look at the tuning results for the full OpenMC application and two HPC proxy applications in Section 5. Before we conclude in Section 7, we compare our work with existing approaches in Section 6.

BACKGROUND
Throughout the paper we will use OpenMP-centric terminology and clarify, when appropriate, how the concepts map to hardware or other GPU languages.

OpenMP Offload Programming Model
OpenMP offload uses pragma directives to offload annotated code regions to accelerators. The target directive executes the enclosed code region on a device. To exploit the massive parallelism of GPUs, developers prescribe, through additional directives, the available algorithmic parallelism. For example, the teams directive designates the annotated region for execution by one (initial) thread per independent team, which map naturally to GPU thread blocks or workgroups. To further exploit all GPU threads, developers can compose the teams directive with the parallel directive, which specifies that the associated code region should be executed by all threads in each team concurrently. Since parallel in GPU-offloaded regions is commonly mapped to all threads in a team (aka thread block or workgroup), the combination of teams and parallel effectively allows utilizing the entire GPU device the same way a CUDA or HIP kernel launch does. The developer can explicitly control the number of teams and the maximum number of threads within a team by using the clauses num_teams and thread_limit, respectively. While the separation of parallel levels provides flexibility to the user, it can cause performance degradation [14].
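As a minimal illustration of these clauses (a hypothetical SAXPY-style loop; the function and variable names are not from the paper), the combined directive below launches a kernel with an explicit number of teams and a thread limit per team:

// Hypothetical example: offload a SAXPY-style loop with an explicit
// launch configuration of 104 teams and at most 256 threads per team.
void saxpy(int n, float a, float *x, float *y) {
  #pragma omp target teams distribute parallel for \
      num_teams(104) thread_limit(256) map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}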

Efficient Device Utilization
In summary, GPU kernels should present sufficient parallel work items to oversubscribe the GPU hardware and consequently allow latency hiding by switching between work that is stalled and work that is ready to compute.
The number of registers used by a kernel can significantly impact performance as each hardware compute unit (CU), aka streaming multiprocessor (SM), has a fixed number of registers shared by all resident teams. Decreasing the number of registers used by a kernel effectively increases the number of resident threads and allows the scheduler to hide latencies, often resulting in a performance benefit. However, enforcing a limit on the number of registers being used may cause the register allocator to spill temporary values to device memory and slow down application execution.
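As a back-of-the-envelope illustration (the 65536-register file per SM assumed here is typical of, e.g., NVIDIA V100-class hardware and is not a figure from the paper): a kernel compiled to use 64 registers per thread can keep at most 65536 / 64 = 1024 threads resident per SM, whereas restricting it to 32 registers per thread raises that bound to 2048 resident threads, and therefore more opportunities for latency hiding, at the risk of spilling the evicted values to device memory.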
Developers can use vendor-specific annotations, called launch bounds, to provide hints and constraints to the compiler regarding the maximum number of threads of each team and the minimum number of teams that should reside concurrently on a CU. Decreasing the maximum thread count effectively allows the compiler to allocate more registers per thread, possibly avoiding register spilling at the cost of reduced parallelism. Increasing the minimum number of teams, on the other hand, lowers the number of available registers per team and thread but may allow for more latency hiding. Expert developers usually perform manual tuning to identify values that achieve better performance than the generic default bounds.

Bayesian Optimization
Bayesian Optimization (BO) is a machine-learning-based optimization algorithm used to find the parameters that globally optimize a given black-box function. BO relies on two important components: (1) the objective function f(x) to optimize, for which BO builds a surrogate model, and (2) an acquisition function, which proposes the next point to evaluate.
The optimization itself is an iterative process presented in Algorithm 2. The definition space of the objective function f is denoted as S and represents a bounded hyper-rectangle of d dimensions (S = {x ∈ R^d : l_i ≤ x_i ≤ u_i}). BO randomly selects n samples from S, denoted X_0, as an initial population (Line 2) and evaluates f(X_0) (Line 3). Given these data samples, BO fits a model M(X, y) (typically a Gaussian Process model) with X = X_0 and y = f(X_0) that serves as the surrogate model of the objective function (Line 4). Once the initial model is created, in each BO iteration the acquisition function suggests a new sampling point x_t (Line 6). BO evaluates the objective function at the suggested point (Line 7), inserts the newly evaluated pair into the sampled set (Line 8), and updates the model (Line 9). The iterations continue until a user-defined upper bound of iterations is met.
BO is useful in situations where the objective function (i) is a black box for which no closed form is known (nor its gradients), (ii) is expensive to evaluate, and (iii) may be noisy to evaluate. These characteristics make BO an ideal search algorithm for auto-tuning [5,19,30]. Typically, approaches map auto-tuning to BO by setting as the objective function the speedup of the auto-tuned kernel over a vector of tunable parameters (e.g., [30] uses as tunable parameters, among others, the uncore frequency, the number of threads, and enabling or disabling hyperthreading). The acquisition function selects new candidate tunable parameters that are likely to maximize speedup. BO evaluates those candidates and builds a surrogate model of the speedup function with respect to the selected tunable parameters.

RECORD-AND-REPLAY
At the core of our work is a configurable record-and-replay mechanism integrated in the LLVM/OpenMP offloading runtime. Through recording of all relevant state as well as the GPU kernel code, combined with careful setup of the replay execution environment, we can isolate GPU kernels from the rest of the application and make them standalone executables with the same input as used by the application. Record-and-replay enables a variety of use cases, including debugging, parameter tuning, and exploration of code modifications. We will walk through the recording and replay steps as depicted in Figure 3 and Figure 5.

Recording
The recording is done as part of the regular application execution with a regular LLVM/OpenMP offloading library, version 16.0.0 or newer. No compilation changes are necessary and any kernel executed via the OpenMP offloading mechanism can be recorded. Thus, CUDA (or HIP) kernels executed via a wrapper layer on top of LLVM/OpenMP are eligible as well [8]. Recording is enabled by setting the environment variable LIBOMPTARGET_RECORDING to true. In addition to the filtering environment variables, explained in Section 3.1.1, users can determine whether the memory state after a kernel execution should be recorded as well in order to allow verification as part of the replay. We also provide the LIBOMPTARGET_RR_DEVMEM_SIZE environment variable to control the size of the GPU memory pre-allocation, as described in Section 3.3.
In Figure 3 the different recording steps are visualized. During application startup, as part of the OpenMP runtime initialization, the recording is initialized if the user enabled it via the environment variable. In this step, denoted as 1 in Figure 3, we pre-allocate GPU memory which is later used to serve memory allocation requests by the application and the offloading runtime. This custom memory pool and allocation scheme allows us to easily determine the application's memory usage and identify all allocated memory. It uses a simple bump allocator and tracks all memory allocations that have happened up to the current point of execution. Further, it ensures the memory is allocated (mostly) contiguously in the virtual address space, which allows for efficient device-to-host copies. Thus, when we want to save the entire device memory state, as required in steps 2 and 3, we can simply copy out and store the entire user-allocated part of our custom memory pool. For now, we effectively provide a bump (pointer) allocator, which is sufficient for our benchmarks even if we may waste (and therefore record and store to the checkpoint file) some unnecessary or deallocated memory regions. While our prototype effectively ignores deallocations, there is no reason we could not add support in the future, e.g., via free lists, without breaking any assumption of our scheme.
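A minimal sketch of such a bump allocator follows; the member names are illustrative and do not reflect the runtime's actual data structures:

#include <cstddef>

// Sketch of the bump allocator described above: allocations are carved
// sequentially out of one pre-allocated device pool, deallocations are
// ignored, and the high-water mark tells us exactly how much memory has
// to be copied out when a kernel is recorded.
struct BumpPool {
  char *Base;          // start of the pre-allocated device memory pool
  size_t Size;         // pool size (cf. LIBOMPTARGET_RR_DEVMEM_SIZE)
  size_t Offset = 0;   // bytes handed out so far (high-water mark)

  void *allocate(size_t Bytes, size_t Align = 16) {
    size_t Start = (Offset + Align - 1) & ~(Align - 1);
    if (Start + Bytes > Size)
      return nullptr;  // pool exhausted
    Offset = Start + Bytes;
    return Base + Start;
  }
  void deallocate(void *) { /* ignored by the prototype */ }
};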
During the application execution the recording will be invoked as part of a kernel launch. For now, we assume that the recording filters described in Section 3.1.1 did not prevent recording. In this case, recording step 2 is performed prior to the kernel launch, and step 3 is optionally performed after the kernel execution if the environment variable LIBOMPTARGET_RR_SAVE_OUTPUT is set to true. In the former step we save the device (memory) state together with the kernel launch arguments and launch parameters. Thus, the values passed to the kernel from the host, the values passed to the driver that determine the launch configuration, and the entire allocated GPU memory are stored to disk for use by the replay tool. In addition, we record the relevant parts of the device image as well. Especially global variables need to be read and saved at this point as they are not allocated via our memory pool but could have updated values since we loaded the image onto the device. The kernel code, in binary form or in LLVM-IR if recording is composed with the JIT capability [35], is saved either at this point or eagerly when the record capability was initialized (step 1).
To ease tooling and allow manual inspection and modifications we create one JSON file per recorded kernel (launch). It contains all necessary metadata to replay the kernel in isolation. This includes user-provided information, e.g., the number of teams required, as well as runtime-generated information, e.g., the placement of kernel arguments in device memory. Further, auxiliary files, such as the binary files containing the memory dump and the kernel code, are referenced from here. All entries in the recording JSON file are described in Figure 4. It is worth noting that recording has to be rigorously synchronized with regards to the execution of the kernel, associated data movement, as triggered by the map clause in OpenMP, and OpenMP dependence resolution that might interact with other threads. While a multi-threaded environment might result in read-write races during the recording, we could prevent these if the hardware does not silently resolve them, e.g., by providing unspecified values for the concurrently written data regions. That said, the hardware resolution is generally sufficient for recording and replay since the recorded kernel could not have accessed the data without triggering a race, and therefore undefined behavior, itself. However, recording unspecified values will impact our replay validation scheme, as discussed further in Section 3.2.1.

Recording Filters.
Since full applications have many kernels, and kernels are executed many times, we provide control over the recording via filters. By default, we retain one recording for each kernel that was invoked at least once during the run. Subsequent invocations will overwrite the existing recording if the number of teams (aka thread blocks or work groups) is larger. This heuristic tries to save "the largest" invocation of each kernel, which is especially useful for tuning purposes. However, users can use environment variables to change this default behavior in various ways. First, the value of the LIBOMPTARGET_RR_RECORD_FILTER environment variable is interpreted as a regular expression and only if it matches the kernel name is recording considered. In a second step we determine the instance recording policy defined
via LIBOMPTARGET_RR_INSTANCE_POLICY. The user can provide a fixed limit to the invocations that should be recorded, or alternatively pick one of the predefined, parameterized policies. As an example, LIBOMPTARGET_RR_INSTANCE_POLICY=5-10 will record invocations five to ten of each kernel that matches the record filter. The set of initial policies includes the max/min policy, which will record the invocation with the maximum/minimum value for a launch argument, e.g., LIBOMPTARGET_RR_INSTANCE_POLICY=max(threads).
For Figure 3 we assumed the user set the instance record policy LIBOMPTARGET_RR_INSTANCE_POLICY=1, which has an implicit upper bound equal to the lower bound. Since only the first instance of each kernel is to be recorded, the filter check in step 4 prevents the recording of the second invocation of kernel A.

Replay
To replay previously recorded kernel invocations we provide a driver tool. It is available as llvm-omp-kernel-replay in LLVM 16.0.0 and newer that are built with OpenMP offloading support. The illustration in Figure 5 walks through the different replay stages. The main input of the driver 1 is the JSON file generated as part of the recording. Via command line options the user can change the target device, among all compatible ones, and request the final memory to be saved or compared against the recorded one. The number of threads and teams, hence the grid dimensions, can also be explicitly passed via command line options. If the values contradict the ones seen during recording, a warning is emitted. This ensures users are aware that the compile-time constant values might have been embedded into the program, which generally makes changes after the fact invalid. After the metadata in the JSON (ref. Figure 4) is read 2, the LLVM/OpenMP offloading runtime is initialized 3. During this step the memory pool is allocated, as described in Section 3.3. The replay driver then loads the state onto the chosen device 4. Before the kernel is launched, the driver initializes all global variables and loads the device image. If the recording was performed in conjunction with JIT compilation, the recorded image is an LLVM-IR file rather than a binary for the target device, e.g., a cubin file for NVIDIA, which needs to be JITed for the device first 5. Users can generally modify the LLVM-IR file as long as the externally visible interface, hence the name and types of external global symbols, is not disturbed. The JIT will run in replay mode as it would in a regular compilation, optimizing and transforming the image to the target binary format before it is loaded.

Replay Validation.
If the user stored the memory after a kernel launch as part of the recording, the replay tool can validate the replay run. For now, we only support bit-wise comparison of the memory after the recorded run with the memory after the replay run. While this often suffices to validate kernels, it can signal false negatives if the kernel is by nature non-deterministic or the application was interacting with the device concurrently from different threads. The final memory content can also be stored permanently in a file to allow more elaborate inspection.
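A minimal sketch of this bit-wise check is shown below; the buffer names and the assumption that both dumps are already resident on the host are illustrative, not the tool's actual interface:

#include <cstddef>
#include <cstring>

// Returns true if the replayed output is bit-wise identical to the recorded
// ground truth.  Non-deterministic kernels (e.g., atomics-based reductions)
// may legitimately differ and would be reported as a mismatch.
bool validate(const void *recorded, const void *replayed, size_t bytes) {
  return std::memcmp(recorded, replayed, bytes) == 0;
}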

Consistent Memory Placement
Memory placement is one of the most crucial parts of the record-and-replay infrastructure. It is important because we can support arbitrary (pointer) indirection without any analysis or effort if we can ensure that all device memory is placed at exactly the same virtual address it occupied during recording. Alternatives would require us to patch pointer addresses or drop support for indirection. While the former is in general impossible, since we cannot generally decide whether a memory element stores data or a pointer, the latter would severely limit the usefulness. To ensure the correct memory placement we perform target-specific memory allocation. The NVIDIA GPU driver API provides a Virtual Address (VA) management API [23]. The API allows users to map device addresses onto user-provided virtual addresses. Our bump allocator uses this API to allocate memory pools at the same virtual address in record and replay. We are working on integrating this allocation scheme into community LLVM; interested users can find it online [10].
The Heterogeneous System Architecture (HSA) AMD driver API, used by OpenMP, does not offer equivalent functionality. We employ a rigorous heuristic to support consistent memory placement across individual record and replay runs. HPC systems disable the operating system's (OS) page randomization to maximize performance and enforce a deterministic page allocation scheme. We take advantage of this determinism. Once we bring the page allocator to the same state in both the record and the replay run, the following allocations will be the same.
In the record run, just before loading the device executable to the device, we force the OS page allocator to a known state. Effectively, we read the process memory state from /proc/self/maps and remove any existing memory fragmentation within the VA of the application by explicitly mapping pages in the fragmented ranges. Finally, we map a large padding block of P GB at the next available virtual address. At the end of this process the state of the allocator is a contiguous, memory-mapped region that starts at a low address (LowVAStart) and extends to the high address (HighVAStart), with a padding of VAPad GB. We store this information in the JSON description file (Figure 4) and continue execution.
For AMD GPUs we allocate the entire GPU memory for the memory pool as illustrated in Figure 6. The effective start address of the pool is then adjusted to the same value during recording and replay, which ensures the entire memory is located in the same virtual address range. In detail, we start with an allocation request of N GB, which should be larger than the device memory. We then decrease the allocation size by Step until it succeeds. Assuming the device provided us with the entire memory, the allocation start is now limited to an interval of size Step. By aligning the memory pool start to 2×Step we can consequently ensure a consistent address. Crucially, this allocation scheme requires the device memory to be mapped consistently to the same virtual memory space.
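The following sketch captures the idea; try_alloc stands in for the underlying HSA allocation call and is a hypothetical helper, not the actual runtime API, and Step is assumed to be a power of two:

#include <cstddef>
#include <cstdint>

void *try_alloc(size_t bytes);  // hypothetical wrapper around the device allocation call

// Request more memory than the device has and shrink the request by Step
// until it succeeds; the pool then spans (nearly) the whole device and its
// start address is known up to an interval of width Step.  Aligning the
// usable pool start up to a multiple of 2*Step therefore yields the same
// address in both the record and the replay run.
void *reserve_aligned_pool(size_t n_bytes, size_t step) {
  void *base = nullptr;
  size_t size = n_bytes;
  while (size >= step && !(base = try_alloc(size)))
    size -= step;
  if (!base)
    return nullptr;
  uintptr_t aligned = (reinterpret_cast<uintptr_t>(base) + 2 * step - 1) &
                      ~static_cast<uintptr_t>(2 * step - 1);
  return reinterpret_cast<void *>(aligned);
}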
During replay, we follow a similar process. Briefly, we detect the current process memory state and initialize it to the memory state of the record run as defined by LowVAStart, HighVAStart, and VAPad. Using this information we memory-map anonymous pages until the range from LowVAStart to HighVAStart + VAPad is filled. This heuristic forces the host memory allocator into the same state observed during recording, after which we reserve memory for the bump allocator as previously discussed.

USE CASE: SCALABLE TUNING
The record-replay infrastructure provides an ideal tool for piecewise auto-tuning of GPU kernels. Typically, auto-tuning involves searching for a set of parameters to optimize some quantity of interest, for example execution time or power consumption. Potential parameter values create a combinatorial space with an intractable number of possible configurations, effectively making it infeasible to search all possible combinations.
We design a Bayesian Optimization framework that exploits record-replay to optimize the kernel execution configuration. The use case optimizes both compile-time parameters (launch bounds) and runtime parameters (number of threads and teams, aka grid size), and thus exemplifies many capabilities of record-replay.

Bayesian Optimization auto-tuning of kernel execution configurations
We define the search space S before describing our search algorithm. The search space is a hyper-rectangle over the kernel launch bound configuration (the maximum number of threads and the minimum number of teams) and the runtime kernel configuration parameters (the number of threads and the number of teams). Launch bounds impose constraints on the possible values of the runtime parameters. For example, if the launch bounds set the maximum number of threads to 256, then the number of threads at runtime must be less than or equal to 256.
To successfully describe our parameter space for BO, we model every parameter as a real number ranging from 0 to 1, representing the fraction of this parameter over the maximum allowed value. We introduce a transformation function T that takes the kernel execution parameters in this fractional representation and transforms them into valid integer configuration parameters by mapping them to the valid [min, max] range, using ceiling and flooring to convert them to integer values.
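A minimal sketch of such a transformation (the function name, parameter names, and the snapping to a warp multiple are illustrative assumptions, not the paper's exact definition of T):

#include <algorithm>
#include <cmath>

// Map a fractional parameter in [0, 1] to a valid integer configuration
// value in [lo, hi], optionally snapped down to a multiple of the warp /
// wavefront size (32 on NVIDIA, 64 on AMD).
int to_config(double fraction, int lo, int hi, int multiple = 1) {
  int value = static_cast<int>(std::floor(lo + fraction * (hi - lo)));
  value -= value % multiple;                 // snap to a warp/wave multiple
  return std::clamp(std::max(value, multiple), lo, hi);
}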
As the objective function for BO to maximize, we use the speedup of the configured kernel over the original device execution time of the kernel. Algorithm 7 summarizes the BO search algorithm enhanced with replay capabilities. The search function accepts as inputs the number of random samples to draw, the number of optimization steps to perform, the kernel description created by the recorded run, the kernel's trip count, and the original device execution time. Lines 7 and 11 sample a point in the fractional domain using a random sampler or the acquisition function, respectively. Lines 3 and 12 invoke the DoReplay function with the sampled point. The DoReplay function converts the sampled point (Line 2) into a set of discrete values using the transformation function T. Line 3 invokes the proposed replay tool, attaching the requested launch bounds to the respective vendor kernel by using IR attributes on AMD and IR metadata on NVIDIA.
In turn, OpenMP offload uses the JIT to lower the IR to the device image, and the replay tool launches the configured kernel with the requested number of threads and teams. Invoking the replay tool at Line 3 returns the execution time of the kernel. Line 4 computes the speedup of the current configuration, and Lines 8 and 13 concatenate previous samples with the latest one. Notably, BO operates on the continuous space defined by the fractional values, hence the sampling space is agnostic to the transformation. Finally, in Line 14 the algorithm updates the model and moves to the next iteration. After completing a user-defined number of iterations, the algorithm returns the maximum speedup observed during the search.

Our tuning driver receives instructions for building and executing the application so that it can build, execute, and record all requested kernels. After recording, we utilize the cluster job scheduler Flux [2] to create independent workers that auto-tune each kernel. The workers copy all required files onto the fastest file system, e.g., under memory-mapped /dev/shm, to reduce loading times during tuning.
To counteract noise in the measurement of a BO-sampled point, the worker invokes the replay tool multiple times. If the JIT is used, the worker instructs the OpenMP runtime to save the compiled binary file to disk so that subsequent invocations of the same sample can avoid JIT overheads. The worker collects the execution time of each experiment through vendor-provided tools (rocprof and nvprof) and computes the respective speedups in comparison to the baseline. Once all BO steps finish, the worker exits, signaling the end of the kernel's optimization.

OpenMP Kernel Tuning Extensions
OpenMP provides the means to set the number of teams (aka thread blocks) with the num_teams(<N>) clause on a target construct. Similarly, the thread_limit(<N>) clause provides an upper bound on the number of threads per block. In the unreleased OpenMP 6.0 version, the user will be able to provide a lower bound for the thread limit as well. However, kernel languages, such as CUDA and HIP, provide more kernel tuning control. While they allow for (or require) a strict thread count and team count during the kernel launch, they also provide ways to give resource allocation hints. In CUDA, this can be done via launch bounds, e.g., __launch_bounds__(256, 6), as well as ptxas arguments, e.g., maxrregcount=<N>. In HIP, the kernel can be annotated with attributes, including amdgpu_waves_per_eu(<min>[, <max>]) and amdgpu_flat_work_group_size(<min>, <max>).
For this work, we extended the OpenMP target region handling in LLVM/Clang to catch up with kernel languages. Our experimental clause, ompx_attribute(<attr>[, <attr>]*), is available in community LLVM/Clang but not in the 16.x.x releases. We opted to provide a generic interface similar to C/C++ attributes as the different (GPU) targets expose similar but not equivalent ways to provide information. While the OpenMP standards committee might find a way to unify the interface such that all target information can be computed from a small standardized set of user inputs, our implementation can serve as a comparison implementation that accepts the "native" kernel language information instead. As an example, we can limit the number of registers by specifying a maximum number of threads and a minimum number of teams we want to be resident on the GPU concurrently, as shown in Figure 8.
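A hedged sketch of what such a directive could look like follows; the exact attribute spelling accepted inside the clause, as well as the loop and variable names, are illustrative assumptions rather than a verbatim copy of Figure 8:

// Illustrative use of the experimental ompx_attribute clause: request at
// most 256 threads per team and at least 4 resident teams per compute
// unit, which bounds the registers the compiler allocates per thread.
#pragma omp target teams distribute parallel for \
    ompx_attribute(__attribute__((launch_bounds(256, 4))))
for (int i = 0; i < n; ++i)
  out[i] = in[i] * scale;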
LLVM/Clang will warn on unknown attributes and ignore all known ones that do not apply to the target architecture.

EVALUATION
We evaluate the proposed record-replay-based Bayesian Optimization (BO) on one mini-app (miniFE), a proxy app (LULESH [15]), and a full application (OpenMC [29]), and on two systems. The first system consists of 2 Power9 CPUs with 22 cores each, 256 GB main memory, and 4× NVIDIA Tesla V100 GPUs equipped with 16 GB device memory. The other has an AMD EPYC 7A53 CPU with 64 cores, 512 GB main memory, and 4× AMD Instinct MI250x GPUs, each with 128 GB device memory. Figure 9 lists the inputs used for each application and the time spent to build and execute it, with and without recording enabled. In miniFE and LULESH we record all GPU kernels. In OpenMC we utilize the filtering mechanism to only record kernels with multiple teams, effectively eliminating trivial initialization kernels. Moreover, we only store the kernel invocation with maximum parallelism, i.e., LIBOMPTARGET_RR_INSTANCE_POLICY=max(threads*teams). In total, we evaluate 33 individual kernels on both systems.

Evaluation Methodology and Scenarios
To better understand the performance benefits introduced by fine-tuning launch bounds and execution configuration, we introduce four optimization scenarios and respective search spaces:

MaxThreads (EM). In this scenario we only tune the launch bound restricting the maximum number of threads to be used at runtime. We always launch the kernel with this maximal number, though it is passed as a runtime parameter and cannot be folded into the application code. We explore only values that are multiples of the warp/wave size of the respective architecture, i.e., 32 for NVIDIA and 64 for the AMD GPU. Since 1024 is the hardware-defined upper bound, the search space is tractable. We employ record-replay and perform an exhaustive search over all possible configurations. Thus, for every recorded kernel we perform 32 individual experiments for NVIDIA and 16 experiments for AMD.

MaxThreads, MinTeams (EMM). In this scenario we tune both launch bound parameters. As before, kernels are launched with the maximum number of threads defined in the launch bounds, and those values are again multiples of the warp/wave size. The set of meaningful values for the minimum number of teams is impacted by the maximum number of threads but never larger than 32 for NVIDIA and 10 for AMD per SM. Given that the combinatorial space of these two parameters is again tractable, we run an exhaustive search for optimality. This scenario requires 216 and 103 experiments for NVIDIA and AMD, respectively.

NumTeams, NumThreads (BONN). In this scenario we employ BO to tune only parameters that can be adjusted at runtime without requiring re-compilation (via JIT) of the kernel. The total number of experiments required to explore this space can be huge and is kernel dependent, as it depends on the trip count of the worksharing loop associated with the kernel (LoopTripCount in Figure 4). For example, in miniFE the dot product kernel contains a worksharing loop with 8120601 (≈ 2^23) iterations. An exhaustive search would require 1029902 (≈ 2^20) experiments for NVIDIA.

MaxThreads, MinTeams, NumTeams, NumThreads (BOMN). In this scenario we explore both the launch bound parameters and the runtime execution parameters. As the search space is even larger than for BONN, we again employ BO to identify configurations that maximize kernel speedup.
For scenarios using BO, we use the Upper Confidence Bound (UCB) as the acquisition function and perform 30 random sampling iterations, followed by 200 BO iterations. We explore all four scenarios for the 33 kernels and report the speedup obtained by each scenario on the respective system. Each evaluation point uses the average execution time observed over 5 individual runs. Across all applications our evaluation totals 23364 configurations for the NVIDIA system and 19107 for the AMD system.

Auto-Tuning Performance Benefits
Figure 10 illustrates the maximum per-kernel speedup we observe for all applications and systems. For each kernel we use a differently colored marker and for each scenario we use a different marker shape. The legend is shown in the middle plot, Figure 10b.
BO configurations most often result in the largest observed performance benefits, as more diamond (BONN) and box (BOMN) markers are present in the outer areas of the scatter plots. Small differences between BONN and BOMN cannot be attributed to one being better than the other, as the initial random steps performed by BO may shift the search algorithms to similar speedups. On the other hand, limiting auto-tuning to launch bounds misses opportunities to tune the runtime parameters, i.e., the grid size, hence those scenarios often show lower performance.
Speedup results are not consistent across architectures. For example, the sparse matrix vector multiplication sparsemv in miniFE exhibits a speedup of 7.2× when tuned by BOMN on NVIDIA and is consequently located on the far right of Figure 10b. The same kernel achieves only a modest speedup of 1.2× on AMD and is thereby only the third most tunable kernel for this architecture. We used nvprof to identify the root cause of the higher performance on NVIDIA. Although the kernel is significantly faster, the occupancy of the tuned version drops from 74% to 15%, which confirms that occupancy alone can be misleading [38]. More importantly, in the tuned version, the L1 cache hit rate increased from 15% to 79%, hence reducing device memory requests and increasing memory throughput for loads and stores from 145 GB/s to 1050 GB/s and from 2 GB/s to 15 GB/s, respectively.

End-to-end performance benefits.
We use the per-kernel configuration options that result in the highest speedup and embed them into the full application. Figure 11 visualizes the end-to-end time of the optimized and the unmodified application, breaking down execution time into host compute, device compute, and memory transfer times. In all applications, the speedup from tuning follows Amdahl's law, since tuning optimizes device computation but not the memory transfers that are also part of the execution time. Labels show end-to-end speedup (in black) or the speedup of device compute (in green), which is the direct target of tuning. Regardless, tuned parameters embedded in the application translate to performance benefits for every application and on all platforms.
For device execution alone, the largest speedup is found for miniFE on NVIDIA. The observed 5.75× device compute speedup is consistent with the results in isolation. Specifically, 82% of the time of the conjugate gradient computation on the GPU is spent in the sparsemv kernel, which has a 7.2× speedup, while the remaining 18% of the time is spent in the dot and waxpby kernels, which have lower speedups.
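As a quick Amdahl-style check, these fractions are consistent: with 82% of device time accelerated by 7.2× and the remaining 18% by a factor s, the overall device speedup is 1/(0.82/7.2 + 0.18/s); the measured 5.75× corresponds to s ≈ 3× for the dot and waxpby kernels.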
In LULESH we observe large gains even when comparing end-to-end execution time. Given the per-kernel speedups, one would expect larger gains on the NVIDIA system, but instead the optimized LULESH execution time on AMD is 1.53× faster while it improves by a factor of 1.32× on NVIDIA. The kernel with the highest speedup on NVIDIA takes only 2% of the entire device execution time. In contrast, the kernel with the highest speedup (4.32×) on AMD takes 27% of the total device execution time. Overall though, auto-tuning all kernels in LULESH results in significant gains.
The optimized version of OpenMC presents the smallest gains on both systems, which is due to the recording restriction of selecting only the kernel invocation with the largest grid size. However, hotspot kernels are executed often and with varying trip counts for the associated worksharing loops. Even though there is a large speedup in the recorded kernel invocations, the speedup does not equally manifest across all other invocations. That said, our restrictive setup still generated an overall speedup of 1.08× on NVIDIA and 1.03× on AMD for the entire application.
We believe these end-to-end performance benefits justify an OpenMP extension, e.g., the one proposed in Section 4.2, as otherwise users cannot easily embed tuned launch bounds into OpenMP applications.

Record-Replay Auto-Tuning Efficiency
The proposed record-replay mechanism effectively reduces the amount of work required to auto-tune an application, as it dissects GPU kernels into independent executables. While the device compute time is dependent on the original application and input, the replay tool itself adds overheads while setting up the environment. To quantify those overheads we measure the replay steps, discussed in Section 3.2, and visualize them in Figure 12. As the times are kernel dependent, we show the overall fastest and slowest results. In all applications, a large portion of the time is spent reading the memory file because it contains all data used by the application, regardless of the kernel's usage. Smarter checkpointing could certainly improve these results. The JIT optimization and device binary generation is another source of significant overhead. Compile time improvements and pre-optimization of the kernel code could reduce these in the future.
Despite these inefficiencies in our prototype, there is still a large benefit when comparing the runtime of the most expensive replay with a run of the entire application. For example, the slowest kernel replay for LULESH on AMD requires 6.1 seconds to run, whereas the respective application run needs 173 seconds. One auto-tuning step for this kernel alone is consequently 28× faster with replay.
Furthermore, in Figure 13 we visualize the probability that a kernel has converged to its best observed solution after a given number of tuning steps (x-axis). For example, Figure 13b shows that during the first 48 iterations of the BONN scenario none of the kernels have identified their respective best parameter selection. However, after 50 iterations of the same scenario, 25% of the kernels have found their optimal configuration. We configured our BO to perform 30 random, initial samples for bootstrapping the Gaussian Process model before sampling parameter points selected by the acquisition function. Consequently, the likelihood of finding optimal configurations in the first 30 iterations is expected to be low. That said, we can see that none of the lines plateau early, indicating that later iterations contribute reasonably to the overall performance benefits.

End-to-end optimization time.
To showcase the improved scaling potential of the record-replay-based tuning, we analytically compute the time required to tune the applications with and without it. For the latter case, we adopt an Oracle scenario in which a single BO iteration simultaneously tunes all kernels within the application. The amount of time required by the Oracle is described by Equation (1). It accounts for the number of search iterations, N, including the initial random samples; the build time of the respective application, T_build, which applies to BOMN and is 0 for BONN as all tuned parameters can be configured without modifying the code; the number of experiments, R, required to obtain a statistically stable measurement; and T_exec, the execution time of the optimized application. Thus, we conservatively assume the Oracle search algorithm identifies the optimal solution in the first iteration and all further executions have the optimal time as a timeout.

Figure 14 compares the Oracle against record-replay results assuming parallel evaluation of all kernels. To estimate best- and worst-case scenarios for record-replay we again use Equation (1): T_exec is set to the fastest and slowest observed replay time (ref. Figure 12) and T_build equals the time required to JIT the image.
Record-replay benefits range from 5.5× to 102× across all benchmarks and scenarios. OpenMC, the largest application, shows the smallest benefits because the replay time is dominated by the time it takes to read the memory file and JIT the kernel. As both overheads are expected to decrease in the future, we can expect larger benefits over time. Even today, record-replay allows tuning to be performed with far fewer resources and decoupled from the application dependencies. On the other side of the spectrum, LULESH shows the largest gains. The application is long running and the replay only needs to load a small checkpoint file, while the JIT time is less than a second on NVIDIA and a couple of seconds on AMD.

RELATED WORK
Record-and-replay. Record-and-replay techniques provide the ability to record executions of applications and re-execute them identically. These techniques find use in the contexts of debugging, reproducibility, and fault tolerance, especially in the presence of non-deterministic factors. In HPC, a number of techniques have been developed to record and replay MPI applications [21,43,45]. Recently, ReMPI [31] proposed a record-and-replay approach for non-deterministic MPI+OpenMP applications. Replaying the traces recorded by ReMPI has been used to debug large-scale programs that exhibit non-deterministic behavior. Similar functionality is provided by the commercial TotalView debugger [24], which saves state information as the program executes in record mode; in replay mode, one can move to any previously executed statement, at which point it displays its saved state information. These works use techniques similar to ours; however, we focus on a portable mechanism to auto-tune large GPU application codes in isolation using record-replay.

Figure 13: BO search: probability of the best configuration being found at a given iteration, across kernels on both systems.

CERE [7,25,44] extracts application hotspots as isolated fragments of code, called codelets. Codelets can be modified, compiled, run, and measured independently from the original application. CERE isolates code at the compiler IR level and can be applied to C, C++, and Fortran applications. CERE uses the OS memory page protection mechanism to track the application memory state; such mechanisms are not applicable to current GPU architectures. Moreover, our approach does not require the IR of the application and can record binary images as well. Thus, our recording works on existing applications without the need to recompile them.
Tuning GPU codes. GPU auto-tuning is a widely used empirical approach to discover optimal values for GPU executions. Grauer-Gray et al. [12] apply optimizations to GPU code using a high-level directive-based language and a source-to-source compiler to generate CUDA and OpenCL code. Their approach auto-tunes high-level languages by focusing on loop permutations, unrolling, and tiling. There is a large corpus of work providing auto-tuning frameworks focusing on different performance bottlenecks, languages, and application characteristics. MATOG [39] abstracts array memory accesses in CUDA kernels and optimizes the code according to the used GPUs. Some auto-tuning frameworks, such as Kernel Tuner [37], OpenTuner [3], ActiveHarmony [33], Kernel Launcher [13], and GEIST [34], provide general-purpose interfaces for building customized, possibly multi-objective program auto-tuners, deploying a number of different search techniques. CLTune [22] is a domain-specific auto-tuner for OpenCL kernels, supporting a user-defined search space over a number of compile-time and runtime parameters. Those frameworks are unable to analyze large applications due to their execution time and would benefit from our record-and-replay functionality for piece-wise optimization.

Figure 14: Time required to optimize a kernel using the whole application (Oracle) and our approach (RR). We consider two cases: in the first, all experiments but the last are performed with the slowest replay time we observed during our experimentation (RR slowest), whereas RR fastest considers the fastest experiment we observed during our experimentation.
A number of papers research domain-specific tuning using code generation, alternate data layouts, or algorithmic parameters, such as [1,9,11,16,20] for linear algebra kernels and [18,27,28,46] for stencils. Our work tunes GPU kernel execution and compilation parameters that are generally applicable to any kernel.
Apollo [4] proposes to auto-tune highly dynamic RAJA applications using pre-trained, reusable models to tune input-dependent code at runtime. Artemis [42], an extension of Apollo, enables runtime tuning using online data collection and model training. The above auto-tuning approaches focus on tuning entire applications, whereas our method can tune specific kernels in isolation.
Tuning Kernel Launch Parameters. The authors of [6] use machine learning to automatically select profitable block sizes. A multi-objective deep neural network was used in [26] to learn a function that maps input graph characteristics and runtime program behavior to a set of launch bound parameters. Snowpack [32] uses static features in a machine learning framework to choose the optimal block size parameter. It does this without needing to execute the kernel multiple times, which is needed in auto-tuning approaches. Our work can accelerate the data gathering steps required to train these models by recording application kernels and replaying them with different configurations.
Bayesian Optimization Tuning. The authors of [41] apply Bayesian Optimization (BO) to finding optimal parameter configurations for tunable GPU kernels. While the application of BO to this problem is challenging, the paper demonstrates how to deal with rough, discrete, constrained search spaces containing invalid configurations. The authors introduce ideas such as a variance exploration factor and new acquisition functions to improve scalability. Previous work has also used BO to optimize parameters in HPC applications [19] (not necessarily GPU kernel parameters); this approach uses a BO-based configuration selection framework to identify application- and platform-level parameters that result in high-performing configurations. GPTune [17] proposes extending BO with multitask learning to speed up the search through parallelization. Prior work [40] proposes special customizations of BO search strategies, specifically designed for tuning GPU kernels, to find better optimizing parameters faster. Those methods can complement our approach by additionally speeding up the search for optimizing parameters on top of our methods for selective recording and replaying.

CONCLUSIONS AND FUTURE WORK
This paper presents a first-of-its-kind Record-Replay (RR) mechanism for OpenMP offload kernels. At recording time, RR extracts kernel code from applications and creates memory checkpoints to capture the execution state. Extracted kernels can be replayed to study their performance in isolation. We present a use case of piecewise auto-tuning of kernel execution using Bayesian optimization to search the combined parameter space of runtime kernel execution parameters, such as the number of teams and threads, and compile-time parameters, such as launch bounds, that affect code generation. RR made it possible to explore this large space using HPC proxies and a full-scale scientific application on two GPU platforms, NVIDIA V100 and AMD MI250x, by accelerating the execution of the Bayesian search by up to 102× compared to executing the whole application. Results reveal that tuning those parameters yields up to 1.53× end-to-end speedup on the LULESH proxy application, while auto-tuning accelerates numerous individual kernels by several factors across all applications and GPU platforms.
For future work, we plan to investigate more use-cases facilitated by RR, such as tuning the compiler optimization pipeline, automatic benchmark generation, and automated testing and debugging.

ARTIFACT IDENTIFICATION
The main contributions of this article are:
• A method for performing recording and replaying of individual OpenMP offload kernels,
• The implementation of the method in LLVM,
• A use case of recording and replaying to significantly accelerate auto-tuning of the number of threads, teams, and launch bound parameters for GPU kernels, with novel insight into the possible speedup of those kernels.
The largest part of our record-replay implementation is available in the upstream, community version of LLVM since the recently released version 16, including extensions to the OpenMP target runtime to record kernel images and data, and the llvm-omp-kernel-replay tool to replay the execution of a kernel. A few remaining parts (fault-proofing memory checkpointing, bug fixes for scaling, performance improvements) are available through a fork of the LLVM repo on GitHub and are under review for inclusion upstream. For auto-tuning, we have developed a Python script driver that uses the LLVM record-replay functionality to automatically record kernels, re-compile them if necessary, and replay them using Bayesian or exhaustive search to find values for execution parameters (number of threads, number of teams, launch bounds) that minimize execution time. Data collected from experimentation are stored in JSON format and we provide Python scripts for data extraction and plotting results (using the pandas and matplotlib Python packages). All Python scripts are available in a GitHub repo set up for artifact evaluation of this article. We use three benchmark applications that include two HPC proxies (LULESH, miniFE) and a full-scale scientific application (OpenMC). Extensions to the benchmark applications include a Python descriptor used by the auto-tuning script to compile and execute the application. Those extensions are available in the GitHub repo for artifact evaluation. Code modifications to the applications for integrating the tuned parameter values per kernel found by auto-tuning are also available in the GitHub repo for artifact evaluation.
The fork of the LLVM repo with our extensions, or the community version if those extensions are upstreamed by artifact evaluation time, the provided Python scripts for auto-tuning, and benchmark codes with extensions are the freely available computational artifacts for reproducing our experimentation.

REPRODUCIBILITY OF EXPERIMENTS
We experiment on two machines:
• A host with 2× Power9 CPUs, each having 22 cores, and 256 GB of memory, including an NVIDIA Tesla V100 GPU with 16 GB device memory,
• A host with an AMD EPYC 7A53 CPU with 64 cores and 512 GB main memory, including an AMD Instinct MI250x GPU with 128 GB device memory.
Figure 9 in the article details the used benchmark applications and their inputs. The experimentation workflow is:
(1) Execute the application once in recording mode to store images and device memory per kernel.
(2) Select a new configuration of execution parameters for each kernel (using exhaustive or Bayesian search).
(3) Execute, using replay, each kernel with this configuration and measure execution time.
(4) Repeat steps 2-3 until all configurations are explored exhaustively or until the Bayesian search cutoff (200 steps in our implementation).
Also, Figure 9 shows the execution time per application when recording is enabled, which happens one time to collect kernel images and device memory. Each kernel can be tuned independently. Figure 14 details the expected fastest and slowest tuning times across kernels for each application and experimentation machine as a fraction of the time if executing the whole application instead of replaying individual kernel recordings. In more detail:
• For LULESH, on NVIDIA the slowest kernel tuning time is approx. 1 hour and the fastest is 45 minutes; on AMD the slowest kernel tuning time is approx. 1.5 hours and the fastest is 1 hour.
• For miniFE, on NVIDIA the slowest and fastest kernel tuning times are similar, approx. 45 minutes; on AMD they are also similar, approx. 70 minutes.
• For OpenMC, on NVIDIA the slowest kernel tuning time is 9 hours and the fastest is 5.5 hours; on AMD the slowest is approx. 5 hours and the fastest is 3 hours.
If each kernel is tuned in parallel, the expected execution time of the workflow equals the slowest tuning time shown. If kernels are tuned sequentially, the expected execution time of the workflow equals the sum of tuning times per kernel, which as a worst-case estimate is the number of kernels of the application multiplied by the slowest tuning time.
Our computational artifacts (the LLVM compiler with record-replay functionality, Python scripts, and benchmark codes with extensions) fully automate the experimentation, including building the applications, recording kernels, replaying kernels with different execution parameters guided by Bayesian or exhaustive search, and collecting performance data. Also, we provide Python scripts for data extraction and plotting to reproduce the visualization results in the same format as shown in the plots of the article.

ARTIFACT DEPENDENCIES REQUIREMENTS
The README under the released software provides instructions on how to install and run all benchmarks and their respective libraries etc. (https://github.com/koparasy/openmprr/blob/sc23/README.md).

Figure 3 :
Figure 3: High-level overview of the recording stage. Application execution, shown in the lowest row, is host execution interleaved with GPU kernel launches. The middle row illustrates the existing offloading runtime which communicates with the respective driver (CUDA, AMD HSA, ...) to move memory and launch kernels. The different interactions with the new record-and-replay layer (top row) are marked with circled numbers and explained in Section 3.1.
memory at which the kernel arguments reside
DevMemStart: starting address of the recorded device memory
DevMemSize: size of the recorded device memory
DeviceId: identification of the device (mapped to OpenMP device numbering)
TripCount: number of iterations of the loops associated with the (combined) target directive, or 0
NumTeams: value of the num_teams clause, or 0
ThreadLimit: value of the thread_limit clause, or 0
MemFile: file containing the initial memory
ValidationFile: file containing the final memory
GlobalsFile: file containing global values
ImageFile: file containing the device image
LowVAStart: low address of the memory-mapped host segment
HighVAStart: high address of the memory-mapped host segment
VAPad: padding of the memory-mapped host segment

Figure 4 :
Figure 4: The keys of the JSON file created during recording and their meaning.

Figure 5:
Figure 5: High-level overview of the replay stage. The user-provided recording is shown on the left and passed to the replay tool distributed as part of LLVM, together with the LLVM JIT and the LLVM/OpenMP offload runtime. The device id, either passed to the tool or placed in the recording.json, allows different devices in a system to be utilized concurrently.

Figure 6 :
Figure 6: AMD GPU allocation scheme in which all device memory is allocated and the pool start is aligned for consistency.

Figure 8 :
Figure 8: Proposed OpenMP extensions enabling the developer to encode vendor specific attributes on offload regions.

Figure 9 :
Figure 9: Benchmarks used to evaluate the record-replay mechanism and their respective characteristics.

Figure 10 :
Figure 10: The figure contrasts the maximum speedup obtained for AMD executions (y-axis) and NVIDIA executions (x-axis). Different kernels are depicted with different colors and markers correspond to the auto-tuning scenarios.

Figure 11 :
Figure 11: Execution time breakdown per application before and after tuning on both systems.

Figure 12 :
Figure 12: Time breakdown of replay operations.