HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU

The end of Dennard scaling and the slowdown of Moore's law led to a shift in technology trends toward parallel architectures, particularly in HPC systems. To continue providing performance benefits, HPC should embrace Approximate Computing (AC), which trades application quality loss for improved performance. However, existing AC techniques have not been extensively applied and evaluated in state-of-the-art hardware architectures such as GPUs, the primary execution vehicle for HPC applications today. This paper presents HPAC-Offload, a pragma-based programming model that extends OpenMP offload applications to support AC techniques, allowing portable approximations across different GPU architectures. We conduct a comprehensive performance analysis of HPAC-Offload across GPU-accelerated HPC applications, revealing that AC techniques can significantly accelerate HPC applications (1.64x LULESH on AMD, 1.57x NVIDIA) with minimal quality loss (0.1%). Our analysis offers deep insights into the performance of GPU-based AC that guide the future development of AC algorithms and systems for these architectures.


INTRODUCTION
As Dennard scaling, which stipulated a steady rise in processor clock speed through transistor shrinkage, came to an end, and Moore's law, which predicted a doubling of CMOS transistors on a microchip every two years, slowed down, technology trends shifted toward parallel architectures. Parallel architectures focused on multi-core CPUs in the early 2000s, while the emergence of GPGPU paradigms pivoted technology trends to many-core accelerator systems. This trend is evident in the Top500 list [2]: as of November 2022, 7 of the 10 fastest supercomputers use GPUs. Despite the success of many-core architectures in overcoming the slowdown of Moore's law [20], HPC requires another paradigm shift to continue delivering performance improvements.
Approximate Computing (AC) has emerged as an attractive new paradigm that increases performance by introducing novel approximations within applications, controllably reducing the application's accuracy. Both hardware and software AC techniques have been proposed. Specifically, [12,40] introduce approximate CPUs, [10] proposes approximate memories, while [11,16] discuss approximate accelerators. Software techniques include loop perforation [21,50], which accelerates image processing workloads by up to 3× with less than 10% accuracy loss. Input [35] and output [51] approximate memoization have been used in various domains, such as stencil computations, finance, and image processing, doubling application performance with small error. Other techniques, such as variable precision, can increase performance by 45%. HPAC [38] provides a state-of-the-art compiler and runtime implementation to apply software AC techniques on multi-core CPUs using OpenMP.
These works extensively showcase the potential of AC in various CPU applications. However, little research assesses approximate computing on GPUs, which currently dominate HPC supercomputers. It is imperative to assess whether AC is a viable execution paradigm for next-generation software: any paradigm that cannot apply to many-core architectures will likely have limited impact. Consequently, a comprehensive study applying AC to GPU-enabled applications is essential to fully gauge the potential and challenges imposed by approximations in modern GPUs.
To address this problem, this work studies state-of-the-art software approximate computing techniques applied to HPC GPU-enabled scientific applications. We present hpac-offload, an extension of HPAC [38] that supports approximations in GPU applications. The proposed extensions seamlessly compose with the portable OpenMP offload programming model and consist of easy-to-use annotations on OpenMP offload applications. The result is an approximate computing framework that enables portable approximations across different GPU architectures, such as NVIDIA and AMD. The composition of approximation and GPU parallel execution results in several challenges due to the execution model of GPU devices. Porting AC techniques to GPUs without considering their unique architectural characteristics results in significant slowdowns.
For example, approximate computing techniques for CPU parallelism typically duplicate the AC state on each CPU thread; however, the massive parallelism of GPUs that use millions of software threads makes this approach impractical by depleting the device's memory. Additionally, CPU-AC allows each parallel thread to independently decide whether to approximate without observable overhead. In contrast, independent thread decision-making in GPUs can introduce thread divergence and reduce performance, limiting the expected performance boosts of approximation.
As such, programming models for GPU-AC must match the hierarchical nature of the underlying execution model. hpac-offload identifies such challenges and proposes programming model extensions to support high-performance implementations of several state-of-the-art approximation techniques: input/output memoization and loop perforation.
This paper makes the following contributions:
• hpac-offload, a programming model for composing state-of-the-art AC techniques (input/output memoization, loop perforation) with OpenMP offload. Our pragma-based programming model equips developers with AC techniques using familiar idioms. hpac-offload reflects the hierarchical programming model of modern GPUs, enabling users to apply AC with little effort.
• New GPU-centric algorithms for approximate computing techniques that leverage the architectural features of modern GPUs. Our algorithms offer hierarchical control of approximation, coupling flexibility with high performance.
• An implementation of hpac-offload in Clang/LLVM. hpac-offload leverages the portable OpenMP offload specification and extends it with a portable layer to apply approximations to GPU-enabled applications.
• Evaluation on a comprehensive suite of proxy apps, mini-apps, and benchmarks from HPC and machine learning on AMD and NVIDIA GPUs. We demonstrate that hpac-offload achieves up to 6.9× speedup across all applications (geomean speedup 1.42×) while introducing less than 10% error.

BACKGROUND
This section provides background information on GPU architectures and their execution model, the OpenMP offload programming model, and the HPAC programming model. We use CUDA [36] terminology; other GPU vendors have similar features. In the GPU-accelerated execution model, time-consuming tasks are offloaded as kernels to the GPU through kernel offloading. The host transfers data between host and device memory (HtoD and DtoH) and invokes the kernel.

OpenMP Offload
OpenMP offload is a set of pragma directives first introduced in OpenMP 4.0 that allows offloading execution to heterogeneous devices such as GPUs. The OpenMP offload model uses the target directive to define offloading for a code region (target region) to a device, along with a data mapping. The mapping uses modifiers like to, from, and tofrom to specify data directionality. The target directive executes the region on a single device thread.
The teams directive forms a league of teams, with each team's main thread executing the region. The parallel directive allows all threads in a team to execute the region. In CUDA, a team represents a thread block, and the parallel directive assigns a kernel to all threads in the block. The #pragma omp target teams parallel construct, combined with work-sharing directives (distribute and for), decomposes parallel loops across teams and threads.
[Figure 1: An OpenMP offload example. Recovered lines 1-3 of the listing: (1) #pragma omp declare target; (2) double foo(double input) { ... }; // An expensive function; (3) #pragma omp end declare target.]
Figure 1 shows an OpenMP offload example. The algorithm applies the foo function to all elements of an input vector, storing the result in an output vector. The developer declares the foo function as a device function using the omp declare target directive (lines 1, 3), directing the compiler to include it in the device binary. The developer also implements a parallel loop for device execution using the pragma omp target teams distribute parallel for (line 7). The teams distribute directive shares iterations across thread blocks, while the parallel directive divides iterations among the threads of each block. The map clause (line 8) copies the input vector to the device and the output vector from the device.
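The pattern described for Figure 1 can be sketched as follows. This is a minimal illustration rather than the paper's listing: the body of foo and the name apply_foo are placeholders, and the line numbers cited in the text refer to the original figure, not to this sketch. Without OpenMP support the pragmas are ignored and the loop runs on the host.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

#pragma omp declare target
// Stand-in for the expensive function; the real body is not given in the text.
double foo(double input) { return std::sqrt(input) + 1.0; }
#pragma omp end declare target

// Applies foo to every element of input and stores the results in output.
// apply_foo is a hypothetical name used only for this sketch.
void apply_foo(const double *input, double *output, std::size_t N) {
  assert(input != nullptr && output != nullptr);
  // teams distribute: split iterations across thread blocks.
  // parallel for: split each block's share across its threads.
  // map(to)/map(from): copy input to the device and output back to the host.
#pragma omp target teams distribute parallel for \
    map(to: input[0:N]) map(from: output[0:N])
  for (std::size_t i = 0; i < N; ++i)
    output[i] = foo(input[i]);
}
```

If no offload device is available, OpenMP implementations typically fall back to host execution, so the function still produces the same results.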

State-of-the-art of AC
Many approximate computing techniques are specific to one domain, including mixed-precision [7], Newton's method [13], and the Barnes-Hut algorithm [4]. These techniques enhance performance but require domain expertise and familiarity with the application code. This work, however, concentrates on general-purpose AC techniques applicable to multiple domains. We examine three state-of-the-art software-based methods: loop perforation [33,34], input memoization (iACT [35]), and output memoization (TAF [51]). After briefly presenting these techniques, we describe their adaptation to GPU architectures and evaluate their impact on accuracy and performance in scientific applications.
Output Memoization (TAF) leverages temporal function locality: recent past outputs resemble upcoming ones. Instead of computing upcoming evaluations, TAF returns the most recent evaluation's output. TAF caches code region outputs in a sliding window of history size (hSize) output values, calculating the relative standard deviation (RSD) of the window. When the RSD value is below a user-defined threshold, TAF enters a stable regime for the next prediction size (pSize) invocations (hSize and pSize are user-specified), returning the last accurately-computed output. TAF employs a state machine at runtime to track the current sliding window's outputs and determine if it is in a stable regime and should approximate.
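The TAF state machine described above can be sketched for a single thread as follows. This is a hedged model, not the HPAC runtime: the class name TafSketch is hypothetical, RSD is taken as the population standard deviation divided by the absolute mean (a ratio, not a percentage), and the history window is assumed to be rebuilt from scratch after each stable regime.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical single-thread model of TAF output memoization.
class TafSketch {
 public:
  TafSketch(std::size_t hSize, std::size_t pSize, double rsdThreshold)
      : hSize_(hSize), pSize_(pSize), threshold_(rsdThreshold) {
    assert(hSize > 0);
  }

  // Returns the region's output, either computed accurately or replayed.
  double invoke(const std::function<double()> &accurate) {
    if (remaining_ > 0) {  // stable regime: return the last accurate output
      --remaining_;
      ++approximated_;
      return last_;
    }
    last_ = accurate();  // accurate path: run the region and record its output
    window_.push_back(last_);
    if (window_.size() > hSize_) window_.erase(window_.begin());
    if (window_.size() == hSize_ && rsd() < threshold_) {
      remaining_ = pSize_;  // approximate the next pSize invocations
      window_.clear();      // assumption: history is refilled afterwards
    }
    return last_;
  }

  std::size_t approximated() const { return approximated_; }

 private:
  // Relative standard deviation of the current window (assumed definition).
  double rsd() const {
    double mean = 0.0;
    for (double v : window_) mean += v;
    mean /= static_cast<double>(window_.size());
    double var = 0.0;
    for (double v : window_) var += (v - mean) * (v - mean);
    double sd = std::sqrt(var / static_cast<double>(window_.size()));
    if (mean == 0.0) return sd == 0.0 ? 0.0 : HUGE_VAL;
    return sd / std::fabs(mean);
  }

  std::size_t hSize_, pSize_;
  double threshold_;
  std::vector<double> window_;
  std::size_t remaining_ = 0, approximated_ = 0;
  double last_ = 0.0;
};
```

With hSize = 3 and pSize = 5, a region that keeps producing the same value is computed three times and then replayed for the next five invocations.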
Approximate input memoization (iACT) extends traditional memoization (caching). iACT caches the inputs and outputs from code region evaluations. For every new evaluation, the technique computes the distance of the inputs in the cache with the inputs of the current evaluation. When the distance is below some user-defined threshold, input memoization returns the closest previously computed value, skipping the computation of the code region.
[Figure 2: HPAC annotations composed with a CPU OpenMP parallel loop. Recovered listing fragment (truncated): double foo(double input) { ... }; // An expensive function; void Hfoo(double *input, double *output, size_t N) { #pragma approx perfo(small:4); #pragma omp parallel for; for(size_t i = 0; i < N; ++i); #pragma approx memo(in: 10 : 0 ...]
Loop perforation is an approximation technique that drops the computations of user-specified iterations. Typically, the user describes which iterations to drop through some pattern. For example, the user can skip one of every n iterations (small perforation) or execute one of every n iterations (large perforation). Other perforation types include ini and fini, which drop a user-defined fraction of the first or last loop iterations.
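The four perforation patterns can be expressed as a predicate over the iteration index. The sketch below is illustrative only: the names Perfo, executes, and executed_count are hypothetical, and which iteration within each group of n is skipped (here the last one) or executed (here the first one) is an assumption.

```cpp
#include <cassert>
#include <cstddef>

enum class Perfo { Small, Large, Ini, Fini };

// True when 0-based iteration i of a loop with `total` iterations is executed.
// For Small/Large, `param` is the period n; for Ini/Fini it is the dropped
// fraction in [0, 1].
bool executes(Perfo kind, std::size_t i, std::size_t total, double param) {
  assert(i < total);
  switch (kind) {
    case Perfo::Small: {  // skip one of every n iterations
      std::size_t n = static_cast<std::size_t>(param);
      assert(n > 0);
      return (i + 1) % n != 0;
    }
    case Perfo::Large: {  // execute one of every n iterations
      std::size_t n = static_cast<std::size_t>(param);
      assert(n > 0);
      return i % n == 0;
    }
    case Perfo::Ini: {  // drop a fraction of the first iterations
      std::size_t dropped = static_cast<std::size_t>(param * total);
      return i >= dropped;
    }
    case Perfo::Fini: {  // drop a fraction of the last iterations
      std::size_t dropped = static_cast<std::size_t>(param * total);
      return i < total - dropped;
    }
  }
  return true;
}

// Counts how many of `total` iterations survive perforation.
std::size_t executed_count(Perfo kind, std::size_t total, double param) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < total; ++i)
    if (executes(kind, i, total, param)) ++count;
  return count;
}
```

For a 16-iteration loop, small perforation with n = 4 keeps 12 iterations, while large perforation with n = 4 keeps only 4.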
The HPAC programming model [38] enables pragma-based approximate computing, integrating OpenMP parallelism with approximate computing techniques such as input/output memoization and loop perforation. Developers can use HPAC's toolchain to identify code regions amenable to approximate computing and evaluate accuracy/performance trade-offs.
HPAC example. Figure 2 shows an example composing the HPAC programming model with CPU OpenMP. The developer applies small perforation to the parallel for to skip the computation of every fourth iteration in every parallel thread (line 4). The developer applies input memoization for the remaining executed iterations using a cache table of size 10 (line 7). Memoization is activated when the Euclidean distance between the current input (input) and a cache entry is less than 0.5f. If such an entry exists, the memoization cache returns the output value associated with this entry instead of calling the foo function. The cache size and the distance threshold are parameters to the memo clause.
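The input memoization behavior in this example (a fixed-size cache, a Euclidean distance test, and a fallback to the accurate function) can be sketched as below. The class IactSketch is hypothetical and models a single thread; the round-robin eviction of full tables is an assumption borrowed from the replacement policy mentioned later for the GPU runtime.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical single-thread model of approximate input memoization.
class IactSketch {
 public:
  using Input = std::vector<double>;

  IactSketch(std::size_t tableSize, double threshold)
      : tableSize_(tableSize), threshold_(threshold) {
    assert(tableSize > 0);
  }

  double invoke(const Input &in,
                const std::function<double(const Input &)> &accurate) {
    // Search for the closest cached input that lies within the threshold.
    const Entry *closest = nullptr;
    double closestDist = threshold_;
    for (const Entry &e : table_) {
      double d = distance(e.in, in);
      if (d < closestDist) {
        closest = &e;
        closestDist = d;
      }
    }
    if (closest != nullptr) {  // approximate path: reuse the cached output
      ++hits_;
      return closest->out;
    }
    double out = accurate(in);  // accurate path: compute and cache
    if (table_.size() < tableSize_) {
      table_.push_back(Entry{in, out});
    } else {  // assumption: round-robin replacement when the table is full
      table_[next_] = Entry{in, out};
      next_ = (next_ + 1) % tableSize_;
    }
    return out;
  }

  std::size_t hits() const { return hits_; }

 private:
  struct Entry {
    Input in;
    double out;
  };

  static double distance(const Input &a, const Input &b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k)
      sum += (a[k] - b[k]) * (a[k] - b[k]);
    return std::sqrt(sum);
  }

  std::size_t tableSize_;
  double threshold_;
  std::vector<Entry> table_;
  std::size_t next_ = 0, hits_ = 0;
};
```

With a threshold of 0.5, an input of 1.2 reuses the output cached for 1.0, whereas an input of 3.0 misses and triggers the accurate computation.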
Design of HPAC. The HPAC pragmas support approximate computing by automatically creating a second approximate execution path that co-exists with the original accurate execution path during compilation and execution. Developers specify the desired approximation technique through HPAC clauses, and HPAC creates the approximate execution path with the corresponding technique's implementation. At execution time, the HPAC runtime library decides whether to execute the accurate or approximate execution path. For each supported approximation technique, HPAC provides a parametric activation function that dynamically determines which execution path to follow. If the accurate path is chosen, the code region is executed, and inputs/outputs are captured if necessary. When the approximate path is selected, the AC technique approximates the output.
The HPAC execution harness exhaustively explores the space of user-provided approximation techniques (e.g., iACT and TAF) and parameters (such as error threshold). The technique and parameters are first applied to the program by the harness, which builds and executes the program. After executing the approximated program, the harness calculates and saves runtime information and error to a database. Using these data, the user can decide how approximation fits her application needs.

DESIGN AND IMPLEMENTATION
This section details the design and implementation of hpac-offload. While we focus on the AC techniques introduced by the original HPAC work, our implementation is flexible and extensible to accommodate future AC techniques.
GPUs combine large numbers of threads with rapid context switching between them to hide execution and memory latencies. The deep memory hierarchy, memory coalescing, and SIMD execution model reduce memory latency and enable massive parallelism. An AC framework oblivious to these characteristics will not improve performance.

GPU Aware AC Design
3.1.1 AC GPU memory design. Memory-bound GPU applications present a significant challenge for GPU AC. In memory-bound applications, memory bandwidth determines performance. Performance will be harmed by AC techniques that cache results in memory to avoid computation but increase memory traffic.
The resources dedicated to each parallel thread in GPU execution are far more constrained than the respective CPU resources. AC techniques unaware of such limitations will starve the system of resources. For example, in approximate memoization, each thread replicates internal AC data structures (e.g., the memoization table). Adopting this on the GPU is unviable, as depicted by Figure 3. The AC data structures fill the device's global memory when the application uses 2^27 threads, far below the limit of ≈ 2^72 [36].
AC techniques that are agnostic to these hardware characteristics will not improve performance. hpac-offload employs a GPU-aware memory scheme to facilitate efficient AC for GPUs.
To reduce GPU AC's excessive resource consumption, we rely on the following key observation: while a kernel may have millions of threads, only a small subset of those threads is actively scheduled on SMs at a given time. Threads scheduled on the same SM over non-overlapping time periods can use the same memory to store internal AC state and thus substantially reduce storage requirements. We move all internal AC state to a block's shared memory to avoid device-wide locking. Since the number of active threads is bounded by the hardware specification (number of SMs, number of active threads per SM) and orders of magnitude smaller than the total number of supported threads, hpac-offload substantially reduces the resources used to store the AC internal state.
Storing the AC state in shared memory facilitates fast, local access to internal AC data structures. However, shared memory data only persist during the lifetime of the currently executing kernel, and hpac-offload approximations are thus scoped within the kernel lifetime. Once the kernel completes, the internal data are destroyed. In summary, hpac-offload limits resource consumption by dedicating part of the shared SM memory to the internal AC state to leverage the GPU memory hierarchy for performance.
We assume a memoization cache table of 5 entries; each entry is of size 36 bytes.
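The storage argument can be checked with simple arithmetic. The sketch below uses the stated table of 5 entries at 36 bytes each; the helper name ac_state_bytes is hypothetical, and the figure of 2048 resident threads per SM in the usage note is an illustrative assumption rather than a number from the text.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of AC state when each of `threads` owns a private memoization table.
std::uint64_t ac_state_bytes(std::uint64_t threads, std::uint64_t entries,
                             std::uint64_t entryBytes) {
  assert(entries > 0 && entryBytes > 0);
  return threads * entries * entryBytes;
}
```

Replicating the table for 2^27 launched threads gives ac_state_bytes(1ULL << 27, 5, 36) = 24,159,191,040 bytes (22.5 GiB). Bounding the state by resident threads instead, for example 80 SMs with an assumed 2048 resident threads each, gives ac_state_bytes(80 * 2048, 5, 36) = 29,491,200 bytes, roughly three orders of magnitude less.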
3.1.2 Prescriptive and Hierarchical GPU AC. GPU applications decompose their computation into blocks of threads, which are further decomposed into warps that consist of individual threads. These applications frequently use fine-grain synchronization and warp-level collective primitives (shuffle intrinsics) to provide high-performance implementations of complex algorithms.
The data-dependent nature of activation functions can introduce deadlocks within the application. Approximation skips the accurate execution path, which can lead to a deadlock when the accurate path contains synchronization primitives. Because of the SIMD execution model, activation functions can also introduce divergence in GPU applications. In a worst-case scenario, one thread in the warp takes the accurate execution path while all others approximate. Thus, the entire warp will wait for this single thread to finish executing, reducing the performance gains of approximation.
Such scenarios introduce deadlock and accuracy loss while limiting the broad adoption and performance benefits of AC. hpac-offload provides programming model extensions to developers to define the hierarchical level of approximation.
hpac-offload considers three hierarchy groups: (i) thread-level, where each thread has individual approximation criteria; (ii) warp-level, where all threads in a warp share collective criteria; and (iii) block-level, where all threads in a block share collective criteria. When the collective criteria are satisfied, the entire hierarchy group approximates; otherwise, all threads follow the accurate execution path.

Temporal Approximate Function Memoization (TAF)
TAF assumes that a thread runs the same function in sequence with outputs that exhibit spatial locality over time. The history size (hSize) previous outputs determine whether the accurate or approximate execution paths are taken. The structure adds control dependencies between previous loop iterations and the current iteration. HPAC, as illustrated in Figure 4(b), adopts the same design. In CPU parallel for loops, a thread executes adjacent for iterations to preserve cache locality, and therefore parallel CPU-based TAF is simple and maintains the semantics of temporal output memoization.
On the other hand, target parallel for loops are distributed among adjacent threads executing adjacent iterations. In a semantically-equivalent TAF algorithm for the GPU, threads must wait for the previous thread to terminate before deciding whether to approximate, serializing execution (Figure 4(c)). To eliminate this serialization, we relax TAF's spatial locality assumption in hpac-offload's TAF algorithm.
The hpac-offload TAF algorithm is depicted in Figure 4(d), where threads exhibit temporal output locality without requiring spatial locality. In target parallel for loops, locality manifests across the iterations of a grid-stride step. With this assumption, no inter-thread dependencies are introduced, but utilization is limited by divergence-induced idle time. Nevertheless, parallelism is increased and coalesced memory accesses are preserved, yielding a high-performance TAF algorithm for the GPU.

Input Approximate Function Memoization (iACT)
iACT presents interesting trade-offs. Larger memoization tables increase approximation likelihood [38], search costs, and memory use. CPU-HPAC allocates unique memoization tables per thread, minimizing synchronization but increasing memory overhead. To decrease memory overhead on the GPU, this scheme is adjusted to allow iACT memoization tables to be shared within a warp.
Developers control whether threads share memoization tables and, if so, the number of threads that share each table.
3.1.5 GPU-Aware Loop Perforation. Loop perforation skips a subset of a loop's iterations according to the method used. On the GPU, small and large perforation suffer performance degradation because neighboring threads in a warp take separate execution paths, introducing thread divergence. To eliminate thread divergence, we introduce herded perforation. In herded perforation, the same iterations are dropped by every thread in the grid. For instance, if we skip every third iteration, all threads in the grid will skip the respective iterations. This synchronized skipping ensures that the control flow remains uniform across all threads in a warp, thereby eliminating thread divergence. Maintaining the same perforation pattern throughout the warp reduces global memory traffic, as the memory transactions are aligned and less fragmented. Furthermore, deadlock is avoided because threads in the block take the same execution path.
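The difference between a per-iteration pattern and herded perforation can be illustrated with a host-side model of one warp in a grid-stride loop. This is a sketch under stated assumptions: the non-herded pattern is assumed to be keyed on the global iteration index, which is one plausible reading of the text, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Pattern keyed on the global iteration index (assumed non-herded behavior).
bool skip_by_index(std::size_t iteration, std::size_t n) {
  return (iteration + 1) % n == 0;
}

// Herded pattern: keyed on the grid-stride step, identical for all threads.
bool skip_herded(std::size_t step, std::size_t n) {
  return (step + 1) % n == 0;
}

// Reports whether the first warp of the grid diverges at a grid-stride step,
// i.e. whether some lanes skip while others execute.
bool warp_diverges(bool herded, std::size_t step, std::size_t gridThreads,
                   std::size_t n, std::size_t warpSize = 32) {
  assert(n > 0 && warpSize > 0);
  std::size_t skipping = 0;
  for (std::size_t lane = 0; lane < warpSize; ++lane) {
    // In a grid-stride loop, lane `lane` handles this global iteration.
    std::size_t iteration = step * gridThreads + lane;
    bool skip = herded ? skip_herded(step, n) : skip_by_index(iteration, n);
    if (skip) ++skipping;
  }
  return skipping != 0 && skipping != warpSize;
}
```

When skipping every third iteration, the index-keyed pattern splits the warp at every step, whereas the herded pattern makes all 32 lanes skip or execute together.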

Programming Model
hpac-offload uses non-intrusive pragma-based annotations for approximation, allowing AC with minimal modifications to existing code. To support the discussed extensions, we introduce a new clause level(hierarchy) that can be applied to the original HPAC approx directive. This new optional clause determines the hierarchy level in which approximation will be applied. Allowed values are thread, warp, and team. The default value is thread, closely matching the original HPAC programming model. When warp or team is used, threads collectively decide which execution path to follow.
Further, we extend the input memoization clause to accept an additional optional parameter tperwarp defining the number of tables per warp (memo(in:tsize:threshold:tperwarp)). The warp size is the default value, yielding one independent table for each thread. Lower values increase sharing among threads, increasing synchronization overhead but also reducing shared memory use.
A program using hpac-offload is shown in Figure 5. Using the pragma-based programming model, users choose and parameterize an approximation technique for a code region. To approximate a code region, the developer first chooses which category of AC technique to use (e.g., perforation or memoization), then specifies the particular type and supplies parameters to the technique.
Memoization is specified with the approx memo clause, followed by a keyword designating the memoization algorithm: either in for iACT or out for TAF. The keyword is followed by a colon-separated argument list. In line 9 of Figure 5, the developer specifies iACT with a table size of 2, a distance threshold of 0.5f, and 4 tables per warp. Warp-level decision-making is specified via level(warp). In line 10, she declares the approximated region's inputs and outputs with the array section input[i*5:5:N]. The array is a 5-dimensional vector stored in column-major format to maximize memory coalescing, so strided memory access is used with stride N.
A second function is approximated using TAF in lines 13-15 of Figure 5. In line 13, TAF is selected with a history size of 3, prediction size 5, and RSD threshold 1.5. Thread-level TAF is selected via level(thread). TAF only uses a code region's outputs, so no input is declared in line 14.
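An annotated program in the spirit of Figure 5 might look like the sketch below. It is not the paper's listing. The memo argument lists and the level clause follow the description in the text, but the in(...) and out(...) clause spellings, the argument order hSize:pSize:threshold for TAF, the scalar inputs, and the functions foo, bar, and approx_pipeline are all assumptions made for illustration. A compiler without HPAC support ignores the approx pragmas and runs the accurate code.

```cpp
#include <cassert>
#include <cstddef>

#pragma omp declare target
// Placeholder device functions; the real regions are not given in the text.
double foo(double x) { return x * x + 1.0; }
double bar(double x) { return 0.5 * x; }
#pragma omp end declare target

// Hypothetical two-kernel pipeline annotated with hpac-offload style pragmas.
void approx_pipeline(const double *input, double *mid, double *output,
                     std::size_t N) {
  assert(input != nullptr && mid != nullptr && output != nullptr);

#pragma omp target teams distribute parallel for \
    map(to: input[0:N]) map(from: mid[0:N])
  for (std::size_t i = 0; i < N; ++i) {
    // iACT: table size 2, threshold 0.5f, 4 tables per warp, warp-level
    // decision. The in()/out() clause syntax is an assumption.
#pragma approx memo(in:2:0.5f:4) level(warp) in(input[i]) out(mid[i])
    mid[i] = foo(input[i]);
  }

#pragma omp target teams distribute parallel for \
    map(to: mid[0:N]) map(from: output[0:N])
  for (std::size_t i = 0; i < N; ++i) {
    // TAF: history size 3, prediction size 5, RSD threshold 1.5 (assumed
    // argument order), thread-level decision. Only outputs are declared.
#pragma approx memo(out:3:5:1.5f) level(thread) out(output[i])
    output[i] = bar(mid[i]);
  }
}
```

Keeping the annotations as pragmas means the same source builds as an ordinary OpenMP offload program when approximation is not wanted.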

Implementation
We initially rebased HPAC to Clang/LLVM [30] version 16 and modified it with the extensions necessary for hpac-offload. We extend the parser to recognize the new clause (level) and the new identifiers defining the tables per warp. This information is lowered to the semantic analysis and embedded in an AST node for the approximation directive. The annotated code region is captured as a closure in Clang, making the accurate, non-approximate version callable as a function. We extend the device code generation for the AST node to allocate and initialize several data structures with the information needed to control and perform the approximation technique. The compiler generates a call to the runtime function whose arguments have the information needed to perform the approximation. The hpac-offload runtime system is implemented as OpenMP offload device functions, making it portable to all architectures supported by the OpenMP offload implementation.
To reduce the storage requirements of GPU AC, the internal AC state is stored in shared memory (Section 3.1.1). The amount of needed shared memory is given by the user when building the hpac-offload runtime library. At runtime, TAF and iACT use this shared memory.
In iACT, we enable table sharing by dividing table access into reading and writing phases. Threads search for input matches during the reading phase, while a single writer is chosen for each table during the writing phase. The writer is the thread with the largest Euclidean distance from any table value. A warp barrier separates phases, and we use a round-robin replacement policy.
TAF is implemented in the hpac-offload runtime system as shown in Figure 4(d). Each thread manages its private shared-memory TAF state machine and output memoization table. RSD is calculated over the outputs of successive runs of the accurate execution path over a grid-stride loop. When in the approximate state, a thread writes its newest table value to the region's output. When in the accurate state, a thread executes the accurate execution path and copies the output to the thread's output table.
To support small and large perforation at runtime, hpac-offload counts the number of times a thread has encountered the perforated code region. A thread skips the code region according to the perforation technique used. For ini and fini perforation, the compiler generates code to change the lower or upper bounds of the loop, respectively. Perforation can be applied to both omp target teams distribute loops and loops within offload kernels.
To support hierarchical decision-making, threads in a hierarchy group tally the threads whose activation criteria have been met. Using a "majority-rules" system, the entire group approximates if most of its threads meet the activation criteria. For warp-level decision-making, the ballot intrinsic identifies threads that will approximate; popcount counts these threads. For block-level decision-making, threads are counted separately in each warp using ballot and popcount. The first thread in each warp atomically adds its count to the block total in shared memory, and the total is read after all warps contribute. If the majority of threads can approximate, the entire block follows suit. Although we focus on the "majority-rules" scheme, hpac-offload can easily be extended to support others.
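The majority-rules tally can be emulated on the host to make the logic concrete. This sketch is not device code: ballot here is a plain function standing in for the GPU ballot intrinsic, a 32-lane warp is assumed, "most of its threads" is read as a strict majority, and the names warp_approximates and block_approximates are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // assumption: 32-lane warps

// Host stand-in for the ballot intrinsic: bit i is set when lane i's
// activation criterion is met.
std::uint32_t ballot(const std::vector<bool> &laneVotes) {
  assert(laneVotes.size() <= kWarpSize);
  std::uint32_t mask = 0;
  for (std::size_t i = 0; i < laneVotes.size(); ++i)
    if (laneVotes[i]) mask |= (1u << i);
  return mask;
}

// Host stand-in for popcount: number of set bits in the ballot mask.
std::size_t count_votes(std::uint32_t mask) {
  return std::bitset<32>(mask).count();
}

// Warp-level decision: approximate only with a strict majority of lanes.
bool warp_approximates(const std::vector<bool> &laneVotes) {
  return 2 * count_votes(ballot(laneVotes)) > laneVotes.size();
}

// Block-level decision: each warp contributes its count to a block total,
// which on the device would be accumulated atomically in shared memory.
bool block_approximates(const std::vector<std::vector<bool>> &warps) {
  std::size_t total = 0, threads = 0;
  for (const std::vector<bool> &warp : warps) {
    total += count_votes(ballot(warp));
    threads += warp.size();
  }
  return 2 * total > threads;
}
```

With 16 of 32 lanes voting to approximate the warp stays on the accurate path; a 17th vote tips the whole warp to the approximate path.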

EVALUATION
We evaluate the approximation techniques implemented in hpac-offload using the benchmarks listed in Table 1. We first profile each benchmark to find the longest-running offload kernel and decorate code regions within each kernel with approximation pragmas. If the two longest-running offload kernels have similar runtime, we approximate them both. We try to maximize the approximated portion of the kernel to increase potential performance benefits.
To explore accuracy-performance trade-offs and GPU parallelism, we perform a design space exploration of hpac-offload parameters and the num_teams OpenMP clause parameter. By adjusting the value passed to num_teams, we can assign more items to be computed by the same GPU thread and thus explore the interaction between parallelism and approximation. Table 2 lists all parameter values, and we explore the Cartesian product of these parameters.
Except for Blackscholes, we run each configuration 3 times and report the mean runtime. These parameters yield statistically significant results. Unless otherwise stated, we measure the end-to-end application runtime, including time transferring data between the CPU and GPU, and report speedup relative to this end-to-end runtime. To reduce overplotting, we divide the error range for each benchmark into ten equally-sized intervals. For each interval, we show the fastest and slowest 10% of configurations.
Table 1: The benchmarks used to evaluate hpac-offload.

Benchmark | Description
LULESH [24] | Hydrodynamics proxy application that models a Sedov blast problem with volumetric elements discretized onto a mesh. QoI: The final origin energy.
Leukocyte [6] | Detects and tracks rolling white blood cells (leukocytes) in video microscopy of blood cells. QoI: The final location of each leukocyte.
Binomial Options [39] | Iteratively calculates the price for a portfolio of American stock options at multiple time points before expiration. QoI: The computed prices.
MiniFE [1] | Proxy application for unstructured implicit finite element codes. QoI: The final residual of the solver.
Blackscholes [5] | Analytically calculates the price for a portfolio of European stock options. QoI: The computed prices.
LavaMD [6] | Calculates particle potential and relocation due to forces between particles in a 3D space. QoI: The final force and location of each particle.
K-Means [6] | Iterative clustering application that assigns observations to their closest cluster. QoI: The cluster id each observation is assigned to.

We showcase hpac-offload portability by evaluating it on two state-of-the-art GPU platforms. The first is equipped with 2× IBM Power9 CPUs, each with 22 cores, and 4× NVIDIA Tesla V100 GPUs, each with 80 SMs; the other has 1× AMD Epyc 7A53 CPU with 64 cores and 4× AMD Instinct MI250x GPUs, each with 220 SMs.
We evaluate hpac-offload on two metrics: end-to-end speedup over the original GPU-accelerated application, and quality loss. To quantify quality loss, we choose a Quantity of Interest (QoI) from each application and report approximation-induced error in comparison to the output of the original application. As an error metric, we use mean absolute percent error (MAPE) (1) for all applications except K-Means, for which we use the misclassification rate (MCR) (2).
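The two error metrics can be computed as in the sketch below, which uses their standard definitions. The function names are hypothetical; MAPE is returned in percent and MCR as a fraction, and MAPE assumes that no accurate output is zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute percent error between accurate and approximate outputs.
// Assumes every accurate value is nonzero.
double mape(const std::vector<double> &accurate,
            const std::vector<double> &approx) {
  assert(accurate.size() == approx.size() && !accurate.empty());
  double sum = 0.0;
  for (std::size_t i = 0; i < accurate.size(); ++i)
    sum += std::fabs((accurate[i] - approx[i]) / accurate[i]);
  return 100.0 * sum / static_cast<double>(accurate.size());
}

// Misclassification rate: fraction of labels that differ between the runs.
double mcr(const std::vector<int> &accurate, const std::vector<int> &approx) {
  assert(accurate.size() == approx.size() && !accurate.empty());
  std::size_t mismatches = 0;
  for (std::size_t i = 0; i < accurate.size(); ++i)
    if (accurate[i] != approx[i]) ++mismatches;
  return static_cast<double>(mismatches) /
         static_cast<double>(accurate.size());
}
```

For example, accurate outputs {2, 4} against approximate outputs {1, 4} give a MAPE of 25%, and one wrong label out of four gives an MCR of 0.25.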
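$$\mathrm{MAPE}(\vec{x}, \vec{\hat{x}}) = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{x_i - \hat{x}_i}{x_i} \right| \qquad (1)$$
$$\mathrm{MCR}(\vec{x}, \vec{\hat{x}}) = \frac{1}{n} \sum_{i=1}^{n} I[x_i \neq \hat{x}_i] \qquad (2)$$
$\vec{x}$ and $\vec{\hat{x}}$ represent the outputs of the accurate and approximate executions, respectively, each with $n$ elements. $I[p]$ is the indicator function that returns 1 if and only if $p$ evaluates to true.
[...] best-performing approximation technique that meets the imposed accuracy constraint, whereas input memoization performs worst.
Figure 7 depicts LULESH results for all approximation techniques on both systems. We approximate the two most computationally expensive kernels: CalcFBHourglassForceForElems and CalcHourglassControlForElems.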

Benchmark Results
Perforation accelerates LULESH by up to 1.64× on NVIDIA and 1.67× on AMD with less than 7% MAPE. fini perforation induces less error than ini perforation, indicating that the first iterations of the simulation contribute more to the output than the final ones.
In contrast to perforation, memoization techniques have less performance benefit but achieve much lower error. For TAF, we observe speedup up to 1.30× on NVIDIA and up to 1.45× on AMD with 0.67% MAPE. iACT yields a lower error but cannot match TAF's performance. Specifically, iACT speedup is up to 1.07× and up to 1.15× for NVIDIA and AMD, respectively, with 0.3% MAPE.
In Binomial Options, an entire block collaboratively computes the price of a single option, and therefore we only use block-level decision-making. Both memoization techniques introduce minimal errors with large performance benefits, indicating an ideal candidate for AC that demonstrates redundancy in the dataset which hpac-offload can successfully exploit. On the NVIDIA platform, TAF achieves up to 6.90× speedup with 1.40% MAPE (Figure 8a), while iACT achieves up to 5.64× speedup with 1.42% MAPE (Figure 8b).
Approximate memoization depends on repeated executions of the same code region. For every execution, the approximate execution path may be activated, potentially increasing performance. OpenMP offload clauses allow developers to control the number of parallel threads that execute a region. Fewer threads result in more code region executions per thread, thereby increasing approximation potential, but more threads improve GPU latency hiding because there are more active warps. Thus, there is a conflict between parallel GPU execution and approximation benefits. For problem size N, approximation potential increases with the number of items computed by a thread. Maximizing approximation requires one thread to execute all N iterations, whereas maximum parallelism requires N threads, each executing one iteration.
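The tension between parallelism and approximation potential can be made concrete with a small calculation. The helpers below are hypothetical: items_per_thread is plain ceiling division, and max_approximable is only an upper bound for TAF, derived from the observation that a thread must compute at least hSize outputs accurately before it can enter a stable regime.

```cpp
#include <cassert>
#include <cstddef>

// Number of work items each thread handles when N items are spread evenly.
std::size_t items_per_thread(std::size_t N, std::size_t threads) {
  assert(threads > 0);
  return (N + threads - 1) / threads;
}

// Upper bound on the invocations a single thread could approximate with TAF:
// the first hSize invocations must run accurately to fill the history window.
std::size_t max_approximable(std::size_t itemsPerThread, std::size_t hSize) {
  return itemsPerThread > hSize ? itemsPerThread - hSize : 0;
}
```

With N = 4096 and one item per thread (4096 threads), nothing can be approximated; with 2048 items per thread (2 threads) and hSize = 3, up to 2045 invocations per thread are candidates, at the cost of almost all parallelism.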
Figure 8c demonstrates this tradeoff, showing the speedup from approximation for Binomial Options vs. the number of work items per thread. On NVIDIA, speedup increases up to 2048 items per thread and then declines. On AMD, the decline starts after 1024 items per thread. AMD performance decreases earlier because the AMD GPU has more SMs than the NVIDIA GPU, and thus needs more thread blocks executing to hide latency. Speedup declines even though the percentage of approximated calculations approaches 100. To achieve the best performance with approximation, it is important to balance approximation potential with the device's ability to hide latency.
In Leukocyte, we approximate the IMGVF matrix calculation and observe minimal quality reduction. TAF memoization (Figure 9a) results in speedups up to 1.99× with an error of just 1.12%. In contrast to TAF, iACT reduces error but always slows down the application (Figure 9b). The benefits of approximation are outweighed by the cost of the cache lookups and Euclidean distance calculations in every invocation of the approximated function.
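The per-invocation cost of iACT can be seen in a minimal Python sketch of input memoization. This is an illustration under stated assumptions, not the hpac-offload implementation: the class name `InputMemoTable`, the FIFO replacement policy, and the nearest-entry matching rule are all hypothetical choices.

```python
import math


class InputMemoTable:
    """Toy input-memoization table keyed by Euclidean distance."""

    def __init__(self, capacity, threshold):
        self.capacity = capacity
        self.threshold = threshold
        self.entries = []  # (inputs, output) pairs, oldest first

    def call(self, func, inputs):
        """Return (output, approximated) for one invocation."""
        # This scan runs on every invocation, hit or miss, which is
        # the overhead discussed in the text.
        best_out, best_dist = None, math.inf
        for cached_in, cached_out in self.entries:
            dist = math.dist(cached_in, inputs)
            if dist < best_dist:
                best_out, best_dist = cached_out, dist
        if best_dist <= self.threshold:
            return best_out, True
        # Miss: run the accurate path and cache the result.
        out = func(*inputs)
        if len(self.entries) == self.capacity:
            self.entries.pop(0)  # FIFO replacement (an assumption)
        self.entries.append((tuple(inputs), out))
        return out, False


table = InputMemoTable(capacity=2, threshold=0.1)
add = lambda x, y: x + y
first = table.call(add, (1.0, 2.0))    # miss, computed accurately
second = table.call(add, (1.05, 2.0))  # close to a cached input, reused
third = table.call(add, (5.0, 5.0))    # far from all cached inputs
```

When the approximated function is cheap relative to the table scan, as in Leukocyte, the lookup cost can exceed the savings from skipped computations.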
In MiniFE, sparse matrix multiplication is approximated, resulting in locally introduced errors that propagate through subsequent iterations, causing high error rates (between 593% and 3.43 × 10²²%, Figure 9c). iACT is not suitable because the number of non-zero values in the CSR matrix, and hence the input size, varies across threads. hpac-offload only supports computations with uniform input sizes for all threads.
In Blackscholes, 99% of the time is spent in memory allocations and data transfers between the host and device. Because we approximate only the device kernel (the entire price calculation of an option), we present kernel performance only. TAF is extremely effective, with speedups of up to 2.26× on the AMD platform with 0.015% MAPE (Figure 10a). Performance is best when prediction size and threshold are high, suggesting that the aggregate output table across all threads represents the data well.
Blackscholes demonstrates the weaknesses in TAF's activation criterion (RSD). One would expect application error and performance to increase with RSD, but Figure 10a shows otherwise. Figure 10c compares prices from Blackscholes using TAF to the original benchmark at different threshold values. For thresholds below 3.0, approximation is activated, and the error is high. At a threshold of 3.0, the approximated values closely match the original, yielding low error.
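For readers unfamiliar with the RSD activation criterion, the following Python sketch shows one way a TAF-style output memoizer can work. It is a sketch under assumptions: RSD is computed as the population standard deviation divided by the absolute mean, the most recent accurate output is the value that gets reused, and the history is cleared after activation. The class name `TAF` and its parameters mirror the terms used in the text (history size, prediction size, RSD threshold), but the code is not the hpac-offload runtime.

```python
import math
import statistics


class TAF:
    """Toy temporal output memoization with an RSD activation test."""

    def __init__(self, history_size, prediction_size, rsd_threshold):
        self.history_size = history_size
        self.prediction_size = prediction_size
        self.rsd_threshold = rsd_threshold
        self.history = []
        self.remaining = 0  # approximate invocations still allowed
        self.memo = None

    def call(self, func, *args):
        """Return (output, approximated) for one invocation."""
        if self.remaining > 0:
            self.remaining -= 1
            return self.memo, True
        out = func(*args)
        self.history.append(out)
        if len(self.history) > self.history_size:
            self.history.pop(0)
        if len(self.history) == self.history_size:
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history)
            # RSD is undefined for a zero mean; stay accurate then.
            rsd = std / abs(mean) if mean != 0 else math.inf
            if rsd < self.rsd_threshold:
                self.memo = out  # reuse the latest output (assumption)
                self.remaining = self.prediction_size
                self.history.clear()
        return out, False


taf = TAF(history_size=2, prediction_size=3, rsd_threshold=0.5)
constant = lambda x: 10.0
flags = [taf.call(constant, i)[1] for i in range(7)]
```

Because the decision is made once per history window and then reused for `prediction_size` invocations, the decision cost is amortized. The criterion only inspects recent outputs, however, so a window that happens to be stable can activate approximation even when later outputs differ.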
In LavaMD, the force calculation for neighboring boxes is approximated. As shown in Figure 11a, TAF offers significant performance gains (2.98× speedup) with minimal errors (0.133%). Higher RSD thresholds and prediction sizes yield better results, similar to Blackscholes. Figure 11b indicates that iACT has lower error than TAF but slows down the application. This is due to the higher cost of accessing the shared cache table and computing Euclidean distances, compared to the original force computation.
Warp-level decision-making can increase performance. In Figure 11c, we contrast the speedup obtained in LavaMD by thread-level and warp-level decision-making for different RSD thresholds. Warp-level decision-making, for a given threshold, increases the median speedup by up to 2.27×. In warp-level decision-making, threads decide as a group whether to approximate, eliminating approximation-induced thread divergence. However, divergence elimination is not the only cause of performance benefits. Warp-level decision-making can force threads with RSD values larger than the threshold to approximate because they are within the minority of the group. In such cases, hpac-offload increases approximation.
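The contrast between thread-level and warp-level decision-making can be sketched in Python as a majority vote over each warp. The text states that minority threads follow the group; the strict-majority rule, the tie-breaking toward accurate execution, and the function names are assumptions made for illustration only.

```python
def thread_level_decisions(rsd_values, threshold):
    """Each thread decides on its own whether to approximate."""
    return [rsd < threshold for rsd in rsd_values]


def warp_level_decisions(rsd_values, threshold, warp_size=32):
    """All threads of a warp follow the majority vote of that warp."""
    decisions = []
    for start in range(0, len(rsd_values), warp_size):
        warp = rsd_values[start:start + warp_size]
        votes = sum(rsd < threshold for rsd in warp)
        # Strict majority; ties fall back to accurate execution
        # (an assumption).
        approximate = votes * 2 > len(warp)
        decisions.extend([approximate] * len(warp))
    return decisions


# A toy warp of four threads: three are below the threshold, one is above.
rsd = [0.1, 0.1, 0.1, 0.9]
per_thread = thread_level_decisions(rsd, 0.5)
per_warp = warp_level_decisions(rsd, 0.5, warp_size=4)
```

In the toy warp, thread-level decisions diverge (three threads approximate, one does not), while the warp-level vote makes all four approximate. This removes divergence and also raises the approximation rate, which matches the two sources of speedup described above.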
K-Means. In K-Means, we approximate the kernel computing the Euclidean distance of observations with the current clusters. Although this kernel accounts for 3.5% of the total benchmark execution time, we observe substantial performance benefits (3.5× speedup). The introduced inaccuracies cause observations to stay in the same cluster, affecting the convergence criterion. K-Means converges when no observations change cluster. Observations are herded to the same cluster by memoization techniques that use

hpac-offload limitations
Approximation speedup and error are influenced by code region sensitivity and input data, which can vary significantly across applications and the situations in which they are applied. In this work, we study methods and algorithms for porting state-of-the-art AC techniques to GPUs without investigating how different datasets affect error. Given representative input data, the hpac-offload execution harness allows exploring whether AC yields tolerable error for a given application and input data set. The user is faced with a large parameter space over which the hpac-offload execution harness performs an exhaustive search, costing compute time⁶ to perform the experiments and end-user time to process the results and decide which code regions to approximate. Given this cost, we believe there is considerable value in work that automates the end-to-end workflow. Such automation could integrate with sensitivity analysis tools [31,37,53] to find code regions amenable to approximation and smart search/optimization techniques (genetic algorithms, Bayesian optimization) to reduce parameter exploration costs.
Despite these limitations, hpac-offload's prescriptive programming model enables our extensive analysis of general-purpose approximate computing for GPUs in HPC to derive deep insights that can guide current use and future systems.

Insights
Our analysis, conducted with 57,288 configurations, is, to our knowledge, the most comprehensive study of GPU-accelerated AC ever performed. This analysis and our experience approximating HPC-based applications on the GPU yield several insights into the interplay between approximate computing and GPU-based parallelism:
(1) State-of-the-art approximation techniques adapted to GPU architectures can significantly improve the performance of GPU-accelerated HPC applications (Figure 6). Each application has unique, unpredictable performance-accuracy trade-offs.
(2) Approximation techniques accelerate applications while reducing available parallelism (Figure 8c). This trade-off is critical as AC affects the GPU's ability to hide latency. Speedup for TAF and iACT decreases as the number of SMs in the GPU increases.
(3) TAF RSD behaves differently in each application in ways that can produce unintuitive results (Figures 8a, 10a, 10c). Predictable interactions between approximation and speedup/quality loss are important for AC use in production codes.
(4) TAF has higher speedup than iACT (Figures 8-12). iACT must always pay the cost of deciding whether to approximate, while TAF can amortize the decision-making cost by approximating multiple times.
(5) Load imbalance caused by control divergence within a warp can degrade performance in GPU-based AC (Figure 11c). Hierarchical decision-making improves performance by eliminating control divergence.
(6) iACT is slower than TAF but introduces less error, suggesting that matching based on Euclidean distance is more suitable for applications than RSD values. Similar to custom hardware to accelerate machine learning, hardware support for approximate memoization tables can help realize the accuracy benefits of iACT with improved performance.

⁶ All experiments for a given benchmark took up to 988 GPU hours (median 1.35 hours).

RELATED WORK
Approximate Computing has gained significant attention in recent years as a promising approach for improving performance and energy efficiency. Strategies range from lower-level hardware techniques, such as approximate floating-point multipliers [12,28,41], inexact adders [18,22], and voltage scaling [8,10,19,47], to higher-level software techniques, including loop perforation [15,21], function memoization [25,32,44], and mixed precision [17,23,26,29,31,43]. Our work is complementary to those studies as we introduce a new framework that enables the use and evaluation of approximate techniques on GPUs.
Several frameworks facilitate approximate computing on the CPU. For example, GREEN [3] is an API that enables loop and function-level approximations and generates functions for application-specific error models. ACCEPT [46] is a compiler framework developed on top of the LLVM compiler infrastructure. It automatically selects the best approximation strategy for a given quality threshold provided via programmer annotations. In [52], the authors create a task-based programming model allowing users to specify task significance and provide approximate versions of tasks. ApproxHPVM [48] uses an intermediate representation to enable accuracy-aware compiler optimizations for ML applications. The authors in [49] extend ApproxHPVM with an auto-tuning framework for approximation-aware optimizations of tensor-based applications. HPAC [38] is a framework that supports approximations in OpenMP applications and studies the interaction between approximate techniques and parallelism. While these works cater to CPU applications, we focus on GPU kernel approximations.
There have also been frameworks designed for GPU applications. SAGE [45] is a compiler and a runtime system that targets machine learning and image processing applications. It automatically generates approximate kernels for GPUs and selects suitable kernels to approximate at runtime to meet a user-provided quality threshold. TruLook [14] is a configurable approximation framework that accelerates GPU applications. Using an approximate multiplier, it supports computational reuse via a lookup table and approximate arithmetic operations. While TruLook provides hardware support for approximation, our framework is designed for software approximation techniques. The authors of [27] evaluated several applications on CPUs and GPUs, studying the cumulative effect of heterogeneity and approximations on energy footprint and output quality. Our work, on the other hand, is focused on improving performance.
Besides those frameworks that provide programmatic support for approximation, others help users decide where to approximate. The ASAC [42] framework automatically identifies approximable data in a program via sensitivity analysis. The authors of [53] employ profiling and interval analysis to create a programming model that finds code amenable to approximation. Puppeteer [37] measures code region sensitivity using uncertainty quantification and ranks the regions according to their sensitivity.

CONCLUSION
This paper introduces hpac-offload, a portable pragma-based programming model that extends OpenMP offload applications to support Approximate Computing techniques in GPU-based HPC systems with minimal code changes. Our comprehensive analysis demonstrates that state-of-the-art approximation techniques, when adapted to fit the distinct architectural features of GPUs, can significantly improve the performance of GPU-accelerated HPC applications with minimal quality loss. We demonstrate that hpac-offload achieves up to 6.9× speedup on HPC-centric workloads while introducing less than 10% error.
The insights provided in this paper contribute to a deeper understanding of the interplay between approximate computing and GPU-based parallelism, guiding the future development of AC algorithms and systems for these architectures. For instance, hierarchical decision-making can help eliminate control divergence. As the end of Dennard scaling and the slowdown of Moore's law challenge the advancement of computing performance, approximate computing in HPC systems offers a promising avenue to continue pushing the boundaries of performance.

Figure 2: An example of the HPAC programming model.

Figure 3: Number of threads (x-axis) vs. the percent of an NVIDIA V100 GPU's 16GB of global memory needed to store per-thread memoization tables (y-axis). We assume a memoization cache table of 5 entries; each entry is of size 36 bytes.
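The quantity plotted in Figure 3 follows from simple arithmetic, sketched below in Python. The table shape (5 entries of 36 bytes each) comes from the caption; interpreting 16GB as 16 GiB and the function name `table_memory_percent` are assumptions of this sketch.

```python
def table_memory_percent(num_threads,
                         entries_per_table=5,
                         bytes_per_entry=36,
                         global_memory_bytes=16 * 2**30):
    """Percent of global memory used by private per-thread tables.

    Assumption: "16GB" is read as 16 GiB (16 * 2**30 bytes).
    """
    table_bytes = entries_per_table * bytes_per_entry  # 180 bytes
    total_bytes = num_threads * table_bytes
    return 100.0 * total_bytes / global_memory_bytes


# Roughly one million threads with private tables.
percent_for_1m_threads = table_memory_percent(2**20)
```

Each private table costs 180 bytes, so the footprint grows linearly with the thread count: about one million threads already occupy roughly 1.1% of global memory under these assumptions.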

Figure 4: Different algorithmic TAF implementations using 2 for the prediction and history sizes for the parallel loop shown in Panel (a) on CPU and GPU architectures. CPU threads execute contiguous, non-overlapping loop iterations; adjacent GPU threads execute adjacent iterations. In (b), each CPU thread independently manages TAF state and controls approximation using the activation function. (c) The semantically equivalent GPU version fulfills TAF's spatial locality assumption, but GPU threads remain idle awaiting activation criteria fulfillment. (d) shows the hpac-offload TAF algorithm, where GPU threads independently manage TAF state, but spatial locality is violated.

Figure 6: The highest speedup observed in our exploration for a maximum error of 10% on both systems. The figure depicts a clear trend in which TAF approximate memoization typically performs best.

Figure 7 (panel titles): LULESH perforation results on NVIDIA; LULESH perforation results on AMD; LULESH iACT results on AMD.

Figure 8: TAF and iACT results for Binomial Options (a), (d). In (c) the color scale indicates the percent of total price calculations that are approximated. Although the percent of approximated calculations approaches 100, speedup decreases as insufficient parallelism is available to hide latency.

Figure 9: TAF and iACT results for Leukocyte (a), (b), and TAF for MiniFE on NVIDIA. iACT is not applicable to MiniFE.

Figure 10: Results of Blackscholes on the AMD system. Panel (c) shows the distribution of the exact output prices vs. prices calculated using TAF with history size 5, prediction size 512, and varying RSD thresholds. When the RSD threshold is too low, unrepresentative values are output and the error is high. TAF RSD interacts with the application to produce unintuitive results.

Figure 11: LavaMD: results when approximated using TAF (a) and iACT (b). In (c) the paired line boxplot depicts the speedup on AMD. Lines connect points whose configurations differ only in decision hierarchy, and the boxplots show the speedup distribution for a given threshold and hierarchy level. Warp-level decision-making can increase speedup by eliminating control divergence.

Figure 12: K-Means: (a) TAF and (b) iACT results. (c) In K-Means, speedup comes primarily from early convergence. Time speedup (y-axis) vs. convergence speedup (x-axis), where convergence speedup is the ratio of the number of iterations in which the non-approximated kernel converges to the number of iterations in which the approximated kernel converges.

Table sharing allows us to trade off memory overhead, serialization, and synchronization against the benefits of a larger aggregate table size. This extension has a few advantages: (1) warp-level sharing reduces synchronization, requiring coherence only among threads sharing a table; (2) warp-level sharing allows threads to access computed values from adjacent threads, boosting approximation rates; (3) balancing memory use and synchronization/serialization overhead is possible through sub-warp table sharing. Private per-thread tables increase parallelism and memory use. Conversely, sharing one table per warp reduces memory use but requires synchronized access to prevent data races. Our design allows us to explore these trade-offs.

Table 2: Parameters used in our evaluation. Those listed under "Memo" are used for both memoization techniques. Only the AMD platform uses 64 tables per warp. small and large perforation use Items per Thread.
techniques to GPUs without investigating how different datasets affect error. Given representative input data, the hpac-offload execution harness allows exploring whether AC yields tolerable error for a given application and input data set.