A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code

Various kinds of applications take advantage of GPUs through automation tools that attempt to automatically exploit the available performance of the GPU's parallel architecture. Directive-based programming models, such as OpenACC, are one such method that easily enables parallel computing by just adhering code annotations to code loops. Such abstract models, however, often prevent programmers from making additional low-level optimizations to take advantage of the advanced architectural features of GPUs because the actual generated computation is hidden from the application developer. This paper describes and implements a novel flexible optimization technique that operates by inserting a code emulator phase to the tail-end of the compilation pipeline. Our tool emulates the generated code using symbolic analysis by substituting dynamic information and thus allowing for further low-level code optimizations to be applied. We implement our tool to support both CUDA and OpenACC directives as the frontend of the compilation pipeline, thus enabling low-level GPU optimizations for OpenACC that were not previously possible. We demonstrate the capabilities of our tool by automating warp-level shuffle instructions that are difficult to use by even advanced GPU programmers. Lastly, evaluating our tool with a benchmark suite and complex application code, we provide a detailed study to assess the benefits of shuffle instructions across four generations of GPU architectures.


Introduction
Effectively utilizing the vast amount of computational performance available in modern supercomputers remains a challenge to this day.Hardware, middleware, and parallel algorithms should be carefully orchestrated so that ideal efficiency may be obtained for solving large real-world problems in high-performance computing (HPC).Compiler technologies are developed with highly-automated program optimizations that use domain-specific knowledge and target architecture specialization to solve a part of this puzzle.With the end of Moore's Law [19] approaching, the focus on supercomputing technology is shifting toward even more specialized accelerators, which in turn increases their complexity.This trend further signifies the importance of compiler technology to relieve programmers from the burden of understanding the complex architecture of modern accelerators to be able to efficiently optimize their applications.
Currently, Graphics Processing Units (GPUs) are the most widely adopted accelerator technology, as these are present in seven out of the top 10 systems in the TOP500 list [29].GPUs work for accelerating application execution time through their highly parallelized yet cooperative architecture.To benefit the most from GPUs, however, programmers must be proficient in writing complex low-level GPU code, often a largely time-consuming task.To overcome the complexity of low-level GPU code development, pragma-based programming models such as OpenACC/OpenMP [3,24] have been developed or adapted to be able to automatically retarget existing code for acceleration.Although these automation tools have improved the utilization of GPU acceleration by many different types of applications, they lack the ability to benefit from low-level architecture-specific optimizations.One such type of optimizations is the use of warp-level primitives, which have been available since NVIDIA Kepler GPUs.Warp-level primitives, such as shuffle operations, may be used to fill a gap between threads and thread-blocks working as collaborative mechanisms, instead of relying on shared and global memory accesses.
The main operation across the warp is the shuffle, which delivers computed elements to neighbor threads to suppress the redundancy of computation and memory accesses.However, as many existing efforts [5,7,13,25] have demonstrated, those primitives often require non-trivial modification of algorithms in the fundamental part of their codes.Since the latency of the shuffle is similar to that of shared memory loads [7] (apart from storing and synchronization), it may serve as a cache system, holding data in registers [5].However, the effectiveness of this technique is still unknown when disregarding domainspecific knowledge.
Our work provides a middle-end environment to extend the code of the NVIDIA GPU assembly PTX and enables, for the first time in the literature, automatic shuffle synthesis to explore the opportunity of this operation.Our environment, PTXASW § (Wrapper of PTX optimizing ASsembler), addresses the entire computational flow of PTX, leveraging a symbolic emulator that can symbolically extract memory-access patterns.We introduce a Satisfiability Modulo Theories (SMT) solver to prune avoidable control flows while tracking down the register update.
Following the emulating results, PTXASW utilizes the solver and detects the global-memory loads that are possible to be covered by the shuffle operation.Around those loads, additional instructions are implanted, while supporting corner cases and circumventing overheads.We conduct the shuffle synthesis on an OpenACC benchmark suite, a directive-based programming model having no user exposure to warp-level instructions.Our implementation functions as a plugin of the compilation tool yielding moderate overhead.
Applying our technique, we find various opportunities to enable the shuffle over the original code of the benchmarks.The performance improvement achieved is up to 132% with no user intervention on the NVIDIA Maxwell GPU.Additionally, based on the results of the experiments using several generations of GPUs, we analyze the latency caused for the shuffle operations to provide guidelines for shuffle usage on each GPU architecture.In summary, the contributions of our work are: 1. We create a symbolic emulator to analyze and optimize GPU computing code, equipped with an SMT solver for the comparison of symbolic expressions, induction variable recognition for loops, and various optimizations to reduce overheads.
2. Through symbolic analysis, we automatically find the possible cases to utilize the shuffle operation, which previously required in-depth domain knowledge to be § The artifact is available at https://github.com/khaki3/ptxas-wrapper.
applied.Then, we synthesize those to the applications, while avoiding expensive computation.
3. Using a directive-based programming model, we generate various shuffle codes on several generations of GPUs and show the cases that attain performance improvement with no manual effort.
4. We show the latency breakdown of the optimization on each GPU architecture and provide general guidelines for the use of shuffle operations.
Our work is the first attempt at general utilization of shuffles.Although manual warp-level operations often contributed to domain-specific optimizations, the metrics to be addressed by warp-level efforts have not been studied.Even when computation or memory accesses are reducible, the trade-offs have remained unknown to date, especially when thread divergence is involved.
The rest of the paper is structured as follows.Section 2 provides the necessary background on GPUs for general-purpose computing, PTX code, and shuffle operations.Section 3 provides a high-level overview of our work.Sections 4 and 5 describe our symbolic emulator and shuffle synthesis, while Section 6 details our overall methodology.Sections 7 and 8 provide the results of our experimental evaluation and in-depth analysis.Section 9 discusses previous related work and Section 10 provides concluding remarks.

Background
This section provides the necessary background on GPUs for general-purpose computing, low-level PTX code, and warp-level shuffle operations.

GPUs
A Graphics Processing Unit (GPU), is a massively parallel accelerator architecture having with several computational and communication layers.The minimum execution unit is a thread.Each thread can collaborate with other threads bound to a certain thread-block and grid, through per-block shared memory and/or grid-wise global memory.The architecture is composed of many streaming multiprocessors (SMs), which execute distributed thread-blocks in groups of threads (usually 32), called warps.Using inner parallel processing units, the SM takes advantage of instruction-level parallelism (ILP), as well as parallelism among warps and thread-blocks.Since the memory-access latency increases through the levels of the memory hierarchy, the concept of locality is highly respected for performance, while locality optimizations bring additional synchronization and resource use to programs.Warp-level primitives, available since the NVIDIA Kepler generation of GPUs, allow for the communication among threads within the same warp [21], avoiding access to either shared or global memory.
All threads execute the same program code, known as GPU kernels, customarily written in CUDA [22] for NVIDIA GPUs, in a single-instruction multiple-data fashion.Threads operate on different data, specified in kernels by programmers, deriving from thread and thread-block identifiers.Kernels accept arguments, and the number of threads and thread-blocks is specified as variables.

NVIDIA PTX
User-level code implemented manually in CUDA or OpenACC is brought to execution on GPUs through NVIDIA PTX [23], a virtual machine and ISA for general-purpose parallel thread execution.PTX programs feature the syntax and sequential execution flow of assembly language.Thread-specific variables are replicated to be run over SMs in parallel using the same program but different parameters.Since the actual machine code (SASS) cannot be modified from official tools [35], PTX is the nearest documented and standard GPU code layer that may be modified.
PTX code consists of kernel and function declarations.Those have parameters and instruction statements along with variable declarations, labels, and predicates.Listing 2 provides the CUDA-generated PTX kernel from Listing 1. Variable declarations from several data spaces and types correspond to the usage of on-chip resources, especially In the unidirectional shuffle, the delta part, which has no source lane from the same warp, will be unchanged and obtain a false value in the resultant predicate (%p1); only the active threads (%mask) of the same control flow participate in the same shuffle.Inactive threads or threads from divergent flows produce neither valid results nor predicates to destination lanes.Each operation is accompanied by the warp-level synchronization, some of which are optimized away during compilation.While shuffle instructions allow for sub-warp granularity, our paper focuses on the unidirectional instruction with 32 threads using 32-bit data, as applying sub-warp granularity to applications tends to feature corner cases and suffers from exception handling for intricate patterns.Listing 3. The use of shfl.sync in PTX Table 1 shows the latencies (clock cycles) of shared memory (SM; no-conflict) and L1 cache as reported by [16], besides that of shuffle, from a microbenchmark based on [33].In the table, Kepler is NVIDIA Tesla K80, Maxwell is M60, Pascal is P100 and Volta is V100, while Tesla K40c/TITAN X are used for the shuffle of Kepler/Maxwell.This table reveals that shuffle brings benefits over shared   1. Latencies (clock cycles) as reported by [16,33] 3 Overview Our work PTXASW can substitute the original PTX assembler, which accepts input code from arbitrary sources.We do not rely on specific information of any certain language or any certain generation of GPU architecture.
Figure 1 provides a high-level overview of PTXASW's execution flow.PTXASW primarily aims at shuffle synthesis on PTX code.The input is produced by user-level code compilers, while directive-based programming models (OpenACC/OpenMP) do not expose control over warp-level operations, and CUDA prevents code extension due to its code complexities.Once PTXASW inserts shuffles, the resultant code is assembled to GPU binary by the original PTX assembler.
PTXASW emulates the PTX execution based on the input.Since runtime information is not provided, we employ symbolic evaluation for each operation.First, 1 register declarations are processed to be mapped in a symbolic register environment (described in Section 4.1).Second, 2A for each statement of PTX instructions, a corresponding operation is performed to update registers (Section 4.1).While continuing the execution, 2B PTXASW gathers branch conditions for avoiding unrealizable paths (Section 4.2) and creates memory traces (Section 4.3).When the entire emulation is finished, 3 we discover shuffle opportunities from memory traces (Section 5.1).Finally, 4 we insert shuffle operations to the input code (Section 5.2); then, the generated code is consumed by the original PTX assembler.

Symbolic Emulator
Analysis of high-level code has posed questions about its applicability to abstract program structures or other user-level languages.While high-level code analysis may process intact code information, enormous engineering efforts are required just for specific forms within one language [13,32].Therefore, virtual machines are utilized for providing a cushion between real architectures and user codes.In particular, analysis and optimization of the virtual-machine code tend to be reusable without the restriction of input types [12,14,34].
Our work uses PTX as the virtual machine layer and performs general analysis through code emulation.We introduce symbolic emulation to encapsulate the runtime information in symbol expressions and compute concolic (concrete + symbolic) values for each register.Although a number of previous work have been conducted on symbolic emulation for the purpose of software testing [4], our work (PTXASW) especially aims at code optimization of memory access on GPUs, since it is often regarded as one of the bottlenecks of GPU computing [7].Those computed values are utilized for code generation as described in Section 5.

Instruction Encoding
Since the subsequent PTX assembler, while generating SASS code, will eliminate redundant operations and resources, we may abundantly use registers while not causing register pressure by unnecessary data movement outside of the static single assignment form (SSA). First, PTXASW recognizes variable declarations and prepares a symbolic bitvector of the corresponding size for each register.Since arithmetic calculation and bitwise operations are supported on the combination of concrete and symbolic bitvectors, we encode each PTX instruction as the computation over vectors.For example, addition for 16-bit vectors is encoded as in the following pseudocode: With the add instruction corresponding to the above calculation, we detect the instruction type and source registers (%a, %b) and compute the result: add .u16 %c , %a , % b ; // dst : % c ; src : %a , % b Then, having the binding with the name of the destination register (%c), we keep the computed value in the register environment.PTXASW defines each instruction to update the destination registers according to the instruction options and types, and those registers may be fully concrete with the movement or computation from constant values.Also, to support floating-point instructions, we insert the conversion by uninterpreted functions at loading and storing bitvectors to and from floating-point data.Regarding casting operands among integer types and binary types, truncating or extending is performed based on the PTX specification.The computational instructions under predicates issue conditional values in registers.Since registers are not used before initialization, these always have evaluated values, except for special registers, such as thread IDs and uninterpreted functions of loops and memory loads, which are described in following sections.

Execution Branching
Branching is caused by jumping to labels under binary predicates that are computed by preceding instructions.Since inputs and several parameters are unknown at compilation time, unsolvable values of predicates are often observed leading to undetermined execution flows where computation is boundless.Thus, we abstract the repeated instructions in the same execution flow.At the entry point to the iterative code block, we modify each iterator of the block to have uninterpreted functions with unique identities and perform operations only once upon those uninterpreted functions.Since those uninterpreted functions produce incomparable values, we clip the initial values out and add them to registers containing uninterpreted functions at the block entry, for better accuracy in the case of incremental iterators to be found by induction variable recognition [10,11].
We continue each branching while duplicating the register environment for succeeding flows.All the flows finish at re-entry to iterative blocks or at the end of instructions, completing their own results.The symbolic expressions in predicates used at the prior divergence are recorded as assumptions while updating those predicates, to have constant booleans in the register environment, based on whether it is assumed as true.Conflicting values in assumptions are removed according to an SMT solver (Z3 [9]) when new expressions are added.If the destination of a new branch can be determined providing assumptions to the solver, unrealizable paths are pruned for faster emulation.Also, we skip redundant code-block entry bringing the same register environment as other execution flows by memoization, to force new results at each entry.

Shuffle Synthesis
Mapping programs over thread-level parallelism, while pursuing the performance of modern complex architectures and ensuring correctness, is a far-from-easy task.Most likely, existing GPU programs are already optimized in terms of resource use and scheduling, which does not smoothly allow for further optimization, especially at the low-level code.The shuffle operation performs at its best when the communication is fully utilized [25], but such cases are not common in compiler-optimized code or even manually-tuned code in HPC.The big trouble is corner cases.Not only halo, but fractional threads emerged from rounding up dynamic input sizes, demand exceptional cases to be operated on GPUs.While the generality and applicability of GPU shuffle instructions for all types of applications or computational patterns are yet unknown, the level of difficulty in manually applying shuffle instructions in different cases adds further hardness to the already complex task of understanding the true nature of the performance of shuffle operations.
Hence, we implement automatic shuffle synthesis through PTXASW to drive the lower-latency operations seen in Section 2.3, while supporting corner cases and covering global-memory loads with warp-level communication.PTXASW is accordingly extended to seek shuffle candidates among loads, and embed shuffle instructions into code while alleviating register pressure.

Detection
Warps are comprised of neighboring threads.We do not consider adjacent threads in non-leading dimensions, since those tend to generate non-sequential access patterns.Upon finding a global-memory load, PTXAS compares its load address to those of previous loads found through the same execution flow and not invalidated by any store.If for all threads in a warp the load is overlapped with existing loads, those instructions are recorded as possible shuffle sources.To utilize a load with an address represented as (%tid.x)for another having the address (%tid.x),there must exist an integer  such that (%tid.x+  ) = (%tid.x)and −31 ⩽  ⩽ 31.For example, when  = 0, the load can be fully utilized in the same thread.When  = 1, we can adapt the shfl.sync.downinstruction to convey existing register values to next threads while issuing the original load for the edge case (%warp_id = 31).In the case of the memory trace in Listing 5, the load accesses of w0(i-1, j+1) and w0(i-1, j-1) are uniformly aligned with the close addresses to each other, so we can search the variable  , which satisfies the above condition, by supplying  along with those addresses to the solver and find  = −2.
We make sure that each shuffle candidate has the same  as a shuffle delta in all the execution flows.This delta must be constant regardless of runtime parameters.Since the steps of loop iterators in PTX code could be any size (e.g.NVHPC Compiler uses the thread-block size), shuffles are detected only in straight-line flows, whereas live variable analysis is employed to exclude the case in which source values possibly reflect a different iteration from the destination.For faster analysis, we construct control-flow graphs before shuffle detection, while pruning unrelated instructions to memory operations and branches, and at the use of the SMT solver, uninterpreted functions are converted to unique variables.

Code Generation
Warp divergence may be caused by various reasons, including the dynamic nature of the program execution, which is inconvenient to optimization, where the uniformity of threads matters for collaboration.Not only inactive threads, but an insufficient number of threads to constitute complete warps, raises corner cases in which original computation should be retained.Our shuffle synthesis handles both situations by adding dynamic checkers for uniformity.Listing 6 presents an example of the synthesis by PTXASW.Once all the emulation is finished, the results are collected and filtered to satisfy all the above-mentioned conditions.Then, PTXASW selects the possible shuffle for each load with the smallest shuffle delta ( ) and allows only the least corner cases.At the code generation, each source load instruction is extended to be accompanied by the mov instruction to prepare the source register (%source).The destination load is covered with the shuffle operation and a corner-case checker.First, we check if the thread has no source from the same warp (%out_of_range).Second, the ld .global .nc .f32 % f4 , [% rd31 +12]; // w0 (i -1 , j +1) /* ... */ ld .global .nc .f32 % f7 , [% rd31 +4]; // w0 (i -1 , j -1) ld .global .nc .f32 % f4 , [% rd31 +12]; mov .f32 % source , % f4 ; /* ... */ mov .u32 % wid , % tid .x ; rem .u32 % wid , % wid , 32; activemask .b32 % m ; setp .ne .s32 % incomplete , %m , -1; setp .lt .u32 % out_of_range , % wid , 2; or .pred % pred , % incomplete , % out_of_range ; shfl .sync .up .b32 % f7 , % source , 2 , 0 , % mask ; @ % pred ld .global .nc .f32 % f7 , [% rd31 +4]; Listing 6. Shuffle synthesis on Jacobi kernel (Upper is original and lower is synthesized code; variable declarations are omitted and the naming is simplified) PTXASW incompleteness of the warp (%incomplete) is confirmed with a warp-level querying instruction.In any case, the shuffle operation is performed at the position of the original load, shifting the value of the source register with the distance of the extracted shuffle delta.Finally, only the threads participating in an incomplete warp or assuming no source lane execute the original load under the predicate (%pred).When  < 0, the shfl instruction takes the .upoption and when  > 0, the .downoption is selected.If  = 0, just the mov instruction is inserted instead of all the synthesized code.In actual code, the calculation of %warp_id is shared among shuffles and set at the beginning of the execution to reduce the computational latency.
To preserve the original program characteristics, such as the register use, uniformity, and ILP, following ways of generation are avoided.We can produce the correct results even if shfl is predicated by %incomplete, but it often imperils the basic efficiency with an additional branch, which limits ILP.On the other hand, our code introduces only one predicate to each shuffle and does not leave any new branch in the resultant SASS code.Also, we do not use a select instruction for merging the results between shuffles and corner cases, because it would aggravate register pressure.The output predicate by shuffle poses execution dependency and provides the invalid status of inactive threads; thus, it is ignored.Moreover, we only create shuffles from direct global-memory loads and do not implement shuffles over shuffled elements for better ILP.

Experimental Methodology
We build PTXASW using Rosette [30], a symbolicevaluation system upon the Racket language.PTXASW is equipped with a PTX parser and runs the emulation of the parsed code while expressing runtime parameters as symbolic bitvectors provided by Rosette.Our shuffle synthesis is caused at code generation, which prints the assembler-readable code.We evaluate our shuffle mechanism with the NVHPC compiler [20] by hooking the assembler invocation and overwriting the PTX code before it is assembled.The NVHPC compiler accepts the directive-based programming models OpenACC and OpenMP to generate GPU code, which have no control over warp-level instructions.The emulation is also tested for GCC with OpenACC/OpenMP code and LLVM with OpenMP code, but these use a master-worker model to distribute computation across thread-blocks [15] and do not directly refer to the thread ID in each thread, so mainly ineffective results are obtained.Our synthesis is not limited to global-memory loads and works on shared memory (such as Halide [26]), but the performance is not improved due to the similar latency of shared-memory loads and shuffles.The NVHPC compiler utilizes the same style to translate both OpenACC and OpenMP codes written in C/C++/Fortran to PTX, hence supporting any combinations.For the evaluation, we use the KernelGen benchmark suite for OpenACC [18], shown in Table 2.Each benchmark applies the operator indicated in the benchmark name, to single or multiple arrays and updates different arrays.The benchmarks gameoflife, gaussblur, jacobi, matmul, matvec and whispering are two-dimensional, whereas others are three-dimensional, both having a parallel loop for each dimension, in which other loops might exist inside-except matvec, which features only one parallel loop.The thread-level parallelism is assigned to the innermost parallel loop and the thread-block level parallelism to the outermost.We show the total time of running the shuffle-synthesized kernel ten times on Kepler (NVIDIA Tesla K40c with Intel i7-5930K CPU), Maxwell (TITAN X with Intel i7-5930K), Pascal (Tesla P100 PCIE with Intel Xeon E5-2640 v3), and Volta (Tesla V100 SXM2 with IBM POWER9 8335-GTH).We use NVHPC compiler 22.3 with CUDA 11.6 at compilation, but due to environmental restrictions, run the programs using CUDA driver 11.4/11.4/10.0/10.2 for Kepler/Maxwell/Pascal/Volta, respectively.The compiler options in NVHPC are "-O3 -acc -ta=nvidia:cc(35|50|60|70),cuda11.6,loadcache:L1".To fully utilize computation, 2D benchmarks select 32768x32768 as their dynamic problem sizes and 3D compute 512x1024x1024 grids, except uxx1, which leverages 512x512x1024 datasets and whispering, where more buffers are allocated, computing over 8192x16384 data elements.To assess a performance breakdown, we prepare two other versions of PTXASW: NO LOAD and NO CORNER.The former eliminates loads that are covered by shuffles, whereas the latter only executes shuffles instead of original loads, without the support of corner cases.
The shuffle synthesis fails on four benchmarks.In matmul and matvec, the innermost sequential loop contains loads, but these do not have neighboring accesses along the dimension of the thread ID.The benchmarks sincos and vecadd do not have several loads sharing the same input array.

Evaluation
Figure 2 shows the speed-ups of benchmarks on each GPU with original code and PTXASW-generated code along with the NO LOAD and NO CORNER versions.The line plots provide the SM occupancy of each benchmark.Since there is no resource change other than the register use from the original execution, the occupancy rate is directly affected by the number of registers.The performance improvement on Kepler/Maxwell/Pascal/Volta is confirmed with 7/6/9/4 benchmarks showing up to 16.9%/132.3%/9.1%/14.7%performance improvement, respectively.We see performance degradation with Volta in the case where more than ten shuffles are generated.Other GPUs mostly gain better performance with such cases.With increased shuffle deltas, more corner cases are expected.Volta shows optimal efficiency when  ⩽ 1.5, while other GPUs benefit from the case of  = 2.5.For example, Maxwell attains the best performance with gaussblur ( = 2.5), although Volta's performance drops by half for the same case.The average improvement across all GPU generations is -3.3%/10.9%/1.8%/-15.2%for Kepler/Maxwell/Pascal/Volta, respectively.
Overall, the performance improvement by PTXASW is found when NO LOAD and NO CORNER have sufficiently better performance compared to the original and when the occupancy typically rises on Kepler/Maxwell and drops on Pascal/Volta.The average number of additional registers with NO LOAD/NO CORNER/PTXASW compared to the original is -6.4/-5.2/2.7 on Kepler, -6.6/-5.9/4.2 on Maxwell, -7.0/-5.9/3.8 on Pascal, and -6.4/6.8/9.2 on Volta.

Analysis
This section provides the performance detail of our shuffle synthesis on each GPU. Figure 3 shows the ratio of stall reasons sampled by the profiler for all the benchmarks.Those characteristics of computation appear as the results of the program modification (e.g.register use, shuffle delta) and the architecture difference (e.g.computational efficiency, cache latency).

Kepler
The Kepler GPU has long stalls on computational operations with each benchmark.The average execution dependency is 24.7% and pipeline busyness is 7.5% with the original.When we look at the memory-bound benchmarks such as gameoflife, gaussblur, and tricubic, NO LOAD significantly reduces the amounts of memory-related stalls.Especially, tricubic has 56.0 percentage points below memory throttles from the original to NO LOAD, yielding 2.53x performance.From NO LOAD to NO CORNER, the execution dependency increases by 4.0 percentage points and the pipeline busyness decreases by 1.6 percentage points on average.The performance degradation at NO CORNER with the memory-bound benchmarks is observed with the latency of the pipelines and the wait for the SM scheduler.PTXASW suffers from memory throttling and additional computation for the corner cases, which limit the improvement up to 16.9%.
The memory throttling and the additional computation bottlenecks suffered by PTXASW may be hidden if the shuffle operations reduce the original computation and communication into just one transfer among threads, functioning as a warp-level cache.Otherwise, there is a need to face a trade-off between the redundancy of operations and the efficiency on the architecture.On Kepler, both heavy computation and memory requests are imposed by the corner case.Therefore, in the general use of shuffles, the uniformity of calculation is crucial and it requires domain-specific knowledge.

Maxwell
There are two obvious compute-bound benchmarks: gameoflife and tricubic.For these, no improvement is perceived with NO LOAD, and there are no particular changes in occupancy or stalls throughout the four different versions.In summary, gameoflife experiences -0.1%/5.7%/6.2% lower performance and tricubic shows -1.6%/7.7%/15.4% lower performance with NO LOAD/NO CORNER/ PTXASW, respectively, compared to the original version.In other cases, memory dependency is dominant.However, the merit of NO LOAD is limited to gaussblur and lapgsrb, which experience large texture-memory latency of read-only cache loads, successfully replaced with shuffles by PTXASW.The texture stall was reduced from 47.5% to 5.3% in gaussblur and from 23.0% to 0.1% in lapgsrb from the original to PTXASW, attaining 132.2% and 36.9% higher throughput.Other benchmarks do not feature stalls that allow for clear performance improvement by NO LOAD.As it can be observed in Figure 3, the memory dependency stalls are maintained for most benchmarks, except for that of tricubic2, which shows 32.9 percentage points lower memory dependency and only 14.3% overall improvement with NO LOAD.Those values are mostly absorbed by the corner cases.
On the Maxwell GPU, only the texture stalls are improvable for efficiency in the tested cases.Since we observe a moderate overhead of the corner cases, our synthesis tool may enhance the overall performance.The memory-dependency stalls work as a good indicator of the memory utilization.If, in addition, a high execution dependency would exist, it would provide the warp-level shuffle optimization the opportunity to be beneficial to speed up the computation.

Pascal
Even more than in Maxwell, texture stalls are found in most benchmarks and those produce higher throughput with NO LOAD.Especially, gameoflife and tricubic, the compute-bound kernels on Maxwell, become memory intensive on Pascal and the performance increases by 5.9% and 5.4% with PTXASW.The unspecific latency ("Other") fills many parts of computation on Pascal.Further investigation shows that this mainly consists of the latency from register bank conflicts and the instructions after branching.With the optimization adding a predicate to check the activeness of the warp (@!incomplete) before the shuffle and generating a uniform branch, the ratio of this latency improves from 34.4% to 8.6% with PTXASW at gameoflife, obtaining 150.8% efficiency compared to the original.However, as mentioned in Section 5.2, it decreases the average relative execution time to 0.88x slowdown.Since the latency of the L1 cache is higher than that of one shuffle operation, the computation may be hidden by data transfers.Once the memory-dependency stall ratio For shuffle instructions to be beneficial, the execution should be less divergent and careful register allocation is recommended to maximize the thread utilization.

Volta
On Volta, most benchmarks become memory-bound and memory-intensive applications become sensitive to memory throttles.Nevertheless, the speed-up by NO LOAD is limited to up to 1.35x (gameoflife), due to the highly efficient cache mechanism.As argued in Section some of the benchmarks attain higher performance with NO CORNER than in the case of NO LOAD for the lower occupancy.Other than that, we observe performance degradation due to increased execution dependency for lapgsrb and tricubic with NO CORNER.Those further reduce the efficiency with PTXASW while featuring stalls for instruction fetching.Also, the memory dependency of tricubic develops a large latency for memory accesses with PTXASW even though the corner cases experience fewer loads.This leads to unstable speed-ups between 0.315x and 1.15x.The calculation through shuffles is expected to be effective depending on the utilization of communication, and the nonentity of warp divergence.Especially, as Volta shows minimal latency at each operation, the penalty of non-aligned computation becomes apparent and must be avoided by the algorithm.

Application Example
We also apply PTXASW for the compilation of CUDA benchmarks extracted from applications.We select three benchmarks that appeared as complex 3D stencil operations in [27]: hypterm, rhs4th3fort, and derivative, to run on the Pascal GPU.hypterm is a routine from a compressible Navier-Stokes mini-app [1].rhs4th3fort and derivative are stencils from geodynamics seismic wave SW4 application code [2].Each thread in the benchmarks accesses 152/179/166 elements over 13/7/10 arrays, respectively.We modify the execution parameters to execute at least 32 threads along the leading thread-block dimension and use the float data type.Since we saw in the prior section the overhead of long-distance shuffles, which generate many corner cases, we limited the shuffle synthesis to be | | ⩽ 1 and found shuffles only with | | = 1.
hypterms contains three kernels that work along different dimensions.In the kernel for the leading dimension, 12 shuffles are generated over 48 loads, producing 0.48% improvement.rhs4th3fort and derivative feature a single kernel each.rhs4th3fort experiences 2.49% higher throughput by PTXASW while placing 44 shuffles among 179 loads.For derivative, having 52 shuffles from 166 loads, PTXASW attains 3.79% speed-up compared to the original execution.

Related Work
Ever since warp-shuffle instructions were introduced during the Kepler generation of GPUs, these have been the subject of various lines of research.Early work described their manual use for specific computational patterns such as reduction operations [17] and matrix transposition [6].Other research described the use of warp-shuffle instructions in the context of domain-specific optimizations such as employing them as a register cache for stencil operations [5], or to replace memory access for Finite Binary Field applications [5].
Research on the automatic generation of warp-shuffle instructions has been explored.Swizzle Inventor [25] helps programmers implement swizzle optimizations that map a high-level "program sketch" to low-level resources such as shuffle operations.The authors meticulously design the abstraction of shuffles, synthesize actual code roughly based on algorithms found in previous literature, and attain enhanced performance while reducing the amounts of computation.Tangram, a high-level kernel synthesis framework, has also shown the ability to automatically generate warp-level primitives [13].Unlike the work presented in this paper, both of the above-mentioned efforts leverage domain-specific information to map computational patterns such as stencil, matrix transposition, and reductions to shuffle operations.
Recent code-generation techniques allow for obtaining optimal SIMD code generation.Cowan et al. [8] generate program sketches for execution on ARM processors, by synthesizing additional instructions, as well as input/output registers, to implement the shortest possible SIMD code of reduction.Unlike PTXASW, which uses an SMT solver to find the optimal shuffle deltas, this work runs a comprehensive search of multiple possible code versions; thus, the search space is exponential to the number of instructions.VanHattum et al. [31] attain faster execution on digital signal processors while employing equality saturation [28], a modern way of optimization that generates possible code as much as possible from a basic program according to the rules of term rewriting.They derive shuffles along with vector I/O and computation from sequential C code.Their intermediate code contains instructions in one nested expression and the shuffle operation only works for memory loads that appear as arguments of the same vector operation.Therefore, the code rewriting for shuffles assumes a top-down style where outer expressions have to be vectorized first, in order to vectorize inner expressions containing shuffled loads.While their technique may provide a powerful method to the implementation of libraries, irregular patterns such as corner cases found in HPC applications are out of scope.

Conclusion
This paper introduces symbolic emulation to compiling GPU code in order to discover hidden opportunities for optimization.We employ several languages, enabling OpenACC directives such as in C and Fortran, for the frontend to generate GPU assembly code.Then, our tool emulates the code upon symbols that substitute dynamic information.While pruning control flows to reduce the emulation time, we automatically find possible warp-level shuffles that may be synthesized to assembly code to bypass global-memory accesses.We apply this technique to a benchmark suite and complex application code showing results that improve multiple benchmarks on several generations of GPUs.We also provide the latency analysis across multiple GPUs to identify the use case of shuffles.

e r g e n c e g a m e o f l i f e g a u s s b l u r g r a d i e n t j a c o b i l a p g s r b l a p l a c i a n t r i c u b i c t r i c u b i c 2 Figure 2 .
Figure 2. Speed-up compared to Original.NO LOAD/NO CORNER produce invalid results

Figure 3 .
Figure 3. Stall breakdown in the order of Original/NO LOAD/NO CORNER/PTXASW from left to right for each benchmark increases due to replacing the texture stalls, Pascal may maintain the efficiency with the corner cases, resulting in speed-up in nine benchmarks.For shuffle instructions to be beneficial, the execution should be less divergent and careful register allocation is recommended to maximize the thread utilization.

Table
of PTXASW memory as a communication mechanism when data movement is not redundantly performed, so storing and synchronization are avoidable.In particular, latencies of L1 cache on Maxwell/Pascal are higher compared to Kepler/Volta, which integrate shared memory with L1 cache.Those allow the shuffle to be utilized as a register cache for performance improvement, but the engineering efforts in order to modify the fundamental parts of parallel computation are considerably high.

Table 2 .
The KernelGen benchmark suite.Lang indicates the programming language used (C or Fortran).Shuffle/Load shows the number of shuffles generated among the total number of global-memory loads.Delta is the average shuffle delta.Analysis is the execution time of PTXASW on Intel Core i7-5930K