APPy: Annotated Parallelism for Python on GPUs

GPUs are increasingly being used to speed up Python applications in the scientific computing and machine learning domains. Currently, the two common approaches to leveraging GPU acceleration in Python are 1) creating a custom native GPU kernel and importing it as a function that can be called from Python; and 2) using libraries such as CuPy, which provide pre-defined GPU-implementation-backed tensor operators. The first approach is very flexible but requires tremendous manual effort to create a correct and high-performance GPU kernel. While the second approach dramatically improves productivity, it is limited in its generality, as many applications cannot be expressed purely using CuPy's pre-defined tensor operators. Additionally, redundant memory accesses often occur between adjacent tensor operators due to the materialization of intermediate results. In this work, we present APPy (Annotated Parallelism for Python), which enables users to parallelize generic Python loops and tensor expressions for execution on GPUs by adding simple compiler directives (annotations) to Python code. Empirical evaluation on 20 scientific computing kernels from the literature, on a server with an AMD Ryzen 7 5800X 8-core CPU and an NVIDIA RTX 3090 GPU, demonstrates that with simple pragmas APPy generates more efficient GPU code and achieves significant geometric-mean speedups relative to CuPy (30× on average) and to three state-of-the-art Python compilers: Numba (8.3× on average), DaCe-GPU (3.1× on average) and JAX-GPU (18.8× on average).


Introduction
Python has gained increasing popularity due to its concise and flexible language features and a rich ecosystem for various application domains, such as computational science, data science and machine learning. Programs in such domains naturally exhibit abundant data parallelism, making them amenable to GPU acceleration. Two common approaches to accelerating Python programs on the GPU are 1) creating a custom native GPU kernel and importing it as a function that can be called from Python; and 2) using libraries such as CuPy, which provide predefined GPU-implementation-backed tensor operators. The first approach is very flexible, but programming the GPU is notoriously difficult [4,14,17], due to the inherent difficulty of parallel programming and the complex hierarchy of modern GPUs' hardware parallelism and memory systems. It generally requires significant expertise and manual effort to create a correct, high-performance GPU kernel. While the library operator-based approach dramatically simplifies programming and improves productivity, it is limited in its generality, as many applications cannot be expressed using only the pre-defined tensor operators (Listing 1 shows one such example). Additionally, data locality remains a major source of inefficiency for a sequence of operators, where intermediate results must be stored back to main memory and loaded again by the next operator, which can slow down the overall execution. There exist tensor operator compilers that can compile a sequence of operators and generate fused GPU kernels [5,15,18]; however, these tools/techniques are often specific to machine learning, and are not applicable to general Python programs with combinations of explicit loops and generic tensor operators.
In this work, we present APPy (Annotated Parallelism for Python), a Python-embedded parallel programming model and JIT compiler that allows users to express their parallel programs using loops, tensor expressions, or a mix of both, with annotated directives. Compared to existing GPU programming models, APPy has two distinct advantages: • In APPy, the user writes sequential loops targeting an abstract shared-memory multi-vector processor machine that exposes two layers of parallelism: synchronized vector processing (vectorization) and asynchronous multi-threading (parallelization). Users can

def group_by_sum(X, labels, centroids, M, N): ...
Listing 1. A kernel from the K-Means algorithm that groups rows from a matrix according to the label of each row. Such a kernel involves indirect memory access, and the same pattern cannot be expressed solely with tensor operators.
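The body of Listing 1 did not survive extraction. The following pure-Python sketch shows what a group-by-sum kernel of this shape plausibly computes (an assumption based on the caption, not the paper's exact listing): each row of X is accumulated into the centroid row selected by its label, an indirect, data-dependent write that pre-defined tensor operators cannot express.

```python
# Hypothetical reconstruction of the K-Means group-by-sum kernel:
# rows of X are summed into centroids[labels[i]], an indirect access.
def group_by_sum(X, labels, centroids, M, N):
    for i in range(M):                  # loop over the M rows of X
        c = labels[i]                   # indirect index: target centroid row
        for j in range(N):
            centroids[c][j] += X[i][j]  # accumulate row i into centroid c
    return centroids
```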
@appy.jit(auto_simd=True)
def group_by_sum(X, labels, centroids, M, N): ...
Listing 2. Parallelized version of Listing 1. This implementation utilizes both loops (lines 4-7) and tensor expressions (line 7). Besides parallelizing the for loop, the compiler also recognizes line 7 as a tensor expression, and will automatically generate a vectorized loop for it. Directive #pragma atomic indicates parallel reduction.
parallelize and vectorize a loop using OpenMP-like directives, parallel for and simd, respectively. • In addition to loops, users can also write tensor/array expressions, purely or combined with loops, and the compiler will automatically convert these array expressions into optimized loops.
Programming for the abstract machine model simplifies programming the GPU (the complex hierarchy of parallelism in modern GPUs is hidden), and makes the program portable for execution on any current or future hardware that fits the abstract model. In addition, compiling both loops and tensor expressions makes APPy general enough to cover a diverse spectrum of scientific programs while maintaining high productivity. Listing 2 shows the APPy-parallelized version of kernel group_by_sum, shown in Listing 1.
In summary, we present the design of a loop-oriented programming model targeting an abstract multi-vector processor machine (section 3), as well as a tensor-oriented programming model (section 4) with array expressions. We then present an implementation of APPy and its code generation algorithms (section 5). Finally, we evaluate the performance of APPy on 20 scientific kernels (section 6), and show that by adding simple pragmas users can obtain significant performance improvements with APPy for scientific Python programs on the GPU, compared to other state-of-the-art programming tools and compilers.

Abstract Machine Model
In APPy, parallelism is specified against an abstract machine: a multi-vector processor. Such a system consists of one or more vector architecture processors [10], where each processor is able to process either a scalar or a vector of data in vector registers in a SIMD style, as illustrated in Fig. 1. The maximum vector length is represented by a built-in variable appy.MVL, and all vector operations must be of equal or smaller length. One can apply lane-wise operations to a vector register, such as basic mathematical unary and binary operations, and cross-lane operations such as reductions. The machine also provides an appy.where(condition, x, y) instruction to select values from two vectors based on a condition. Besides this SIMD-level parallelism, the system consists of multiple such vector processors and is capable of executing multiple vector instruction streams concurrently in a MIMD style. The processors may communicate via shared memory, e.g. updating a memory location atomically. In this paper, we refer to the SIMD-level parallelism as vectorization and the MIMD-level parallelism as parallelization. Maximum parallelism is achieved when both parallelization and vectorization are utilized.
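The lane-wise select described above can be modeled in a few lines of pure Python. This is an illustrative sketch of the semantics of appy.where as the text describes it, not APPy's implementation:

```python
# Pure-Python model of the abstract machine's lane-wise select,
# appy.where(condition, x, y): per lane, take x where the condition
# holds and y otherwise.
def where(condition, x, y):
    # one result element per lane
    return [xi if c else yi for c, xi, yi in zip(condition, x, y)]
```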
The programmer only needs to think about this abstract machine model when programming with APPy; the compiler automatically compiles user code to actual GPU code and handles the underlying complexity. We note that when programming such an abstract machine, one can use regular control flow structures and array indexing and slicing, just as in regular Python programming. The only thing that is specific to the abstract machine is parallelism, e.g. specifying a loop as parallel, and processing data in a vector architecture style. The mathematical functions and operators applied to the data can be viewed as native instructions provided by the machine. Sections 3 and 4 present the programming interface of the loop-oriented model and the tensor-oriented model, respectively.
3 Loop-Oriented Programming Interface

Parallelization
A loop can be parallelized by annotating it with #pragma parallel for, where the end of the loop acts as a synchronization point. A vector addition example is shown in Listing 3. Each loop iteration is said to be assigned to a worker, and the number of workers launched is always equal to the number of loop iterations, unless directive #pragma sequential for is used, which launches only one worker that executes all iterations sequentially, e.g. due to loop-carried dependences. Each worker is scheduled to a single abstract processor and executes its instructions sequentially. A parallel for-loop must be a for-range loop, and the number of loop iterations must be known at kernel launch time, i.e. no dynamic parallelism.
Tensors within the parallel region must already be GPU tensors (data residing in GPU memory), e.g. created using cupy/torch/jax with data on the device. Such libraries also provide APIs to create a GPU tensor from a NumPy [9] array.
@appy.jit
def vector_add(A, B, C, N): ...
Listing 3. Parallelizing a for loop with APPy via #pragma parallel for. "#pragma ..." is a regular comment in Python, but is parsed and treated as a directive by APPy.
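The body of Listing 3 was lost in extraction; the following is a plausible reconstruction of a vector-addition kernel in this style (an assumption, not the paper's exact listing). Because a pragma is an ordinary Python comment, the sketch also runs as plain sequential Python, which is exactly the semantics the directive parallelizes:

```python
# Hypothetical body of the vector_add kernel from Listing 3.
def vector_add(A, B, C, N):
    # In APPy, the comment below is parsed as a directive; each of the
    # N iterations is assigned to one worker.
    #pragma parallel for
    for i in range(N):
        C[i] = A[i] + B[i]
```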

Vectorization
Although #pragma parallel for parallelizes a loop, maximum parallelism is achieved when the loop body is also vectorized, when applicable. APPy provides three ways to achieve vectorization: 1) use tensor/array expressions (the compiler generates a loop automatically; more details in section 4); 2) annotate a loop with #pragma simd, which divides the loop into smaller chunks and assigns each chunk to a worker; 3) manually strip-mine a loop and operate on vectors of size appy.MVL or smaller. The last approach is considered "assembly-level programming" of the abstract machine, while the first two approaches rely on the compiler to generate the "assembly code" (strip-mined loops). To handle arbitrarily sized input, APPy provides a convenient built-in function appy.vidx(i, step, bound=N) to create a vector of indices that start from i and are bounded by N, with maximum length step.
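The semantics of appy.vidx as described above can be sketched in pure Python. This is an illustrative model, not APPy's implementation:

```python
# Model of appy.vidx(i, step, bound): up to `step` consecutive indices
# starting at i, truncated at the bound so the loop tail of an
# arbitrarily sized input is handled correctly.
def vidx(i, step, bound):
    return list(range(i, min(i + step, bound)))
```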
Listing 4 shows the APPy implementations of vector addition and sparse matrix-vector multiplication (SpMV) using the parallel for and simd directives combined to achieve maximum parallelism. As with the parallel for directive, the simd directive relies on the programmer to check for dependences and guarantee correctness.
A loop that is not amenable to parallelization may still be vectorizable. One example is the j loop in the SpMV example, which has dynamic loop bounds.

Data Sharing
APPy does not require the programmer to manually specify whether each variable is private or shared. Instead, it enforces a syntactic difference between array and non-array variables, and uses simple rules to infer the scope of the variables. Array variables are always followed by a square bracket, such as A[vi]; the rest are non-array variables 1. Array variables inside the parallel region are always considered shared (and their data reside in the global memory of the GPU). Non-array variables defined within the parallel region are considered private to each worker. Read-only non-array variables are shared. APPy prohibits multiple reaching definitions of a non-array variable from both inside and outside the parallel region, which prevents writing into a shared non-array variable. To achieve such effects, the idiom is to make the variable an array of size 1 (thus it has global scope). An example is parallel reduction into a scalar, as shown in Listing 5.

Synchronization
The only synchronization across workers (loop iterations) supported is atomically updating a memory location. This can be achieved either by annotating a statement with compiler directive #pragma atomic or by using "assembly-level" programming of the abstract machine via the built-in instruction appy.atomic_<op>. Note that within a worker, no synchronization is necessary even if it works on multiple elements.

@appy.jit
def vector_sum(A, s, N): ...
Listing 5. Parallel reduction idiom in APPy. The reduction variable is an array of size 1. Directive simd divides the loop into smaller chunks. Each worker gets one chunk and performs a local reduction (utilizing the SIMD capability). Finally, each worker performs an atomic update to the final sum.
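A sequential pure-Python emulation of the Listing 5 idiom (illustrative; the chunk size and function shape are assumptions, not APPy's generated code): the reduction target is a size-1 array s with global scope, each worker reduces its chunk locally, and the partial sum is then folded into s[0], which on the GPU would be the atomic update.

```python
# Sequential emulation of the parallel reduction idiom.
def vector_sum(A, s, N, chunk=4):
    for start in range(0, N, chunk):                   # one chunk per worker
        local = 0.0
        for i in range(start, min(start + chunk, N)):
            local += A[i]                              # local (SIMD-style) reduction
        s[0] += local                                  # models appy.atomic_add
```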

Tensor-Oriented Programming Interface
In addition to loops, APPy also allows users to use tensor/array expressions, where the tensors can have arbitrary size. This often results in more natural and succinct programs compared to the loop-oriented version. Listing 6 shows a generic implementation of the softmax activation function [8], a fundamental building block of machine learning, using loops only and loops combined with tensor expressions, respectively. As can be seen, to perform a max or a sum reduction on a row, the generic model must create a loop that works on a limited number of elements at a time. In contrast, the tensor-oriented model allows the user to directly work with tensors of arbitrary size and dimensions, where the compiler automatically converts the tensor expressions into explicit loops. To disambiguate, we use the term "tensor/array assignment" to refer to operations performed on tensors of arbitrary size (as a high-level programming notation), and the term "vector statement" to refer to operations that can be directly executed by the abstract machine model, i.e. the size of the vector is at most the maximum vector length of the machine.
The tensor-oriented model works by writing and annotating tensor expressions. However, a tensor expression must be expressed in sliced index notation before it can be annotated.

Sliced Index Notation
We introduce sliced index notation as a form of tensor expression that simplifies compiler transformations while remaining legitimate Python/NumPy syntax (no domain-specific language is used). In sliced index notation, each dimension of every tensor must be expressed as a slice with at least the upper bound explicitly specified. This notation draws inspiration from the Einstein summation convention [23], extended to support operations beyond multiplication and addition. Here are some examples: • Matrix addition:

Listing 6. Two equivalent implementations of generic softmax, using loops only and loops combined with tensor expressions, respectively. The APPy compiler automatically converts tensor expressions into explicit loops with vectorization.
The sliced index notation requires the user to define a variable (referred to as a dimension variable) for each unique dimension within the expression, e.g. M and N. The expression is then rewritten to use slicings, such as ":M", to represent dimensions in every tensor. A scalar dimension can be denoted by a scalar index. It is crucial to note that distinct dimensions must be represented using different variables, even if their runtime values are identical, as in the case of a square matrix. Furthermore, tensor assignments can reference different slices of the same dimension, provided that the lengths of the slices are consistent, e.g. in stencil computations.
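Because sliced index notation is ordinary Python slicing with explicit bounds, a one-dimensional instance of it executes directly even on plain Python lists. The following saxpy-style sketch (hypothetical example, not from the paper) shows the form of a tensor assignment in sliced notation:

```python
# A 1-D tensor assignment in sliced index notation:
#     y[:N] = alpha * x[:N] + y[:N]
# Written against plain lists for illustration; APPy itself operates on
# GPU tensors and compiles such assignments to vectorized loops.
def saxpy(alpha, x, y, N):
    y[:N] = [alpha * x[i] + y[i] for i in range(N)]
```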
The tensor-oriented model supports only three classes of tensor operations: element-wise (both unary and binary), broadcast, and reduction operations (all examples above belong to these categories). Operations that do not belong to these categories, such as sorting, squeezing (removing zeros and compacting), matrix inversion, etc., are not supported.

Tensor-Oriented Pragmas
A tensor expression in sliced index notation can be annotated. The annotation process enumerates every distinct dimension slice in the expression and optionally specifies a list of properties for it 2. The syntax is {slice}=>{properties}, and the following properties are currently supported (multiple properties are comma-separated): • parallel: the dimension can be parallelized, i.e. translated to a parallel for loop. The default value is False. • simd: the dimension can be executed in a SIMD fashion. The default value is False. • reduction/reduce: the dimension is reduced. The default value is False. • le(constant): indicates the dimension is less than or equal to a small constant, which may enable more optimizations, such as caching a small tensor in registers.
The tensor-oriented model also comes with a new compiler option, auto_simd, which automatically adds a simd property to the last dimension. This transformation is always legal due to the vectorized semantics of tensor expressions. Listing 7 shows example pragmas used for gesummv and jacobi_2d from polybench. Note that the order in which the slices appear in the pragma (from left to right) determines the loop order of the generated nested loop. For instance, the annotation 1:M-1=>parallel 1:N-1=>parallel would imply that 1:M-1 corresponds to the outer loop and 1:N-1 corresponds to the inner loop.
@appy.jit(auto_simd=True)
def gesummv(alpha, beta, A, B, x, y, tmp, M, N): ...
Listing 7. Annotated versions of gesummv and jacobi_2d, a linear algebra kernel and a stencil kernel from polybench. Within each tensor assignment statement, operator fusion is performed automatically during code generation to improve locality.

Implementation Overview
The APPy compiler first performs a sequence of high-level analyses and transformations that rewrite the input program into "assembly-level" APPy code targeting the abstract multi-vector processor machine, where the program loads and processes data in strip-mined loops and "vector registers". It then compiles this "assembly code" to actual GPU code that can run on the GPU. Specifically, APPy generates a Triton kernel [24] and its corresponding host code for each top-level parallel for-loop. The overall compilation flow is described in Alg. 1. For simplicity, we use code examples below to highlight the key points of these transformations.
Algorithm 1. Top-level compilation process: given the function object to be compiled, the generated code is appended to a module, the module is dynamically loaded, and the new function is returned from it.

High-Level APPy To "Assembly" APPy Transformation
In the interest of brevity, we highlight a few key transformations in this section using code examples.

Creating Temporary Arrays.
In a tensor assignment, if the tensor being stored also appears on the right-hand side with non-identical slices, APPy creates a new temporary array and splits the original statement into two assignments: the computed values are first stored in a temporary array (first assignment) and then stored in the final destination (second assignment), to ensure correctness when generating a loop from the tensor expression 3. Listing 2b shows the result of this transformation from 2a.
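The need for the temporary can be seen in a one-dimensional analogue (a made-up example in the spirit of the transformation, not the paper's Listing 2b): in an assignment like u[1:N-1] = u[0:N-2] + u[2:N], the stored tensor u appears on the right with shifted slices, so a naively generated in-place loop would read elements it has already overwritten.

```python
# Split form of u[1:N-1] = u[0:N-2] + u[2:N]: compute into a temporary
# first, then copy back, so shifted reads never see overwritten values.
def stencil_with_temp(u, N):
    tmp = [u[i - 1] + u[i + 1] for i in range(1, N - 1)]  # first assignment
    for i in range(1, N - 1):                             # second assignment
        u[i] = tmp[i - 1]
    return u
```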

Inserting Synchronizations.
APPy maps each vector statement to a thread block for flexibility and performance considerations. However, a thread block does not naturally have vectorized execution semantics and may contain threads/warps that execute asynchronously. Therefore, APPy inserts synchronization statements after certain memory operations to ensure the sequential execution of the vector statements.
When inserting synchronizations, APPy performs an optimization specific to its programming model: synchronizations within loops generated from element-wise and broadcast tensor assignments are unnecessary. Therefore, at code generation time, synchronizations are inserted before tensor assignments are converted to loops. Listing 2c shows the code state after inserting synchronizations.

(b) Rewrite and split the array assignment if, in the original assignment, the tensor being stored also appears on the right-hand side with a different slice. This ensures correct code generation when the array assignment is converted to a loop.

Lowering Tensor Assignments To Loops.
A tensor assignment statement is converted to a loop nest by the compiler. An index variable and a loop are created for each unique dimension, and the slicings in the original statement are replaced by these new index variables. The depth of the loop nest equals the number of unique dimensions in the tensor assignment, with the original statement placed in the innermost loop. Listing 2d shows the transformation from 2c.
5.1.4 Reduction-Bounded Operator Fusion.
When converting a tensor expression into loops, the compiler automatically performs what we call reduction-bounded operator fusion. This type of fusion fuses a sequence of element-wise operators and stops when a reduction operator is encountered, which creates a fusion boundary. The reduction operator itself is the last operator in the fusible group. Listing 8 shows the compiler-generated code for a generic matrix-vector multiplication with operator fusion.
## Before transformation
@appy.jit(auto_simd=True)
def kernel(alpha, A, x): ...

Reduction-bounded fusion is illustrated in Fig. 3, where each circle represents a data element, and each arrow signifies an operation as well as a data dependence. A horizontal arrow represents an element-wise operation, a gather arrow represents a reduction operation, and a scatter arrow represents a broadcast operation. The same operation is applied vertically to different elements. op1 and op4 are element-wise operations, while op2 and op3 are reduction and broadcast operations, respectively. Different colors represent the worker assignments of the data elements.
In this figure, op1 and op2 can be safely fused for each worker, as can op3 and op4. Fusing op1 and op2 means that op2 is applied to a data element immediately after op1, without waiting for all data elements to have undergone op1. However, fusing op3 together with op1 and op2 would not be legal, since op3 requires op2 to be completed on all data elements first. Figure 3 also implies that when a worker is mapped to asynchronously executing hardware threads, the threads can work on their data elements concurrently and only need to synchronize when a reduction operator is encountered.
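Reduction-bounded fusion for a scaled matrix-vector product y = alpha * (A @ x) can be sketched in pure Python (illustrative of the fusion described above; this is not APPy's generated Triton code): the element-wise multiply is fused into the same loop as the row reduction, so no intermediate row-sized tensor is materialized.

```python
# Fused loop for y = alpha * (A @ x): the element-wise operator
# alpha * A[i][j] * x[j] feeds the row reduction directly, and the
# reduction ends the fusible group.
def fused_scaled_matvec(alpha, A, x):
    M, N = len(A), len(x)
    y = [0.0] * M
    for i in range(M):
        acc = 0.0
        for j in range(N):
            acc += alpha * A[i][j] * x[j]  # element-wise ops fused into reduction
        y[i] = acc                         # reduction boundary: fusion stops here
    return y
```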

Triton Background
After all the passes in the high-level transformation phase are performed, the code is at the "assembly" level and is ready to be lowered to backend code. In our implementation we choose to generate Triton code instead of CUDA or OpenCL code, because Triton's programming model matches well with our abstract machine model and simplifies our implementation. Nonetheless, it is entirely possible to generate CUDA or OpenCL from APPy.
First, we provide a brief introduction to the Triton language and compiler before describing Triton code generation. Triton [24] is a GPU programming language and compiler that aims to make GPU programming more productive than CUDA. Based on Python, Triton provides a single-program-multiple-data (SPMD) programming interface, where the user specifies the behavior of a thread block as a whole, instead of individual threads, and the Triton compiler automatically distributes work across threads and generates SIMT-style code. This paradigm simplifies thread-block-scope collective operations such as reduction, where the programmer just makes a high-level API call. Listing 9 illustrates a Triton kernel that performs a dot product of two vectors. The function kernel is the launch function that specifies the block size and grid size and invokes _kernel (lines 15-16). Function _kernel is the Triton GPU kernel decorated with triton.jit, which the Triton compiler JIT-compiles to native code for the target machine, e.g. PTX code for NVIDIA hardware. Within the kernel, tl.load loads a block of data from global memory into registers to be processed by the thread block. The block size BN is user-determined. Line 10 performs a local dot product (data are in registers) to get the partial results, and line 11 performs a global reduction via tl.atomic_add. It is worth noting that Triton (at present) still requires the programmer to insert thread synchronizations 4 if two statements might have cross-thread data dependencies.
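The structure of the Listing 9 dot-product kernel can be emulated block by block in pure Python (an illustrative sketch: the block size BN, chunking, and accumulation into out[0] mirror tl.load, the local dot product, and tl.atomic_add, but none of this is Triton code):

```python
# Block-wise dot product mirroring the Triton kernel's structure:
# one "thread block" per chunk, a local dot on the loaded block,
# then a global reduction into out[0].
def dot_product(a, b, out, N, BN=4):
    for pid in range(0, N, BN):        # one program/thread block per chunk
        end = min(pid + BN, N)
        partial = 0.0
        for i in range(pid, end):      # local dot on the loaded block
            partial += a[i] * b[i]
        out[0] += partial              # models tl.atomic_add's global reduction
```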

"Assembly" APPy to Triton Code Generation
When generating the backend code, for each top-level parallel for-loop a Triton kernel function is generated, together with its corresponding host code. The parallel for-loop can be a nested parallel loop, in which case a multi-dimensional thread block is launched. The host code includes creating an N-dimensional grid, where N is the nesting level of #pragma parallel for, and launching the kernel by passing the actual arguments, including the block size (appy.MVL), which is automatically determined by APPy given a hardware platform and a kernel. For our implementation and evaluation on an NVIDIA GPU, APPy's backend code includes auto-tuning code that automatically selects the best appy.MVL value (the one that leads to the best runtime) from 4 possible options, 128, 256, 512 and 1024, for all applicable kernels. The best block size is cached for later invocations of the kernel. To generate a device function, an empty function decorated with @triton.jit is first created. The code generator then traverses the loop body of the loop being compiled, converts the statements into corresponding Triton statements, and inserts these Triton statements into the kernel function body. This includes converting array indexings into Triton load and store statements, etc.
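The select-and-cache auto-tuning described above can be sketched as follows (a hypothetical helper, not APPy's API: the function name, cache, and timing scheme are all assumptions): time the kernel once per candidate appy.MVL value and cache the fastest choice for later invocations.

```python
import time

_best_mvl = {}  # cache: kernel name -> best block size found so far

def autotune(kernel, args, candidates=(128, 256, 512, 1024)):
    """Pick the candidate block size with the lowest measured runtime."""
    key = kernel.__name__
    if key not in _best_mvl:
        timings = {}
        for mvl in candidates:           # run once per candidate block size
            t0 = time.perf_counter()
            kernel(*args, mvl)
            timings[mvl] = time.perf_counter() - t0
        _best_mvl[key] = min(timings, key=timings.get)
    return _best_mvl[key]                # cached on subsequent calls
```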

Performance Evaluation
We compare APPy's performance with five other popular frameworks: NumPy (CPU baseline), Numba (CPU compiler) [13], CuPy (GPU baseline) [16], JAX-GPU (GPU compiler) [6] and DaCe-GPU (GPU compiler) [26]. We evaluate the performance of APPy using 20 kernels from the NPBench benchmarking framework [25]. Specifically, our evaluation encompasses all benchmarks in NPBench that are under 40 lines of code, have parallelism present, do not use any data types/functions unsupported by APPy, and pass the validation test, which resulted in a set of 20 kernels.
NPBench comprises a set of NumPy code samples representing a wide variety of HPC applications and includes benchmarking tools to compare the performance of different NumPy-accelerating compilers (e.g. Numba). For each application, NPBench has a baseline NumPy implementation written in idiomatic Python/NumPy, which may use explicit loops, tensor operators, or both. Additionally, it includes implementations derived from the NumPy version using various compiler frameworks. As of now, it already includes compiler frameworks such as Numba and DaCe [26] (with both CPU and GPU backends), and we added APPy and JAX as additional frameworks.

Code Adaptation
Table 1 uses source lines of code (SLOC) as a metric to measure how much code adaptation is required for each framework.The SLOC count is obtained using the pygount tool.The code adaptation process for each framework is then summarized in the following sections.
6.1.1 Code Adaptation for APPy.
The APPy versions of these kernels are derived from the baseline NumPy version with three modifications in most cases: 1) the tensor expressions are updated to use sliced index notation (e.g. A[:] + B[:] becomes A[:N] + B[:N]); 2) each parallel loop and tensor expression is annotated with pragmas, and the le clause (small tensor optimizations) is used when applicable; 3) the kernel function is decorated with @appy.jit. We do not annotate matmul operators used in the kernels and instead delegate them to library implementations.
For three kernels (softmax, spmv and azimint_naive), we also rewrite some tensor operators using loops due to limitations of our tensor-oriented model (e.g. multiple reduction operators will not be fused by our current operator fusion approach, whereas a loop-based implementation fuses them). Additionally, reduction sub-expressions are extracted into separate statements due to current implementation limitations, which causes a slight increase in additional lines of code for benchmarks such as gesummv and cholesky in Table 1.

Code Adaptation for JAX.
Due to the language design constraints of JAX, creating the JAX versions of the kernels requires more significant changes to the original NumPy versions compared to other frameworks. These changes arise from three major JAX restrictions: 1) in-place array updates must use the at idiom; 2) control flows, such as for loops, must be rewritten into structured forms using jax.lax.fori_loop; otherwise, compilation takes too long (e.g., hours for large loops) because JAX unrolls Python loops by default; 3) array slices with dynamic sizes are not supported. When creating the JAX versions of the kernels, we attempt to convert as many Python loops as possible to JAX structured loops and rewrite array slices with dynamic sizes using where if the slice size depends on the loop index variable.

Code Adaptation for Other Frameworks.
Code adaptation for CuPy is the most straightforward, as it serves as an almost seamless drop-in replacement for NumPy. The only necessary change is to replace import numpy as np with import cupy as np. For Numba, two changes are required: 1) the kernel function is decorated with @numba.jit; 2) if a loop can be parallelized, it is modified to use numba.prange instead of range. Similar to Numba, DaCe kernels are also decorated with @dace.program for JIT compilation. Additionally, DaCe mandates that function arguments be type-annotated, including their shapes, which must be either integer constants or symbols, e.g. dace.float64[N,M]. These shape variables are defined similarly to APPy dimension variables. Notably, DaCe automatically discovers for loops that can be parallelized.

Performance Results
We evaluate the performance using NPBench's benchmarking functionality, with the same input sizes as in the NPBench paper [25], such that each benchmark runs for approximately 1 second. By default, NPBench runs each benchmark 10 times and reports the median runtime 5. The test machine used for the results presented in this subsection has a Ryzen 7 5800X 8-core CPU and an RTX 3090 GPU 6. The operating system is Ubuntu 18.04.6 LTS and the CUDA version is 12.1. We use CPython version 3.9.16 as part of an Anaconda 3 environment. The NumPy version is 1.26.2, Numba version is 0.58.1, CuPy version is 12.2.0, JAX version is 0.4.23, DaCe version is 0.14.4 and Triton version is 2.1.0 (installed via torch 2.1.2).
Fig. 4 shows the performance results. The NumPy column on the right shows the absolute execution time, while the columns for the other frameworks show their speedup (up arrow) or slowdown (down arrow) relative to NumPy. On average (using geometric means), APPy achieves 3.1× speedup over DaCe-GPU (up to 11×), 18.8× speedup over JAX (with

Table 1. The amount of code adaptation measured using source lines of code (SLOC) as a metric. The column labeled "Contains Loops" indicates whether the benchmark contains explicit loops in its NumPy reference implementation, and "(P)" denotes that at least one loop is parallelizable. The column "Contains Tensor Expressions" indicates whether the NumPy reference implementation employs any (non-scalar) tensor expressions. The "NumPy" column displays the number of lines of code for the benchmark, while all other columns indicate their additional SLOC relative to the NumPy version. Note that pragma lines are included in the SLOC count for APPy.

6.2.1 Comparison with DaCe-GPU.
Upon inspecting the generated CUDA code from DaCe, we identified two potential inefficiencies: 1) a sequential loop is generated when writing to an array slice, for instance, C[i,:i+1] = ...; 2) parallel reduction is not supported, which is utilized in the kernels go_fast and azimint_naive. APPy, with appropriate annotations, effectively parallelizes both cases. For a subset of benchmarks exclusively using operators, such as gesummv, floyd_warshall, and gemver, DaCe is also slower than APPy. This difference may be due to APPy being more effective at fusing these operator patterns. Moreover, in benchmarks like softmax and cholesky, the APPy implementations benefit from the le pragma, leveraging prior knowledge of small dimensions to enhance data locality. On the other hand, DaCe and APPy exhibit comparable performance for stencil kernels, including jacobi_2d, heat_3d, and hdiff.

6.2.2 Comparison with JAX-GPU. JAX is significantly slower than APPy or DaCe for kernels written with explicit loops that can be parallelized, such as covariance, npgofast, spmv, symm, and syrk, among others. In JAX, a fori_loop is compiled but not parallelized, even if it is parallelizable. For a number of benchmarks written solely using operators, JAX is even slower than CuPy, despite the fact that it compiles and fuses operators. This could be due to sub-optimal code generation of the fused operators in JAX.
In the case of spmv, JAX is > 100× slower than the NumPy baseline because the original NumPy reference implementation uses arrays of dynamic shapes (depending on the row), which has to be rewritten into an implementation using where operators in JAX, since dynamic shapes are not supported by JAX. Further, the where-based implementation has to load the entire data and column arrays in every iteration and then select only a few elements. This approach is highly inefficient and fails to leverage the advantages of sparse representations. However, JAX demonstrates significant performance improvement over all other frameworks for azimnaiv. This could be attributed to the fact that azimnaiv presents additional optimization opportunities when the outer loop is unrolled 7, for instance, allowing the re-computation of a sub-expression to be saved.
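The difference between the two formulations can be sketched in plain NumPy (function and variable names are illustrative; this is not the benchmark code). The CSR-style version reads only each row's own nonzeros via a dynamically shaped slice, while the where-based version must scan the full data array for every row:

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """CSR-style spmv: each row slice data[lo:hi] has a
    dynamic shape that depends on the row."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y

def spmv_where(indptr, indices, data, x, nrows):
    """where-style spmv with only static shapes: every row loads
    the entire data/indices arrays and masks out other rows'
    entries -- the pattern a JAX port is forced into when
    dynamic shapes are unsupported."""
    nnz = np.arange(len(data))
    y = np.zeros(nrows)
    for i in range(nrows):
        mask = (nnz >= indptr[i]) & (nnz < indptr[i + 1])
        y[i] = np.sum(np.where(mask, data * x[indices], 0.0))
    return y

# 2x3 sparse matrix [[1, 0, 2], [0, 3, 0]] in CSR form
indptr = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
data = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 1.0, 1.0])
```

Both functions compute the same product, but the where-based variant does work proportional to the total number of nonzeros for every row, rather than only that row's nonzeros, which mirrors the >100× slowdown discussed above.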

6.2.3 Comparison with CuPy.
The CuPy versions of the kernels often still require explicit loops, which execute sequentially in the Python interpreter, while with APPy such loops are parallelized when possible. One might wonder why CuPy is even slower than NumPy, which has identical code but uses CPU-based implementations instead of GPU-based implementations. This is because each loop iteration typically processes relatively small-sized data in our input setting. Due to the small data size, processing it on the CPU isn't necessarily slower than on the GPU. When the program can be entirely expressed using tensor operators, such as in the case of gemm, gemver, gesummv, and softmax, CuPy typically achieves significant speedup over NumPy.

7 JAX unrolls Python loops by default.

6.2.4 Comparison with NumPy and Numba. GPUs are known to be more efficient than CPUs when it comes to data parallelism. NumPy runs the code sequentially in the Python interpreter, although each operator is backed by an optimized native code implementation; still, it is limited by the sequential execution and the interpreter overhead. Compared to NumPy, Numba applies Python-to-native-code compilation, automatic optimizations and parallelization (with prange), but it parallelizes the code only on the CPU. In contrast, APPy compiles the kernel to native GPU execution. However, we note that for trisolv, APPy is significantly slower than NumPy (3.1× slower) and Numba. In fact, all the frameworks employ the same parallelization strategy for trisolv: the outer loop is sequentialized, and the loop body contains a dot product that exhibits parallelism and gets parallelized. In the case of APPy, the outer loop is annotated with #pragma sequential for to ensure that only one worker thread is launched. This limited parallelism within one worker thread could be one of the reasons why APPy is 3.1× slower than the NumPy baseline for trisolv.
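As a concrete sketch of this parallelization structure, consider forward substitution written as a sequential outer loop whose body is a parallelizable dot product. The pragma comment below follows the #pragma sequential for spelling quoted above; since pragmas are ordinary comments, the snippet also runs as plain Python (the kernel itself is an illustrative stand-in, not the benchmark source):

```python
import numpy as np

def trisolv(L, b):
    """Forward substitution: the outer loop carries a dependence
    (x[i] needs x[:i]) and stays sequential; the inner dot product
    is the only parallelizable work per iteration."""
    N = L.shape[0]
    x = np.zeros(N)
    #pragma sequential for
    for i in range(N):
        # L[i, :i] @ x[:i] is the part that gets parallelized
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

# Well-conditioned lower-triangular test system
L = np.tril(np.random.rand(8, 8)) + 8.0 * np.eye(8)
b = np.random.rand(8)
x = trisolv(L, b)
```

With only one dot product's worth of parallelism available per outer iteration, a single GPU worker has little to exploit, which is consistent with the slowdown reported above.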

6.3 Additional Performance Results
For completeness, we performed an additional performance evaluation on a different machine consisting of an AMD EPYC 7502 32-Core CPU and an NVIDIA A100-PCIE-40GB GPU. The package versions and benchmarking methodologies are identical to the previous evaluation reported in Section 6.2. DaCe-GPU is excluded from this additional evaluation due to compilation errors on this platform. There is also one data point missing for Numba due to a runtime error encountered in that case.
Fig. 5 shows the performance results. In general, the relative speedup of the GPU frameworks over the CPU frameworks roughly doubles compared to the results in Fig. 4. Notably, JAX has improved performance over APPy for floydwar, fdtd_2d and heat3d. This could be due to the fact that JAX's code generation is better optimized for the A100 hardware than for the RTX 3090. However, the general performance trends are still similar to those of Fig. 4.

7 Related Work

7.1 SIMT/SPMD-Based GPU Programming

General purpose GPU programming can be classified into 1) explicit parallel programming and 2) compiler directive-based programming. The former traditionally employs an extended version of the C/C++ language that enables expressing parallelism in the style of single program multiple data (SPMD); two prominent programming languages are CUDA [14] for NVIDIA hardware and OpenCL [22], which is more portable and widely supported among vendors. Combined with hardware-managed SIMD execution, this programming paradigm is referred to as single instruction multiple thread (SIMT) [14], where the same piece of user code is executed by multiple threads in parallel and threads within a thread block can communicate via on-chip shared memory. SPMD combined with hardware SIMD enables a highly flexible programming interface and high performance at the same time, but this low-level GPU programming is inherently challenging, due to the complexity of managing thread synchronization and memory hierarchy optimization. Numba [13] supports GPU programming by directly exposing the parallel execution model of the hardware in a way similar to CUDA-C and OpenCL, which still requires explicit parallel programming, although embedded in Python. The Triton [24] language and compiler is a more recent effort in the realm of general purpose GPU programming. Its goal is to provide a simplified programming interface compared to CUDA or OpenCL, by letting the compiler automatically manage intra-block thread scheduling, memory coalescing
and shared memory management. However, Triton still requires the programmer to manually partition the computation onto different program instances, select block sizes and launch the GPU kernels. In addition, optimizations such as operator fusion are still manual.
None of these approaches parallelize a sequential Python loop directly and support array/tensor operations as APPy does.

7.2 Directive-Based GPU Programming
Two popular directive-based programming models for GPUs are OpenMP [3] and OpenACC [21], which are based on C/C++/Fortran. OpenMP originally targeted only multicore CPUs, but has introduced new pragmas to target GPUs since OpenMP 4, while OpenACC was originally designed for GPUs but supports CPUs as well. They simplify the programming effort compared to CUDA and OpenCL by allowing sequential loops to be annotated with pragmas, from which the compiler generates the final parallel code. However, they still directly expose the GPU's complex hierarchy of parallelism, and as a result, one has to grasp the GPU hardware execution model and use a complex set of directives to achieve maximum parallelism. [19] proposes a unified SIMD primitive coupled with the Kokkos ecosystem that allows intrinsics-based vectorization on CPUs and GPUs. A feature called logical vector length (LVL) is also proposed to write code without considering the underlying physical vector length. This is similar to the maximal vector length (MVL) in APPy. But the SIMD primitive in [19] is still built upon Kokkos' existing complex parallel programming concepts and requires the programmer to manage teams, threads, etc. In contrast, APPy's directives are much higher level and let the programmer write just regular sequential code.

7.3 Library-Based GPU Acceleration
CuPy [16] is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. Its interface is highly compatible with NumPy and SciPy, and in most cases it can be used as a drop-in replacement. In addition to providing NumPy-compatible operators, CuPy also provides a simple interface to quickly write custom element-wise and reduction CUDA kernels. Custom generic CUDA kernels can also be imported using CuPy's RawKernel module, which compiles and wraps CUDA C++ code into a Python function. PyTorch [18], TensorFlow [1] and JAX [6] are among the most popular machine learning frameworks that provide CuPy-like generic tensor computation APIs as well as machine learning building blocks. Legate [2] (cuNumeric) is another framework that aims to provide a distributed and accelerated drop-in replacement for NumPy. It supports both single-node CPU/GPU execution and transparent distributed acceleration across multi-node machines.
7.4 High-Level Language to GPU Compiler

jit4GPU (implemented under unPython) [7] is an early effort (2010) that just-in-time compiles Python/NumPy code to both CPU code (with OpenMP) and GPU code (AMD hardware only). It also takes an annotation-based approach and requires the programmer to annotate parallel loops with prange. However, the paper does not document how it performs thread-level parallelization, nor does parallel reduction appear to be supported. In terms of memory hierarchy optimization, jit4GPU does not generally perform operator fusion, due to lack of dependence analysis. Loo.py [12] employs a domain-specific language similar to the sliced index notation to express parallel kernels, and also provides schedules to specify code transformations. However, it does not provide a general programming interface to express more general programs, such as parallel reduction with indirect memory access. ALPyNA [11] proposes an automatic Python loop parallelization approach that splits the compilation and dependence analysis into an ahead-of-time (AOT) stage and a just-in-time (JIT) stage. However, its performance evaluation does not demonstrate consistent performance improvement over Numba (it can be even 3× slower than Numba). In addition, it targets only loops, not tensor expressions. DaCe [26] compiles and parallelizes Python code for both CPU and GPU execution, using a stateful dataflow multigraph (SDFG) data-centric intermediate representation, on which automatic or manual transformations can be performed. Parallel loops are either manually specified with a dace.map iterator, or automatically inferred by a transformation pass via dependence analysis. However, we have observed that certain parallelizable patterns are not parallelized by DaCe, which could lead to sub-optimal performance. AutoMPHC [20] proposes an approach to automatic ahead-of-time (AOT) parallelization and optimization of sequential Python programs for execution on distributed heterogeneous platforms, based on extensions to the polyhedral framework that unify user-written loops and tensor operators. However, its intra-node parallelization strategy for explicit loops is limited in that an equivalent pre-defined tensor operator must exist in order for a loop to be parallelized, nor can it parallelize loops with indirect memory accesses.

8 Conclusion
In this work, we present APPy, which allows users to parallelize their sequential Python loops and tensor assignments on GPUs using compiler directives. We present the design and implementation of APPy, including code generation and automatic compiler optimizations. We evaluate the performance of APPy using 20 kernels from scientific computing and demonstrate significant speedup over the state-of-the-art library CuPy (30× on average) and three Python compilers: Numba (8.3× on average), DaCe-GPU (3.1× on average) and JAX-GPU (18.8× on average).

Figure 1. An illustration of the abstract machine model that exposes two layers of parallelism: multi-processor parallelization and vectorization within each processor.

Figure 2. Some key steps of the high-level transformations using the same example.

Figure 3. Reduction-bounded fusion illustration. Vertically, workers (denoted by different colors) can work on their own data elements concurrently, and horizontally a sequence of operations (such as op1 and op2) can be fused, until a reduction operation like op2 is encountered, which ends a fusible group (itself included) and demands thread synchronization (the dotted line).
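A minimal Python/NumPy model of this fusion boundary (the op names are illustrative, not the compiler's internal names): op1 is elementwise and can be fused with the reduction op2 into a single pass over the data, avoiding the materialized intermediate.

```python
import numpy as np

def unfused(a, b):
    t = a * b            # op1: elementwise, materializes an intermediate
    return np.sum(t)     # op2: reduction, re-reads the intermediate

def fused(a, b):
    # One pass: op1 and op2 are applied per element, with no
    # intermediate array; on a GPU each worker would accumulate
    # its own elements and synchronize only at the final reduction.
    acc = 0.0
    for ai, bi in zip(a, b):
        acc += ai * bi
    return acc

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
```

The reduction ends the fusible group: any operation consuming the reduced value must wait for the synchronization, so it cannot join this single pass.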

def kernel(a, b, c, N):
    BN = 512  # User-determined data block size
    grid = (N + BN - 1) // BN
    _kernel[grid](a, b, c, N, BN)

Listing 9. A Triton kernel that performs a dot product of vectors a and b. The result is stored in c. _kernel is the Triton GPU kernel, launched on the last line.

Figure 4. Performance results on a set of NPBench benchmarks. All frameworks show their speedup relative to NumPy, except NumPy itself, which shows its absolute execution time in seconds. Up arrows indicate performance improvement, and down arrows indicate performance slowdown.

Figure 5. Additional performance results on the same set of benchmarks as Fig. 4, but using a different server with an AMD EPYC 7502 32-Core CPU and an NVIDIA A100-PCIE-40GB GPU. All frameworks show their speedup relative to NumPy, except NumPy itself, which shows its absolute execution time in seconds. Up arrows indicate performance improvement, and down arrows indicate performance slowdown.