GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. In this work, we leverage GPUs to accelerate FHE, capitalizing on a well-established GPU ecosystem available in the cloud. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction operations, one of the most commonly executed sets of operations in FHE. Third, by integrating the MOD-unit with our novel pipelined 64-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by 19%. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of 796×, 14.2×, and 2.3× over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively.


INTRODUCTION
Large-scale machine learning (ML) models, such as OpenAI's GPT series and DALL-E, Google AI's BERT and T5, and Facebook's RoBERTa, have made significant advances in recent years. Unfortunately, providing public access for inference on these large-scale models leaves them susceptible to zero-day exploits [38,71]. These exploits expose the user data as well as the ML models to hackers for potential reverse engineering [38], a concerning prospect as these models are highly valued assets for their respective companies. For example, a recent security vulnerability in the Redis client library resulted in a data breach on ChatGPT [60], which is currently regarded as one of the leading machine-learning applications.
In the past decade, Fully Homomorphic Encryption (FHE) has emerged as the "holy grail" of data privacy. Using FHE, one can perform operations on encrypted data without decrypting it first (see Figure 1). FHE adopters can offload their encrypted private data to third-party cloud service providers while preserving end-to-end privacy. Specifically, the secret key used for encryption by users is never disclosed to the cloud providers, thus facilitating privacy-preserving ML training and inference in an untrusted cloud setting (whether self-hosted or utilizing public cloud services) [77,83,87].
During its early stages, homomorphic encryption was limited by the number and types of computations, rendering it viable solely for shallow circuits [30]. In these circuits, the error would propagate and increase with each addition or multiplication operation, ultimately leading to decryption errors. Following Gentry's groundbreaking work [30], this important limitation was resolved by using bootstrapping [19], resulting in FHE computations that permit an unlimited number of operations. Although FHE offers significant benefits in terms of privacy preservation, it faces the challenge of being extremely slow (especially the bootstrapping operation), with performance up to five orders of magnitude slower than plaintext computing [42].
Prior studies have tried to accelerate FHE kernels by developing CPU extensions [15,31,42,55], GPU libraries [4,54,61,76], FPGA implementations [1,66,88], and custom accelerators [33,45,67]. CPU-based solutions inherently face limitations due to their limited compute throughput [17], while FPGA-based solutions are constrained by their limited operating frequency and the resources available on the FPGA board. ASIC-based solutions provide the most acceleration [29], but they cannot be easily adapted to future algorithmic changes and can be fairly expensive to use in practice. Additionally, as the number of diverse domain-specific custom accelerators grows rapidly, it becomes increasingly difficult to create high-quality software libraries, compilers, drivers, and simulation tools for each accelerator in a timely manner, posing a challenge in terms of time-to-market. Therefore, while previous works have accelerated FHE workloads, they often fall short in terms of cost-effectiveness or lack the necessary infrastructure to support large-scale deployment.
Rather than developing domain-specific custom accelerators, our work focuses on enhancing the microarchitecture of GPUs that are currently deployed in the cloud and can be easily upgraded. This leads to a practical solution, as we can readily exploit the cloud ecosystem that is built around GPUs. On the upside, GPUs offer a large number of vector processing units, so they are a good match to capitalize on the inherent parallelism associated with FHE workloads. However, FHE ciphertexts are large (dozens of MB), require a massive number of integer arithmetic operations, and exhibit varying-stride memory access patterns. This imposes a true challenge for existing GPU architectures, since GPUs have been historically designed to excel at executing thousands of threads in parallel (e.g., batched machine-learning workloads) featuring uniform memory access patterns and rich floating-point computations.
To bridge the wide performance gap between operating on encrypted data using FHE and operating on plaintext data on GPUs, we propose several microarchitectural features to extend the latest AMD CDNA GPU architecture. Specifically, our efforts are focused on improving the performance of the Residue Number System (RNS) version of the CKKS FHE scheme, as it naturally supports numerous privacy-preserving applications. Similar to results found in earlier studies [24], our benchmarking of CKKS FHE kernels indicates they are significantly bottlenecked by the limited main memory bandwidth. This is because current GPUs suffer from excessive redundant memory accesses when executing FHE-based workloads. Present GPUs are ill-equipped to deal with varying-stride FHE memory access patterns. According to our experiments, this can lead to a very high degree of compute unit stalls and is a primary cause of the huge performance slowdown in FHE computations on GPU-based systems.
To address these challenges, we propose GME, a hardware-software co-design specifically tailored to provide efficient FHE execution on the AMD CDNA GPU architecture (illustrated in Figure 2). First, we present CU-side interconnects that allow ciphertext to be retained within the on-chip caches, thus eliminating redundant memory transactions in the FHE kernels. Next, we optimize the most commonly executed operations present in FHE workloads (i.e., the modular reduction operations) and propose novel MOD-units. To complement our MOD-units, we introduce WMAC-units that natively perform 64-bit integer operations, preventing the throttling of the existing 32-bit arithmetic GPU pipelines. Finally, in order to fully benefit from the optimizations applied to FHE kernels, we develop a Locality-Aware Block Scheduler (LABS) that enhances the temporal locality of data. LABS is able to retain on-chip cache data across FHE blocks, utilizing block computation graphs for assistance.
To faithfully implement and evaluate GME, we employ NaviSim [11], a cycle-accurate GPU architecture simulator that accurately models the CDNA ISA [6]. To extend our research to capture inter-kernel optimizations, we augment NaviSim with a block-level directed acyclic compute graph simulator called BlockSim. In addition, we conduct ablation studies on our microarchitectural feature implementations, enabling us to isolate each microarchitectural component and evaluate its distinct influence on the entire FHE workload.
Our contributions include:
1. Simulator Infrastructure: We introduce BlockSim, which, to the best of our knowledge, is among the first efforts to develop a simulator extension for investigating FHE microarchitecture on GPUs.
2. CU-side Interconnect: We propose a lightweight hierarchical on-chip network (cNoC) that retains ciphertext in on-chip memory across FHE kernels, eliminating redundant memory transactions.
3. Native Modular Arithmetic: We introduce MOD-units and pipelined 64-bit WMAC-units that improve the math pipeline throughput for FHE workloads.
4. Locality-Aware Block Scheduler: Utilizing the CU-side interconnect (cNoC), we propose a graph-based block scheduler designed to improve the temporal locality of data shared across FHE primitives.
Our proposed improvements result in an average speedup of 14.6× over the prior state-of-the-art GPU implementation [41] for the HE-LR and ResNet-20 FHE workloads. Our optimizations collectively reduce redundant computation by 38%, decreasing the memory pressure on DRAM. Although the proposed optimizations can be adapted for other architectures (with minor modifications), our work primarily concentrates on the MI100 GPU, based on AMD's CDNA microarchitecture.

BACKGROUND
In this section, we briefly describe the AMD CDNA architecture and background of the CKKS FHE scheme.

AMD CDNA Architecture
To meet the growing computation requirements of high-performance computing (HPC) and machine learning (ML) workloads, AMD introduced a new family of CDNA GPU architectures [8] that are used in AMD's Instinct line of accelerators. The CDNA architecture (see Figure 3) adopts a highly modular design that incorporates a Command Processor (CP), Shader Engines (including Compute Units and L1 caches), and an interconnect connecting the core-side L1 caches to the memory-side L2 caches and DRAM. The CP receives requests from the driver on the CPU, including memory copying and kernel launch requests. The CP sends memory copying requests to the Direct Memory Access (DMA) engine, which handles the transfer of data between the GPU and system memory. The CP is also responsible for breaking kernels down into work-groups and wavefronts, sending these compute tasks to Asynchronous Compute Engines (ACE), which manage the dispatch of work-groups and wavefronts on the Compute Units (CUs).
The CDNA architecture employs the CU design from the earlier GCN architecture but enhances it with new Matrix Core Engines. A CU (see Figure 3) is responsible for instruction execution and data processing. Each CU is composed of a scheduler that can fetch and issue instructions for up to 40 wavefronts. Different types of instructions are issued to different execution units, including a branch unit, scalar processing units, and vector processing units. The scalar processing units are responsible for executing instructions that manipulate data shared by work-items in a wavefront. The vector processing units include a vector memory unit, four Single-Instruction Multiple-Data (SIMD) units, and a matrix core engine. Each SIMD unit is equipped with 16 single-precision Arithmetic Logic Units (ALUs), which are optimized for FP32 operations. The matrix core engine handles multiply-accumulate operations, supporting various datatypes (like 8-bit integers (INT8), 16-bit half-precision FP (FP16), 16-bit Brain FP (bf16), and 32-bit single-precision FP32). We cannot leverage these engines for FHE, as they work with INT8 operands that are not well-suited for FHE computations [78] (FHE workloads benefit from INT64 arithmetic pipelines). Each CU has a 64 KB memory space called the Local Data Share (LDS). The CDNA architecture has a two-level cache hierarchy. Each CU has a dedicated L1 vector cache. CUs in a Shader Engine (typically 15 CUs) share an L1 scalar cache and an L1 instruction cache. The second level of cache is composed of memory-side L2 caches. Each L2 cache interfaces to a DRAM controller (typically implemented in HBM or GDDR technology). The L2 caches and the DRAM controllers are banked, allowing them to service a part of the address space.

CKKS FHE Scheme
In this paper, we focus on the CKKS FHE scheme, as it can support a wide range of privacy-preserving applications by allowing operations on floating-point data. We list the parameters that define the CKKS FHE scheme in Table 1 and the corresponding values of key parameters in Table 3. The main parameters, i.e., N and Q, define the size of the ciphertext and also govern the size of the working data set that is required to be present in the on-chip memory. The ciphertext consists of a pair of elements in the polynomial ring R_Q = Z_Q[x]/(x^N + 1). Each element of this ring is a polynomial ∑_{i=0}^{N−1} a_i x^i with "degree-bound" N − 1 and coefficients a_i in Z_Q. (Table 1 also defines the notation [P]_{q_i} for the q_i-limb of a polynomial P, evk for an evaluation key, and evk^{(r)}_{rot} for the evaluation key of an HE-Rotate block with r rotations.) For a message m ∈ C^n, we denote its encryption as [m] = (A_m, B_m), where A_m and B_m are the two polynomials that comprise the ciphertext.
For 128-bit security, typical values of N range from 2^16 to 2^17, and log Q values range from 1700 to 2200 bits for practical purposes. These large sizes of N and log Q are required to maintain the security of the underlying Ring-Learning with Errors assumption [57]. However, there are no commercially available compute systems that have hundred-bit wide or thousand-bit wide ALUs, which are necessary to process these large coefficients. A common approach for implementing the CKKS scheme on hardware with a much smaller word length is to choose Q to be a product of distinct word-sized primes q_1, . . ., q_ℓ. Then Z_Q can be identified with the "product ring" ∏_{i=1}^{ℓ} Z_{q_i} via the Chinese Remainder Theorem [79]. In practice, this means that the elements of Z_Q can be represented as an ℓ-tuple (x_1, . . ., x_ℓ), where x_i ∈ Z_{q_i} for each i. This representation of elements in Z_Q is referred to as the Residue Number System (RNS), and the components x_i are commonly referred to as the limbs of the ciphertext.
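The RNS decomposition described above can be sketched in a few lines of plain Python (with toy word-sized primes standing in for the 54-bit primes used in practice); limb-wise arithmetic then maps directly onto word-sized hardware operations:

```python
from math import prod

def to_rns(x, primes):
    # decompose x in Z_Q (Q = q_1 * ... * q_l) into its RNS limbs
    return [x % q for q in primes]

def from_rns(limbs, primes):
    # reconstruct x from its limbs via the Chinese Remainder Theorem
    Q = prod(primes)
    x = 0
    for r, q in zip(limbs, primes):
        Qi = Q // q
        x += r * Qi * pow(Qi, -1, q)   # pow(Qi, -1, q) is the inverse mod q
    return x % Q

primes = [193, 197, 199]               # toy word-sized primes
x, y = 123456, 654321
limbs = to_rns(x, primes)
assert from_rns(limbs, primes) == x
# multiplication acts independently on each limb:
z = [(a * b) % q for a, b, q in zip(limbs, to_rns(y, primes), primes)]
assert from_rns(z, primes) == (x * y) % prod(primes)
```

Each homomorphic addition or multiplication reduces to independent per-limb operations, which is what makes the representation amenable to wide SIMD hardware.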
In this work, as shown in Table 3, we choose N = 2^16 and log Q = 1728, meaning that our ciphertext size will be 28.3 MB, where each polynomial in the ciphertext is ∼14 MB. After RNS decomposition on these polynomials using a word length of 54 bits, we get 32 limbs in each polynomial, where each limb is ∼0.44 MB large. The last level cache and the LDS in the AMD MI100 are 8 MB and 7.5 MB, respectively. Thus, we cannot accommodate even a single ciphertext in the on-chip memory. At most, we can fit ∼18 limbs of a ciphertext polynomial, and as a result, we will have to perform frequent accesses to the main memory to operate on a single ciphertext. In addition, the large value of N implies that we need to operate on 2^16 coefficients for any given homomorphic operation. The AMD MI100 GPU includes 120 CUs with 4 SIMD units each. Each SIMD unit can execute 16 threads in parallel. Therefore, a total of 7680 operations (scalar additions/multiplications) can be performed in parallel. However, we need to schedule the operations on 2^16 coefficients in over eight batches (2^16 / 7680), adding to the complexity of scheduling operations.
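The working-set sizes quoted above follow directly from the chosen parameters; a quick sanity check:

```python
N = 2 ** 16            # polynomial degree-bound
limbs = 32             # RNS limbs after decomposing log Q = 1728
word_bits = 54         # prime (limb) word length

limb_mb = N * word_bits / 8 / 1e6       # one limb, in MB
poly_mb = limbs * limb_mb               # one ciphertext polynomial
ct_mb = 2 * poly_mb                     # ciphertext = pair of polynomials
assert round(limb_mb, 2) == 0.44
assert round(ct_mb, 1) == 28.3

lanes = 120 * 4 * 16   # CUs x SIMD units x lanes = 7680 parallel ops
batches = N / lanes    # coefficient batches per vectorized operation
assert batches > 8     # "over eight batches"
```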
We list all the building blocks in the CKKS scheme in Table 2. All of the operations that form the building blocks of the CKKS scheme reduce to 64-bit wide scalar modular additions and scalar modular multiplications. Commercially available GPU architectures do not implement these wide modular arithmetic operations directly, but can emulate them via multiple arithmetic instructions, which significantly increases the amount of compute required for these operations. Therefore, providing native modular arithmetic units is critical to accelerating FHE computation. To perform modular addition over operands that are already reduced, we use the standard approach of conditional subtraction if the addition overflows the modulus. For generic modular multiplications, we use the modified Barrett reduction technique [76].
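For operands that are already reduced, the conditional-subtraction approach mentioned above needs only a compare and a subtract; a minimal sketch (the 54-bit modulus below is an illustrative value, not one of the paper's primes):

```python
def mod_add(a, b, q):
    # a, b already in [0, q): one conditional subtraction suffices
    s = a + b
    return s - q if s >= q else s

q = (1 << 54) - 33        # a hypothetical 54-bit modulus
assert mod_add(q - 1, 5, q) == 4
assert mod_add(3, 4, q) == 7
```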
ScalarAdd and ScalarMult are the two most basic building blocks that add and multiply a scalar constant to a ciphertext. PolyAdd and PolyMult add and multiply a plaintext polynomial to a ciphertext. We define separate ScalarAdd and ScalarMult operations (in addition to PolyAdd and PolyMult) because the scalar constant values can be fetched directly from the register file, which helps save expensive main memory accesses. Note that PolyMult is followed by an HERescale operation to restore the scale of a ciphertext to ∆ from scale ∆^2. CKKS supports floating-point messages, so all encoded messages must include a scaling factor ∆. This scaling factor is typically the size of one of the limbs of the ciphertext. When multiplying messages together, this scaling factor grows as well. The scaling factor must be shrunk down in order to avoid overflowing the ciphertext coefficient modulus.
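The scale bookkeeping behind HERescale can be illustrated with a plaintext fixed-point analogy (this ignores encryption and the polynomial/RNS structure entirely; ∆ = 2^40 is an illustrative choice):

```python
DELTA = 1 << 40                      # scaling factor, roughly one limb wide

def encode(x):
    # fixed-point encode at scale DELTA
    return round(x * DELTA)

def decode(v, scale=DELTA):
    return v / scale

a, b = encode(1.5), encode(2.5)
prod = a * b                         # scale grows to DELTA**2
rescaled = round(prod / DELTA)       # "HERescale": restore scale to DELTA
assert abs(decode(rescaled) - 3.75) < 1e-9
```

Without the rescale step, the scale would keep squaring with every multiplication and quickly overflow the coefficient modulus Q.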
In order to enable fast polynomial multiplication, by default, we represent polynomials as a series of N evaluations at fixed roots of unity. This allows polynomial multiplication to occur in O(N) time instead of O(N^2) time. We refer to this polynomial representation as the evaluation representation. There are certain sub-operations within the building blocks, defined in Table 2, that operate over the polynomial's coefficient representation, which is simply a vector of its coefficients. Moving between the two polynomial representations requires a number-theoretic transform (NTT) or inverse NTT, which is the finite field version of the fast Fourier transform (FFT). We incorporate a merged-NTT algorithmic optimization [65], improving spatial locality for twiddle factors as they are read sequentially.
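The evaluation representation is exactly what NTT-based multiplication produces: transform once in O(N log N), then multiply pointwise in O(N). The sketch below is a textbook iterative NTT over a small prime field (p = 998244353 with primitive root 3 — illustrative parameters, and a cyclic rather than the negacyclic convolution used for the ring x^N + 1):

```python
MOD = 998244353  # prime with 2^23 | MOD - 1; primitive root 3

def ntt(a, invert=False):
    n = len(a)               # n must be a power of two
    a = a[:]
    j = 0                    # bit-reversal permutation
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:       # butterfly stages
        w = pow(3, (MOD - 1) // length, MOD)
        if invert:
            w = pow(w, MOD - 2, MOD)
        for i in range(0, n, length):
            wn = 1
            for k in range(i, i + length // 2):
                u, v = a[k], a[k + length // 2] * wn % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                wn = wn * w % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a

def polymul(f, g):
    # pad to a power of two, transform, multiply pointwise, invert
    n = 1
    while n < len(f) + len(g) - 1:
        n <<= 1
    fa = ntt(f + [0] * (n - len(f)))
    gb = ntt(g + [0] * (n - len(g)))
    out = ntt([x * y % MOD for x, y in zip(fa, gb)], invert=True)
    return out[:len(f) + len(g) - 1]

assert polymul([1, 2], [3, 4]) == [3, 10, 8]   # (1+2x)(3+4x) = 3+10x+8x^2
```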
The HEAdd operation is straightforward and adds the corresponding polynomials within the two ciphertexts. However, the HEMult and HERotate operations are computationally expensive as they perform a KeySwitch operation after the multiplication and automorph operations, respectively. In both the HEMult and HERotate implementations, there is an intermediate ciphertext with a decryption key that differs from the decryption key of the input ciphertexts. In order to change the key back, a KeySwitch operation is applied: it takes as input a switching key (e.g., evk^{(r)}_{rot} for an HERotate by r slots) and a ciphertext [m]_s that is decryptable under a secret key s. The output of the key switch operation is a ciphertext [m]_{s′} that encrypts the same message but is decryptable under a different key s′.
To incur minimal noise growth during the key switch operation, the key switch operation requires that we split the polynomial into dnum digits, then raise the modulus before multiplying with the switching key, followed by a modulus down operation. The modulus raise and down operations operate on the coefficient representation of the polynomial, requiring us to perform expensive NTT and iNTT conversions. Moreover, the switching keys are the same size as the ciphertext itself, requiring us to fetch ∼112 MB of data to multiply the switching keys with the ciphertext. Thus, the key switching operation not only adds to the bulk of the compute through hundreds of NTT and iNTT operations, but also leads to memory bandwidth bottlenecks. Finally, there exists an operation known as bootstrapping [30] that needs to be performed frequently to de-noise the ciphertext. This bootstrapping operation is a sequence of the basic building blocks in the CKKS scheme, meaning that it suffers from the same compute and memory bottlenecks that exist in these building blocks, making it one of the most expensive operations.

GME ARCHITECTURE
The current issue with GPUs when implementing FHE workloads is the significant disproportion in the usage of the various hardware resources present on the GPU. As a result, specific resources such as CUs experience underutilization, while others, like HBM and on-chip caches, pose as significant bottlenecks. In this paper, we propose to re-architect the current GPU microarchitecture and also introduce novel microarchitectural extensions that enable optimal utilization of GPU resources so as to maximize the performance of the FHE workloads running on the GPU. We propose GME, a robust set of microarchitectural features targeting AMD's CDNA architecture, unlocking the full potential of the GPU to accelerate FHE workloads by over 14.2× as compared to previous comparable accelerators [41].
In our work, we pinpoint critical bottlenecks encountered during FHE workload execution and address them progressively using four microarchitectural feature extensions. Our on-chip CU-side hierarchical network (cNoC) and the Locality-Aware Block Scheduler (LABS) contribute to minimizing the DRAM bandwidth bottleneck. Simultaneously, our implementation of native modular reduction (MOD) and wider multiply-accumulate (WMAC) units improves the math pipeline throughput, ensuring a streamlined data flow with evenly distributed resource utilization. The list and impact of our contributions can be visualized in Figure 2.

cNoC: CU-side interconnect
Modern GPUs have a network-on-chip that interconnects the cores (in the case of AMD GPUs, compute units) together with the memory partitions or memory banks. On-chip communication occurs between the cores and the memory banks, not necessarily between the cores. In this work, we propose a new type of on-chip interconnect that we refer to as a CU-side network-on-chip (cNoC) that interconnects the CUs together. In particular, all the CUs' LDS are interconnected through the cNoC to enable a "global" LDS that can be shared between the CUs. By exploiting the cNoC, the dedicated on-chip memory can be shared between cores, thus minimizing memory accesses. We also provide synchronization barriers of varying granularity to mitigate race conditions. Since the LDS is user-controlled, our approach does not incur the overhead associated with cache coherence and avoids redundant cache invalidations, but comes with some extra programmer effort. By implementing a global address space (GAS) in our GPU, we establish data sharing and form a unified GAS by combining all LDSs. The virtual address space is then mapped onto this unified GAS, with translation using a hash of the lower address bits. Current GPUs are designed hierarchically; e.g., the MI100 GPU comprises numerous compute units, with 8 of them combined to form a Shader Engine (seen in Figure 5). The proposed cNoC takes advantage of this hierarchy, utilizing a hierarchical on-chip network (illustrated in Figure 5) that features a single router for each Shader Engine, connecting the eight compute units that make up a Shader Engine. The MI100 GPU houses 15 Shader Engines, resulting in a total of 120 compute units. The routers are arranged in a 3 × 5 2D grid and interconnected through a torus topology. While this concentrated-torus topology [10,39] can increase network complexity, it reduces the number of required routers (from 120 to 15), thereby minimizing the chip area needed for the network. In a concentrated-torus topology, all
routers have the same degree (number of ports), creating an edge-symmetric topology that is well-suited for the all-to-all communication patterns of FHE workloads.
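The unified GAS can be pictured as a simple interleaving of LDS slices across CUs. The sketch below maps a GAS address to its owning (Shader Engine, CU) pair; the 64 B granularity and modulo hash are our illustrative assumptions — the text only specifies that a hash of the lower address bits is used:

```python
NUM_SE = 15          # shader engines, one cNoC router each
CUS_PER_SE = 8       # CUs behind each router
LINE_BYTES = 64      # assumed interleaving granularity

def gas_owner(addr):
    # hash of the lower address bits selects the owning CU's LDS slice
    line = addr // LINE_BYTES
    cu = line % (NUM_SE * CUS_PER_SE)
    return divmod(cu, CUS_PER_SE)    # (shader engine, CU within engine)

assert gas_owner(0) == (0, 0)
assert gas_owner(64 * 9) == (1, 1)   # consecutive lines spread across CUs
```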
Figure 4(a) illustrates the conventional approach of data sharing, where memory transactions must traverse the full memory hierarchy to share data between neighboring LDS. In contrast, our proposed CU-side interconnect, presented in Figure 4(b), incorporates on-chip routers that circumvent off-chip interconnects, improving data reuse. This results in a decrease of redundant memory operations by 38%, effectively supporting the all-to-all communication pattern commonly seen in FHE workloads.

MOD: Native modular reduction

The existing GPU arithmetic pipeline is highly optimized for data manipulation operations like multiply, add, bit-shift, and compare. A wavefront executing any of these instructions takes 4 clock cycles in a lock-step manner in the SIMD units. In a single wavefront consisting of 64 threads, 16 threads are executed concurrently on the SIMD units during each clock cycle. Conversely, operations like divide and modulus are emulated using a series of native instructions, resulting in considerably slower performance compared to their native counterparts.
As stated in Section 2.2, the modular reduction operation, used for determining the remainder of a division, is performed after each addition and multiplication. As a result, optimizing modular reduction is crucial for speeding up FHE workloads. At present, the MI100 GPU executes a modular operation through a sequence of addition, multiplication, bit-shift, and conditional operations, drawing on the conventional Barrett's reduction algorithm [48]. This operation currently takes a considerable amount of time, with the mod-red operation requiring an average of 46 cycles for execution on the MI100 GPU. In our study, we propose enhancing the Vector ALU pipeline within the CDNA architecture to natively support modular reduction, which brings it down to an average of 17 cycles for each mod-red instruction. We augment the CDNA instruction set architecture (ISA) with a collection of vector instructions designed to perform modular reduction operations natively after addition or multiplication operations. The new native modular instructions proposed include:
• Native modular reduction: mod-red <v0,s0> | V0 = V0 mod s0
• Native modular addition: mod-add <v0,v1,s0>
Modular reduction involves several comparison operations, resulting in branch divergence in GPUs. Our implementation is derived from an improved Barrett's reduction algorithm [76]. This approach minimizes the number of comparison operations to one per modular reduction operation, significantly reducing the number of branch instructions and enhancing compute utilization.
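A behavioral model of the single-comparison reduction helps make the savings concrete. The sketch below follows the classic Barrett recipe with a precomputed reciprocal: with μ = ⌊2^{2k}/q⌋ and x < q², the quotient estimate is off by at most one, so a single conditional subtraction (one branch) completes the reduction. This is our illustrative model, not the MOD-unit's exact datapath:

```python
def barrett_setup(q):
    k = q.bit_length()
    mu = (1 << (2 * k)) // q          # precomputed, depends only on q
    return k, mu

def mod_red(x, q, k, mu):
    # x < q*q; t underestimates floor(x / q) by at most 1,
    # so one conditional subtraction (a single branch) finishes the job
    t = (x * mu) >> (2 * k)
    r = x - t * q
    return r - q if r >= q else r

q = (1 << 54) - 33                    # hypothetical 54-bit modulus
k, mu = barrett_setup(q)
for x in (0, q - 1, q, 123456789123456789, (q - 1) * (q - 1)):
    assert mod_red(x, q, k, mu) == x % q
```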
Wider multiply-accumulate units (WMAC): In the CKKS FHE scheme, we can choose to perform operations on 32-, 64-, or 128-bit wide RNS limbs for a ciphertext. This limb bit width governs the operand size for the vector ALUs, impacting the number of modular addition and multiplication operations required. Moreover, there is an algorithmic-level performance versus precision trade-off to consider when deciding on the bit width. If we opt for 32-bit wide RNS limbs, we will have numerous limbs to work with, increasing the available levels [2] while simultaneously reducing the achievable precision for an application. Conversely, if we select 128-bit RNS limbs, we will have fewer limbs to work with, resulting in a decrease in the number of available levels but higher precision for an application. With our chosen parameters, using 128-bit wide RNS limbs would leave us with an insufficient number of limbs to perform a single bootstrapping operation. To strike a balance between performance and precision, we choose to use 64-bit wide RNS limbs in this work.
Most GPUs in the market natively support 16-, 32-, and 64-bit floating-point computations as well as 4-, 8-, and 32-bit integer computations. Unfortunately, they lack dedicated hardware support for 64-bit integer operations, the most common operation in FHE workloads. Instructions for processing 64-bit integer operands are emulated using multiple 32-bit integer instructions, making them comparatively slower. To complement our native modular reduction, which relies on 64-bit integer operations, we add support for a hardware-backed 64-bit integer multiplier and accumulator, and we widen the register-file size to accommodate the large ciphertexts. Table 4 demonstrates the decrease in total cycles for each of our proposed native modular instructions in comparison to the MI100 GPU-emulated instructions in the baseline (vanilla) configuration.
Prior studies [28,84] argued that dedicating resources to specialized 64-bit integer cores was not justifiable in terms of opportunity cost, as workloads at the time did not necessitate INT64 support, and emulation with 32-bit cores was sufficient.However, in the context of FHE, we maintain that the performance improvements attained through using an upgraded vector ALU justify the additional chip resources allocated.

LABS: Locality-Aware Block Scheduler
So far, our microarchitectural extensions have primarily focused on optimizing individual FHE blocks. To better leverage these new features, we next focus on inter-block optimization opportunities, targeting the workgroup dispatcher within the CDNA architecture. GPU scheduling is typically managed using streams of blocks that are scheduled on compute units in a greedy manner [9]. The presence of large GPU register files allows the scheduler to oversubscribe blocks to each compute unit. However, the existing scheduler within the CDNA architecture is not cognizant of inter-block data dependencies, forcing cache flushes when transitioning from one block to the next.
We propose a Locality-Aware Block Scheduler (LABS) designed to schedule blocks with shared data together, thus avoiding redundant on-chip cache flushes, specifically in the LDS. LABS further benefits from our set of microarchitectural enhancements, which relax the operational constraints during block scheduling and create new opportunities for optimization (for instance, the cNoC feature enables LDS data to be globally accessible across all CUs, thereby allowing the scheduler to assign blocks to any available CU). To develop LABS, we employ a well-known graph-based mapping solution and frame the problem of mapping blocks to CUs as a compile-time Graph Partitioning Problem (GPP) [80,85].
Graph Partitioning Problem: To develop our locality-aware block scheduler, we use two graphs. Let G = G(V, E) represent a directed acyclic compute graph with vertices V (corresponding to FHE blocks) and edges E (indicating the data dependencies of the blocks). Similarly, let G_a = G_a(V_a, E_a) denote an undirected graph with vertices V_a (representing GPU compute units) and edges E_a (illustrating the communication links between compute units). Both edge sets, E and E_a, are assumed to be weighted, with edge weights of E signifying the size of data transferred between related blocks, and edge weights of E_a representing the bandwidth of communication between corresponding compute units. We can then define π : V → V_a as a mapping that partitions V into |V_a| disjoint subsets. Our objective is to find a mapping π that minimizes the communication overhead between compute units.
We formulate our Graph Partitioning Problem (GPP) by introducing a cost function Φ. For a graph G partitioned such that E_c denotes the set of cut edges, Φ can be expressed as the sum of the individual cut-edge weights, with |(v, w)| representing the edge weight of the edge connecting node v to node w:

Φ = ∑_{(v,w) ∈ E_c} |(v, w)|

The cost function Φ reflects the communication overhead associated with assigning FHE blocks to separate compute units. The goal of the graph partitioning problem is to discover a partition that evenly distributes the load across each compute unit while minimizing the communication cost Φ.
In this equation, |(v, w)| signifies the data transferred between FHE blocks. To partition the compute graph and prepare it for mapping onto the architecture graph, we utilize a multilevel mesh partitioning technique. For readers interested in gaining further insights into our implementation of the multilevel mesh partitioning algorithm, we refer to prior work on graph partitioning [80,85].

Architecture-aware mapping: In this work, we focus on mapping our partitioned subgraphs onto the set of compute units V_a, where communication costs (both latency and bandwidth) are not uniformly distributed across the network [75]. To uniformly distribute the communication overheads across the network, we introduce a network cost function Γ. Here, Γ is defined as the sum, over cut edges, of the product of each cut-edge weight and the corresponding edge weight in the architecture graph when mapped using a mapping function π. Formally, Γ is described as:

Γ = ∑_{(v,w) ∈ E_c} |(v, w)| · |(π(v), π(w))|

In this equation, π(v) represents the mapping of block v to a compute unit from the set V_a, after applying the mapping function π. Additionally, |(π(v), π(w))| represents the communication bandwidth between compute units π(v) and π(w). Similar to our analysis with Φ, our goal is to minimize Γ. To accomplish this, we use a compile-time optimization by applying simulated annealing, alongside mesh partitioning, to map FHE blocks onto compute units efficiently. The evaluation of performance improvements from incorporating LABS is discussed further in Section 4.
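The two cost functions and the annealing step can be modeled compactly (a toy block graph and link-cost model of our own; the real implementation uses multilevel mesh partitioning, which is not reproduced here):

```python
import random

# Toy FHE block compute graph: (block u, block v, bytes transferred).
# Block names, CU count, and weights are illustrative, not from the paper.
E = [("ntt0", "mul", 4), ("ntt1", "mul", 4), ("mul", "intt", 8)]
blocks = ["ntt0", "ntt1", "mul", "intt"]
cus = [0, 1, 2, 3]

def link_cost(a, b):
    # architecture-graph edge weight: toy distance-based link cost
    return 1 + abs(a - b)

def gamma(mapping):
    # network cost: sum over cut edges of data volume x link cost
    return sum(w * link_cost(mapping[u], mapping[v])
               for u, v, w in E if mapping[u] != mapping[v])

random.seed(0)
m = {b: random.choice(cus) for b in blocks}
start = best = gamma(m)
for _ in range(200):              # annealing-style local search (greedy here)
    b = random.choice(blocks)
    old, m[b] = m[b], random.choice(cus)
    g = gamma(m)
    if g <= best:
        best = g                  # keep moves that do not increase the cost
    else:
        m[b] = old                # revert worsening moves
```

A true simulated-annealing schedule would also accept some worsening moves with a temperature-dependent probability; the zero-temperature variant above keeps the sketch short.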

EVALUATION
In this section, we first give a concise overview of the GPU simulator employed to model our microarchitectural extensions. Next, we outline the methodology used to assess the performance of our bootstrapping and other workload implementations. Finally, we present our evaluation results.

The NaviSim and BlockSim Simulators
In our work, we leverage NaviSim [11], a cycle-level, execution-driven GPU architecture simulator. NaviSim faithfully models the CDNA architecture by implementing a CDNA ISA emulator and a detailed timing simulator of all the computational components and the memory hierarchy. NaviSim utilizes the Akita simulation engine [81] to enable modularity and high-performance parallel simulation. NaviSim is highly configurable and accurate, and has been extensively validated against an AMD MI100 GPU. As an execution-driven simulator, NaviSim recreates the execution results of GPU instructions during simulation with the help of an instruction emulator for the CDNA ISA [7,12]. Currently, NaviSim supports kernels written in both OpenCL [43] and the HIP programming language [9]. For our experiments, we implement our kernels using OpenCL. NaviSim can generate a wide range of output data to facilitate performance analysis. For performance metrics related to individual components, NaviSim reports instruction counts, average latency spent accessing each level of cache, transaction counts for each cache, TLB transaction counts, DRAM transaction counts, and read/write data sizes. For low-level details, NaviSim can generate instruction traces and memory traces. Finally, NaviSim can produce traces in the Daisen format so that users can use Daisen, a web-based visualization tool [82], to inspect the detailed behavior of each component.
We enhance NaviSim's capabilities by incorporating our new custom kernel-level simulator, BlockSim. BlockSim is designed to enable us to identify inter-kernel optimization opportunities. With an adjustable sampling rate for performance metrics, BlockSim accelerates simulations, facilitating more efficient design space exploration. BlockSim generates analytical models of the FHE blocks to provide estimates for the runtimes of various GPU configurations. When the best design parameters are identified, NaviSim is then employed to generate cycle-accurate performance metrics. Besides supporting FHE workloads, BlockSim serves as an essential component of NaviSim by abstracting low-level implementation details from the user, allowing them to focus on entire workloads rather than individual kernels. BlockSim enables restructuring of the wavefront scheduler and integrates the compile-time optimizations obtained from LABS. We utilize AMD's CDNA architecture-based MI100 GPU to establish a baseline for FHE application evaluations, and we further validate our BlockSim findings against the MI100 GPU.

Experimental Setup
In our experiments, we determine our baseline performance using an AMD MI100 CDNA GPU (see Table 5). We then iteratively introduce microarchitectural extensions and evaluate the performance benefits of each enhancement. We first evaluate our three microarchitectural extensions (cNoC, MOD, WMAC), then evaluate our compile-time optimization LABS, and conclude with a memory size exploration to determine the impact of on-chip memory size on FHE workloads. We evaluate these microarchitectural enhancements and the compiler optimization using NaviSim and BlockSim. To determine the power and area overhead of our proposed microarchitectural components, we implement them in RTL. Utilizing Cadence Genus Synthesis Solutions, we synthesize these RTL components targeting the ASAP7 technology library [22] and determine the area and power consumption of each proposed microarchitectural element.
We first evaluate our bootstrapping implementation performance, utilizing the amortized mult time per slot metric [41]. This metric has been used frequently in the past to perform a comparison between different bootstrapping implementations. We can compute this metric as follows:

T_mult/slot = T_boot / (L_boot · (n/2))

Here, T_boot stands for the total bootstrapping runtime, and L_boot stands for the number of levels that the bootstrapping operation utilizes. The rest of the parameters are defined in Table 1. The parameters used in our implementation are L_boot = 17 and n = 2^15. In addition, we analyze the performance of two workloads: HE-based logistic regression (HELR) [35] and encrypted ResNet-20 [50] utilizing the CIFAR-10 dataset. For all three workloads, we evaluate the contributions of each individual FHE building block (see Table 2) that makes up the respective workload.

[Table 5 notes: The chip area and power consumption of the CDNA architecture-based MI100 GPU are not disclosed; we display publicly available approximated values. † We compute the chip area and power requirements of our microarchitectural extensions using RTL components and Cadence Genus Synthesis Solutions with the ASAP7 technology library. ‡ Reported values are the maximum clock frequency F_max that the design can sustain without violating timing constraints.]
In addition, for these workloads, we report the performance benefits achieved by employing each of the proposed microarchitectural enhancements.
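The amortized metric can be sketched directly. This assumes, as is standard for CKKS, that a ciphertext with ring dimension n packs n/2 slots; the bootstrapping time below is a made-up placeholder, while L_boot = 17 and n = 2^15 match the parameters stated above:

```python
def amortized_mult_time_per_slot(t_boot_us, l_boot, n):
    """Amortize the total bootstrapping time T_boot over the multiplicative
    levels it provides (L_boot) and the n/2 slots packed per CKKS ciphertext."""
    return t_boot_us / (l_boot * (n // 2))

# Illustrative call: a hypothetical 171 ms bootstrap with the paper's parameters.
print(amortized_mult_time_per_slot(171_000, 17, 2**15))  # microseconds per slot-level
```

Lower values indicate that a single (expensive) bootstrap buys more useful multiplication capacity across the packed slots.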
We also compare our implementations with other state-of-the-art CKKS accelerators, incorporating a diverse selection of CPU [16,62], GPU [27,41,62], FPGA [1], and ASIC [44,45,69,70] platforms. Table 6 presents a detailed comparison of the key architectural parameters across all the related works. Table 6 also showcases the distribution of chip area and power requirements for each microarchitectural enhancement of GME. Since the maximum operating frequency F_max of our microarchitectural enhancements (1.63 GHz) is greater than the typical operating frequency of the MI100 GPU (1.5 GHz), we do not expect our extensions to change the critical path timings of the MI100 design. It is essential to emphasize that operating frequencies differ across the various designs, a crucial factor to consider when comparing execution times in absolute terms. Moreover, the ASIC designs make use of large on-chip memories, resulting in expensive solutions, and they are not as flexible as CPUs, GPUs, and FPGAs.

Results
Performance of FHE Building Blocks: We begin by comparing the performance of individual FHE blocks with the previous state-of-the-art GPU implementation [41]. Since these are individual FHE blocks, the reported metrics do not account for our inter-block LABS compiler optimization. We find that HEMult and HERotate are the most expensive operations, as they require key switching operations that involve the most data transfers from main memory. The next most expensive operation is HERescale, whose runtime is dominated by the compute-intensive NTT operations.
Across the five FHE blocks listed in Table 7, we achieve an average speedup of 6.4× compared to the 100x implementation. In particular, we see a substantial performance improvement in the most expensive operations, namely HEMult and HERotate, as our proposed microarchitectural enhancements reduce the data transfer time by 12× for both blocks. For HERescale, we decrease the average memory transaction latency by 13× using our microarchitectural enhancements to the on-chip network, cNoC, making HERescale the fastest block relative to the 100x GPU implementation. First, our proposed concentrated 2D torus network enables ciphertexts to be preserved in on-chip memory across kernels, leading to a significant increase in compute unit utilization across workloads and thereby reducing the average cycles consumed per memory transaction (see Avg. CPT in Figure 6). In fact, when comparing the average number of cycles spent per memory transaction (average CPT), we observe that the ResNet-20 workload consistently displays a lower average CPT value than the HE-LR workload. This indicates a higher degree of data reuse across FHE blocks within the ResNet-20 workload compared to the HE-LR workload. With the cNoC enhancement, the data required from previous kernels is retained in on-chip memory, so CUs are no longer starved for data; this also results in a substantial decrease in DRAM bandwidth utilization and DRAM traffic (the total amount of data transferred from DRAM). The L1 cache utilization decreases notably across all three workloads with the cNoC microarchitectural enhancement. This is because the LDS bypasses the L1 cache, and memory accesses to the LDS are not included in the L1 cache performance metrics.
The proposed MOD extension enhances the CDNA ISA by adding new instructions. These are complex instructions that implement commonly used operations in FHE, such as mod-red, mod-add, and mod-mult. As these instructions are complex (composed of multiple sub-instructions), they consume a higher number of cycles than comparatively simpler instructions such as mult or add. This explains the increase in the average cycles per instruction (CPI) metric shown in Figure 6.
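The mod-red operation that the MOD-unit accelerates can be illustrated in software with Barrett reduction, one standard division-free technique when the modulus is fixed at compile time. Whether the MOD-unit itself uses Barrett, Montgomery, or another scheme is not specified here, and the 60-bit modulus below is purely illustrative:

```python
def barrett_setup(q, k=64):
    # Precompute mu = floor(2^(2k) / q) once per (compile-time) modulus q.
    return (1 << (2 * k)) // q

def barrett_reduce(x, q, mu, k=64):
    """Reduce 0 <= x < q^2 modulo q using only multiplies, shifts, and subtracts."""
    t = (x * mu) >> (2 * k)   # estimate of x // q (may undershoot slightly)
    r = x - t * q
    while r >= q:             # at most a couple of correction subtractions
        r -= q
    return r

q = (1 << 60) - 2**14 + 1     # illustrative 60-bit modulus
mu = barrett_setup(q)
x = 123456789123456789 * 987654321
assert barrett_reduce(x, q, mu) == x % q
```

The appeal for hardware is that the expensive per-element division is replaced by a wide multiply and a shift, which maps naturally onto the pipelined 64-bit WMAC datapath described earlier.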
The compile-time LABS optimization further removes redundant memory transactions by scheduling blocks that share data together, thus reducing total DRAM traffic and enhancing CU utilization. The reported speedup is cumulative, with each microarchitectural enhancement building upon the previous ones; LABS delivers an additional speedup of over 1.5× on top of cNoC and MOD (see Figure 7).

Performance Comparison: We compare the performance of GME with the 100x implementation of FHE workloads in Table 8. GME surpasses the previous best GPU-based implementation for bootstrapping and HE-LR by factors of 15.7× and 14.2×, respectively. Note that we do not compare the performance of the ResNet-20 workload with 100x, as they do not implement this workload. With close to double the on-chip memory (LDS) and similar peak memory bandwidth, our microarchitectural extensions paired with our compiler optimization deliver significant performance improvements across all three FHE workloads. GME significantly outperforms the CPU implementation Lattigo by 514×, 1165×, and 427× for the bootstrapping, HE-LR, and ResNet-20 workloads, respectively. We assessed Lattigo's performance by executing the workloads on an Intel 8th-generation Xeon Platinum CPU with 128 GB of DDR4 memory.
In addition, GME outperforms the FPGA implementation of FHE workloads, FAB [1], by 2.7× and 1.9× for the bootstrapping and HE-LR workloads, respectively. A primary factor contributing to this acceleration is the low operating frequency of FPGAs (the Alveo U280 used in FAB operates at 300 MHz, while GME cores can achieve peak frequencies of 1.5 GHz [21]). In their work, the FAB authors scale their implementation to 8 FPGAs for the HE-LR workload (referred to as FAB-2). GME surpasses FAB-2 by 1.4×. This occurs because, when the intended application cannot be accommodated on a single FPGA device, considerable communication overheads negate the advantages of scaling out.
However, GME does not outperform all ASIC implementations shown in Table 8. While it achieves an average speedup of 18.7× over F1 for the HE-LR workload, it falls short in comparison to BTS, CL, and ARK due to their large on-chip memories and higher HBM bandwidths. ASIC implementations are tailored for a single workload; their customized designs lack flexibility, so they cannot easily accommodate multiple workloads across domains. Cutting-edge implementations such as ARK [44] integrate the latest HBM3 technology, enabling them to utilize nearly twice the memory bandwidth of the HBM2 used on MI100 GPUs. CraterLake (CL) [70] incorporates extra physical layers (PHY) to facilitate communication between DRAM and on-chip memory, thereby enhancing the bandwidth available to FHE workloads. In this paper, we limit our focus to an existing HBM model compatible with the CDNA architecture, without modifications to the physical communication layers.

On-chip Memory Size Exploration: Finally, we search for the ideal on-chip memory (LDS) size for FHE workloads, as shown in Figure 8. By increasing the total LDS size from 7.5 MB (the current LDS size on the MI100 GPU) to 15.5 MB, we achieve speedups of 1.74×, 1.53×, and 1.51× for the bootstrapping, HE-LR, and ResNet-20 workloads, respectively. However, increasing the LDS size beyond 15.5 MB does not result in substantial speedup, as DRAM bandwidth becomes the bottleneck.

DISCUSSION
In the field of accelerator design, developing general-purpose hardware is of vital importance. Rather than creating a custom accelerator specifically for FHE, we focus on extending the capabilities of existing GPUs to take advantage of their established ecosystems. General-purpose hardware, such as GPUs, reaps the benefits of versatile use of all microarchitectural elements present on the chip. In this section, we demonstrate the potential advantages of the proposed microarchitectural enhancements across various domains, confirming the broader importance of these microarchitectural features. Our observations are based on prior works, which highlight the potential benefits of similar optimizations across diverse workloads. We evaluate the influence of each optimization by examining whether a workload exhibits communication overheads, high data reuse, modular reduction, or integer arithmetic. Table 9 presents an overview of our findings, highlighting the potential advantages of the proposed microarchitectural extensions across an array of other workloads.
The recent Hopper architecture of NVIDIA's H100 GPU introduced a feature termed DSMEM (Distributed Shared Memory), which allows the virtual address space of shared memory to be logically spread across multiple SMs (streaming multiprocessors) [26]. Such a configuration promotes data sharing between SMs, similar to the cNoC feature we introduce. However, the details of the SM-to-SM network for DSMEM are not publicly available, and to the best of our knowledge, the SM-to-SM connectivity is not global but limited to the Thread Block Cluster comprising 8 SMs. In contrast, the cNoC proposed by us enables global connectivity among all 120 CUs of our MI100 GPU, enabling efficient all-to-all communication. For enhancing FHE performance, it is crucial to substantially reduce the latency of SM-to-SM communication. We aim to conduct a detailed analysis comparing the inter-SM communication overheads of the H100 GPU to those of GME in future work.

[Table 9 legend: ✔ the proposed optimization has the potential to significantly improve workload performance; ✘ the proposed optimization is unlikely to result in notable performance improvements; ◆ further experimentation is necessary, as it is uncertain whether the proposed optimization will lead to performance improvements. Rows visible in this extraction include N-Queens [40], Black-Scholes [32], and Fast Walsh [14].]
GPU accelerators: PRIFT [3] and the work by Badawi et al. [5] aim to accelerate FHE using NVIDIA GPUs. Although they support most HE blocks, they do not accelerate bootstrapping. 100x [41] speeds up all HE blocks, including bootstrapping. While 100x optimizes off-chip memory transactions through kernel fusions, their implementation still incurs redundant memory transactions due to the partitioned on-chip memory of the V100. Locality-aware block scheduling [51] has been proposed for GPUs to maximize locality within each core; in contrast, LABS maximizes locality by exploiting the globally shared LDS enabled by the proposed cNoC.
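The locality that LABS exploits can be illustrated with a simple greedy reordering that keeps data-sharing blocks adjacent in the schedule. This is a deliberate simplification of the mesh-partitioning-plus-annealing pipeline described earlier; the block names and the `shared` table are made up for the example:

```python
def locality_order(blocks, shared):
    """Greedily order FHE blocks so consecutive blocks share the most data.
    shared[(a, b)] gives the data volume reused between blocks a and b."""
    remaining = set(blocks)
    order = [blocks[0]]
    remaining.remove(blocks[0])
    while remaining:
        prev = order[-1]
        # Pick the unscheduled block that reuses the most data from the last one.
        nxt = max(remaining,
                  key=lambda b: shared.get((prev, b), shared.get((b, prev), 0)))
        order.append(nxt)
        remaining.remove(nxt)
    return order

blocks = ["NTT", "MulKey", "INTT", "Rescale"]
shared = {("NTT", "INTT"): 32, ("NTT", "MulKey"): 8, ("INTT", "Rescale"): 16}
print(locality_order(blocks, shared))  # ['NTT', 'INTT', 'Rescale', 'MulKey']
```

Scheduling data-sharing blocks back-to-back keeps their operands live in the globally shared LDS, which is precisely what eliminates the redundant DRAM traffic discussed above.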
FPGA accelerators: Multiple prior efforts [46,47,66,68] have developed FPGA designs for FHE workloads. However, most of them either do not cover all HE primitives or only support smaller parameter sets that allow computation up to a multiplicative depth of 10. HEAX [66] is an FPGA-based accelerator that only speeds up CKKS encrypted multiplication, with the remainder offloaded to the host processor.
FAB demonstrates performance comparable to the previous GPU implementation 100x [41] and the ASIC designs BTS [45] and F1 [69] for certain FHE workloads. Although FPGAs show great potential for accelerating FHE workloads, they are limited by low operating frequencies and constrained compute resources. Furthermore, the substantial communication overhead and the time required to program the FPGA discourage their wide-scale deployment [63].

ASIC accelerators: Several recent ASIC designs, including F1 [69], CraterLake [70], BTS [45], and ARK [44], accelerate the CKKS FHE scheme. The F1 implementation makes use of small N and Q values, implementing only single-slot bootstrapping. BTS is the first ASIC proposal demonstrating the performance of fully-packed CKKS bootstrapping. The CraterLake and ARK designs further enhance packed CKKS bootstrapping performance and demonstrate several orders of magnitude of performance improvement across various workloads.

CONCLUSION
In this work, we present an ambitious plan for extending existing GPUs to support FHE. We propose three novel microarchitectural extensions complemented by a compile-time optimization. We introduce a 2D torus on-chip network that caters to the all-to-all communication patterns of FHE workloads. Our native modular reduction ISA extension reduces the latency of the modular reduction operation by 43%. We enable native support for 64-bit integer arithmetic to mitigate math pipeline throttling. Our proposed BlockSim simulator enhances the capabilities of the open-source GPU simulator NaviSim, allowing coarse-grained simulation for faster design space exploration. Overall, compared against previous state-of-the-art GPU implementations [41], we obtain an average speedup of 14.6× across workloads, and we also outperform the CPU, FPGA, and some ASIC implementations.

Figure 1: FHE offers a safeguard against online eavesdroppers as well as untrusted cloud services by allowing direct computation on encrypted data.

Figure 2: The four key contributions of our work (indicated in green) evaluated within the context of an AMD CDNA GPU architecture.

Figure 4: Inter-CU communication: Traditional vs proposed communication with on-chip network

Figure 5: Proposed hierarchical on-chip network featuring a concentrated 2D torus topology

Figure 6: Influence of each proposed microarchitectural extension on architectural performance metrics. Metrics illustrate a cumulative profile where each enhancement builds upon the preceding set of improvements.

Figure 7: Speedup achieved from each microarchitectural extension. The baseline refers to a vanilla MI100 GPU. The reported speedup is cumulative, with each microarchitectural enhancement building upon the previous ones.

Figure 8: Exploring the impact of on-chip memory size on FHE workload performance

Table 2: HE building blocks using CKKS.
Block | Computation | Description
ScalarAdd(⟨m⟩, c) | ⟨m⟩ + c = (B_m + c, A_m) | Add a scalar c to a ciphertext, where c is a length-N vector with every element c
ScalarMult | … | …

Table 3: Practical parameters for our FHE operations.
log(q) | N | log Q | L | L_boot | dnum | fftIter | λ

Table 4: Cycle counts for 64-bit modulus instructions comparing the MOD and WMAC microarchitectural features. Cycle count is averaged over 10,000 modulus instructions computed on cached data (using the LDS cache) and rounded to the nearest integer. ∆ Modular operation is computed with various compile-time prime constants as the modulus, incorporating compiler optimizations into the performance.

Table 7: Performance of various FHE building blocks. The values displayed here exclude contributions from the LABS optimization, as LABS is an inter-block optimization, and the metrics provided are intended for individual blocks. LABS takes advantage of the on-chip ciphertext preservation enabled by our cNoC microarchitectural enhancement. Across the bootstrapping, HE-LR, and ResNet-20 workloads, LABS consistently delivers additional speedup.

Table 8: HE workloads execution time comparison of the proposed GME extensions with other architectures. † F1 is limited to single-slot bootstrapping, while the others support packed bootstrapping.

Table 9: Potential benefits of the proposed microarchitectural extensions across various workloads.