A Tensor Compiler with Automatic Data Packing for Simple and Efficient Fully Homomorphic Encryption

Fully Homomorphic Encryption (FHE) enables computing on encrypted data, letting clients securely offload computation to untrusted servers. While enticing, FHE has two key challenges that limit its applicability: it has high performance overheads (10,000× over unencrypted computation) and it is extremely hard to program. Recent hardware accelerators and algorithmic improvements have reduced FHE's overheads and enabled large applications to run under FHE. These large applications exacerbate FHE's programmability challenges. Writing FHE programs directly is hard because FHE schemes expose a restrictive, low-level interface that prevents abstraction and composition. Specifically, FHE requires packing encrypted data into large vectors (tens of thousands of elements long), FHE provides limited operations on these vectors, and values have noise that grows with each operation, which creates unintuitive performance tradeoffs. As a result, translating large applications, like neural networks, into efficient FHE circuits takes substantial tedious work. We address FHE's programmability challenges with the Fhelipe FHE compiler. Fhelipe exposes a simple, numpy-style tensor programming interface, and compiles high-level tensor programs into efficient FHE circuits. Fhelipe's key contribution is automatic data packing, which chooses data layouts for tensors and packs them into ciphertexts to maximize performance. Our novel framework considers a wide range of layouts and optimizes them analytically. This lets Fhelipe compile large FHE programs efficiently, unlike prior FHE compilers, which either use inefficient layouts or do not scale beyond tiny programs. We evaluate Fhelipe on both a state-of-the-art FHE accelerator and a CPU. Fhelipe is the first compiler that matches or exceeds the performance of large hand-optimized FHE applications, like deep neural networks, and outperforms a state-of-the-art FHE compiler by gmean 18.5×. At the same time, Fhelipe dramatically simplifies programming, reducing code size by 10×–48×.


INTRODUCTION
Fully Homomorphic Encryption (FHE) is an emerging class of encryption that allows computing directly on encrypted data. FHE enables offloading computation to untrusted servers in the cloud with cryptographic privacy. While enticing, FHE is rarely used today due to two key challenges: it has high performance overheads, and it is extremely hard to program.
Currently, FHE programs are about 10,000× slower than their unencrypted equivalents when run on a CPU. This has sparked work that has countered most of these overheads: FHE accelerator chips [45–47, 74, 75] are about 10,000× faster than CPUs, and recent GPU [42] and FPGA [5, 85] implementations achieve speedups beyond 100×. For example, deep neural networks like ResNet can be evaluated in seconds on a GPU using modern FHE libraries and optimizations [66].
As FHE infrastructure and optimizations enable large and complex applications, it becomes crucial to tackle the programmability challenges of FHE through new compiler techniques. While recent work has proposed compilers for FHE, they either leave significant performance on the table or tackle only small programs with tens of operations. Thus, large applications are still coded by hand, a tedious process that takes weeks or months of work. In this paper, we present a compiler called Fhelipe that, for the first time, translates large applications into efficient FHE programs that match or outperform state-of-the-art, painstakingly optimized hand-coded programs.
As a result, finding efficient layouts manually is very tricky (Sec. 2.2). For example, Lee et al.'s state-of-the-art FHE ResNet introduces a new layout for efficient multi-channel convolutions [51–53] that relies on a complex interleaving and builds on years of prior work on this problem [13, 31, 44].
Why are current FHE compilers insufficient? Recent FHE compilers recognize that data layouts are key, but they either leave substantial performance on the table or only handle tiny programs.
CHET [26], HECO [79], and HeLayers [6] are FHE compilers that abstract data layouts: programmers use a high-level language, and they automatically pack data into FHE vectors. But these compilers use a limited set of layouts (e.g., row-major for CHET, column-major for HECO, and the same layout throughout the program for HeLayers), which forces either inefficient packing or expensive conversions. Moreover, these compilers rely on profiling to choose layouts, which is expensive and greatly limits their search space. Though these compilers scale to large programs, their limited layouts cause order-of-magnitude overheads over hand-coded benchmarks.
Prior work has also proposed compilers based on program synthesis, Porcupine [23] and Coyote [58]. These are analogous to superoptimizers [59]: they produce high-performance FHE circuits, but work only on tiny kernels with tens of instructions [23, 58] and cannot scale to large applications.
To tackle these challenges, we present Fhelipe, a tensor compiler that translates applications written in a simple, numpy-style tensor language to efficient FHE programs. Fhelipe relies on two key novel contributions:
1. A flexible framework to represent and optimize data layouts, achieving the dual goals of maximizing packing and minimizing costly layout conversions. Our approach builds on four new techniques. First, we introduce a new layout representation that enables many packing choices, including arbitrary dimension orders and interleaved dimensions. This representation captures the complex layouts used in real-world programs. Second, we contribute a novel compaction technique that leverages our flexible layouts to pack data with little cost and achieve high utilization. Third, we contribute an analytical layout assignment procedure that chooses layouts throughout the program to minimize the number and cost of layout conversions. This process systematically negotiates complex choices, and avoids the limitations of prior profiling-based approaches. Fourth, we contribute novel layout conversion techniques that reduce the cost of necessary conversions.
2. Automatic end-to-end noise management, including bootstrapping: Compiling large FHE programs requires addressing one more challenge beyond layouts: to cope with noise, programs must perform auxiliary noise management operations to prevent data corruption. Prior compilers like EVA and HECATE [25, 55] have automated the local aspects of noise management (rescaling). However, large programs require bootstraps, expensive operations that reset ciphertext noise. Bootstraps often dominate execution time, and placing them well is hard because it requires reasoning about global program structure; optimal bootstrap placement is NP-hard [9], and the only prior automatic technique, FHE-Booster [83], places bootstraps poorly (Sec. 8.1).
We present a new scalable algorithm for automatic bootstrap placement. We identify a set of simple heuristics that reduces this problem to dynamic programming. This lets Fhelipe efficiently place bootstraps in close to linear time.
Together, these contributions enable abstraction and composition. Even local changes to an FHE kernel often change the layout and noise of its output. Without a compiler, this requires (1) rewriting all downstream kernels to use compatible layouts and (2) performing noise management from scratch. Fhelipe automates these, drastically reducing programmer effort: for example, Lee et al.'s hand-coded ResNet takes 4,800 lines of code [52]; with Fhelipe, it takes just 100.
We implement Fhelipe targeting CKKS on both CPUs and CraterLake [75], a state-of-the-art FHE accelerator. We evaluate Fhelipe on several complex FHE programs, including neural networks, logistic regression training, and higher-dimensional tensor kernels. Fhelipe is the first compiler to match or exceed the performance of these large and carefully hand-optimized FHE applications, with speedups of up to 12.3×; Fhelipe also outperforms CHET+, which is CHET+EVA extended with automatic bootstrapping, by gmean 18.5×.

BACKGROUND AND MOTIVATION
In this section, we first present the necessary background on FHE, focusing on the CKKS scheme targeted by Fhelipe; then, we show the importance of data layouts in FHE; finally, we discuss prior compilers and their limitations.

FHE Schemes
There are two types of FHE schemes: vector schemes, like BGV and CKKS [11, 12, 18, 29], encrypt long vectors of numbers and provide arithmetic operations on them, while scalar schemes, like FHEW/TFHE [21, 27], encrypt one value per ciphertext, typically a Boolean or a small integer.
Scalar schemes are more flexible than vector ones, but they have much higher overheads, especially in data-parallel applications. For example, in ResNet-20, the Lattigo CPU CKKS library takes 31 µs on average per application-level scalar multiply (Table 5); the TFHE-rs [86, 88] library takes 2.1 s per 32-bit multiply, over 60,000× slower. For this reason, Fhelipe targets CKKS [18], the state-of-the-art vector scheme. But Fhelipe's techniques apply to all vector schemes, as they all have the same operations and performance trade-offs.
We introduce CKKS's interface, i.e., its datatypes and operations, without delving into its implementation. We present only the details needed to understand CKKS's tradeoffs between performance, security, and correctness; full implementation details are available in prior work [18, 69, 75].
Ciphertexts encrypt long vectors: In CKKS, each ciphertext encrypts a vector of N fixed-point numbers. Each ciphertext has a scale parameter, S, that determines the width of the fractional part of each element; typically, S is between 30–60 bits.
Internally, ciphertexts are represented using two polynomials with integer coefficients modulo some value. Typically, the coefficient bitwidth W is over 1,000 bits, much larger than the scale S.
Due to security requirements (that we detail later), the vector length N must be quite large, typically 32K or 64K. As leaving vector slots unused does not reduce operation costs, applications must find ways to fill these large vectors to avoid ineffectual work.
FHE provides a limited set of operations: Ciphertexts support only three operations, called homomorphic operations: elementwise adds, elementwise multiplies, and cyclic rotates of the underlying encrypted vectors (add and multiply allow their second operand to be unencrypted).
FHE programs are static dataflow graphs of these operations: since data is encrypted, there is no data-dependent control flow and all operations are known in advance.
Note that FHE provides no way to access individual vector elements. As a result, shuffling vector elements is tremendously expensive, requiring many rotates and multiplies; this is a major difference between FHE and conventional SIMD processing, as in vector instructions or GPUs.
Homomorphic operations have costs that vary with vector length N and coefficient width W:¹
• Adds of all types, and multiplies of an encrypted and an unencrypted vector, are cheap, costing O(N · W).
• Rotates, and multiplies of two encrypted vectors, are expensive, costing O(N · W²).
However, these direct costs tell only part of the story: operations also indirectly affect the cost of other operations due to noise, which we discuss next.
¹ Costs are given in aggregate bit-complexity per homomorphic operation; they are derived from a state-of-the-art CKKS implementation that is optimized with RNS representation, NTTs, and multi-digit keyswitching [63, 75].
Performance depends on multiplicative depth: For security, ciphertexts are encrypted with a small amount of noise. Unfortunately, homomorphic operations increase noise, and if noise becomes too large, it corrupts the underlying encrypted values. Noise grows primarily due to multiplies: each one adds about S bits of noise, where S is the scale. Adds and rotates add negligible noise.
There are two main techniques to keep noise in check: (1) Noise is trimmed by progressively narrowing the coefficient width W. In CKKS, W is reduced by about S bits (the scale) after each multiply through two operations, rescaling and mod-switching. Thus, each ciphertext supports L ≈ W/S noise trims before running out of bitwidth; we call L the level of the ciphertext. Rescaling and mod-switching are cheap, and prior work has already proposed effective ways to apply them automatically [25, 54, 55].
(2) Once the coefficient width W cannot be narrowed further (the level L reaches 0), the ciphertext must be bootstrapped, a procedure that lowers noise and restores the ciphertext to a high L, letting it undergo more operations. Though bootstraps enable arbitrarily deep computations, they are extremely expensive: they involve hundreds of rotates and multiplies of high-L ciphertexts. Thus, they should be minimized. Fig. 2 shows how L evolves in a typical program: L decreases as noise is trimmed, then is restored during bootstrapping. Since multiplies consume L, the number of bootstraps depends on the application's multiplicative depth (i.e., the longest chain of multiplies). And since bootstraps dominate cost in most FHE programs, reducing multiplicative depth is in practice far more important than minimizing direct operation costs.
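As a rough illustration of how multiplicative depth drives the number of bootstraps, the sketch below models the level of a ciphertext as the coefficient width divided by the scale, and counts bootstraps along a single chain of multiplies. This is a deliberately simplified model, not Fhelipe's placement algorithm: the function names and the example parameters (a 1,200-bit width and a 40-bit scale) are assumptions chosen for illustration, and the model ignores the levels that bootstrapping itself consumes.

```python
import math


def levels(width_bits: int, scale_bits: int) -> int:
    """Noise trims a ciphertext supports: roughly width / scale."""
    return width_bits // scale_bits


def bootstraps_needed(mult_depth: int, usable_levels: int) -> int:
    """Bootstraps along one chain of multiplies.

    Simplifying assumptions: every multiply consumes one level, a fresh
    ciphertext starts with `usable_levels`, and every bootstrap restores
    the ciphertext to `usable_levels`.
    """
    if mult_depth <= usable_levels:
        return 0
    return math.ceil(mult_depth / usable_levels) - 1


example_levels = levels(width_bits=1200, scale_bits=40)
example_bootstraps = bootstraps_needed(mult_depth=100, usable_levels=example_levels)
print(example_levels, example_bootstraps)
```

Under this model, halving the multiplicative depth roughly halves the number of bootstraps, which is why depth matters more than the direct cost of individual operations.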

Data Layout Is Crucial
Selecting good layouts is crucial in FHE. We first show this concretely with a simple example, then discuss how this problem affects more complex prior applications.
Consider a workload that consists of successive matrix-vector multiplications: each step multiplies an m-element vector with an m × m matrix, producing an m-element output vector for the next step. This workload arises frequently, e.g., in fully connected or recurrent neural network (RNN) layers.
FHE ciphertexts have thousands of slots, and using all of these slots is crucial for performance. For concreteness, assume vectors of m = 128 elements and ciphertexts of N = 16K slots. Placing each matrix row in a separate ciphertext would be tremendously inefficient, as each ciphertext would use only 128 of these slots. Instead, we must pack data further: we store the entire m × m matrix in one ciphertext, using all its 16K slots. Fig. 3 shows this example scaled down to m = 4 and N = 16.
To perform the matrix-vector multiply efficiently, we replicate vector x to match the shape of matrix A (Fig. 3b, step 1), then compute all partial products using a single FHE multiply (step 2), and finally sum each row's partial products to produce the output vector (step 3). This procedure is efficient because replicate and sum are relatively cheap in FHE, performing only log₂ m = 7 rotates and adds; overall, this computes the matrix-vector multiply with multiplicative depth 1 and modest overhead.
The problem is that this procedure leaves the output in a different format than the input: while the elements of the input x are contiguous, the elements of the output y have gaps of m − 1 unused slots. These gaps cannot be removed efficiently: rotates are the only mechanism for moving elements between slots, and each of the m output elements must be rotated by a different amount. Converting y back to the contiguous layout of x requires 128 rotates and masks, which would add about 10× overhead.
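The row-major procedure and the gaps it leaves can be reproduced on plain, unencrypted vectors. The sketch below uses NumPy's `np.roll` as a stand-in for FHE rotates, with the scaled-down sizes of 4-element vectors and 16-slot ciphertexts; the helper names (`rotate`, `replicate`, `matvec_row_major`) are hypothetical, and the code illustrates only the data movement, not an encrypted implementation.

```python
import numpy as np


def rotate(v, k):
    """Cyclic left-rotate: slot i of the result holds v[(i + k) % len(v)]."""
    return np.roll(v, -k)


def replicate(x_padded, m):
    """Step 1: copy the m contiguous elements of x into every row.

    Assumes x sits in slots 0..m-1 and all other slots are zero.
    Each rotate-add doubles the number of copies.
    """
    v = x_padded.copy()
    shift = m
    while shift < len(v):
        v = v + rotate(v, -shift)  # right-rotate by `shift` and accumulate
        shift *= 2
    return v


def matvec_row_major(a_row_major, x_padded, m):
    rep = replicate(x_padded, m)        # step 1: replicate x
    acc = a_row_major * rep             # step 2: one elementwise multiply
    shift = 1
    while shift < m:                    # step 3: sum within each row
        acc = acc + rotate(acc, shift)
        shift *= 2
    return acc                          # row r's sum lands in slot r * m


m = 4
A = np.arange(1, m * m + 1).reshape(m, m)
x = np.array([1, 2, 3, 4])
x_padded = np.zeros(m * m, dtype=int)
x_padded[:m] = x

out = matvec_row_major(A.reshape(-1), x_padded, m)
y_strided = out[::m]  # valid outputs sit m slots apart; other slots hold partial sums
print(y_strided)
```

Only every fourth slot of `out` carries an element of the result; the slots in between hold leftover partial sums, which is the strided output layout described above.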
The right approach is to use different layouts for successive matrix-vector multiplies: instead of converting y to follow the layout of x, we can find a different procedure that uses y as-is. In this simple example, this is achieved by alternating row-major and column-major layouts on successive matrix-vector multiplies, as Fig. 3c shows: a column-major procedure takes an input with gaps, and produces an output with contiguous elements. Stitching these layouts manually is tedious: changing the layout of a tensor requires rewriting all downstream computation. And this example only scratches the surface of the issues in finding good layouts in complex applications.
For a more complex example, consider deep learning. In 2018, GAZELLE [44] proposed a clever layout for convolutional layers that allowed packing across input or output channels, but not both. In 2022, Lee et al. [52] invented a more complex layout that allows full packing, even with striding, and improves ResNet performance by about 10×. This layout uses a tricky interleaving of elements (see Sec. 5.6 and [52, Fig. 3, Fig. 5]). Similarly, LoLa [13] achieves large speedups over CryptoNets [31] through the use of efficient data layouts.
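Returning to the matrix-vector example, the column-major step that consumes a strided input can also be sketched on plain vectors. The code below is an illustration under stated assumptions, not Fhelipe's generated circuit: `np.roll` stands in for FHE rotates, the helper names are hypothetical, and the input's gap slots are assumed to have already been zeroed (for example by a mask multiply).

```python
import numpy as np


def rotate(v, k):
    """Cyclic left-rotate: slot i of the result holds v[(i + k) % len(v)]."""
    return np.roll(v, -k)


def matvec_col_major(b_col_major, y_gapped, m):
    """Multiply a column-major matrix by a vector stored with stride m.

    Assumes element c of the input sits in slot c * m and that the
    m - 1 gap slots after it are zero.
    """
    rep = y_gapped.copy()
    shift = 1
    while shift < m:                    # fill each gap with its element
        rep = rep + rotate(rep, -shift)
        shift *= 2
    acc = b_col_major * rep             # one elementwise multiply
    shift = m
    while shift < len(acc):             # sum across the m column blocks
        acc = acc + rotate(acc, shift)
        shift *= 2
    return acc                          # result r lands in slot r (contiguous)


m = 4
B = np.arange(1, m * m + 1).reshape(m, m)
y = np.array([1, 2, 3, 4])

y_gapped = np.zeros(m * m, dtype=int)
y_gapped[::m] = y                       # strided input, zeros in the gaps
b_col_major = B.T.reshape(-1)           # slot c * m + r holds B[r, c]

out = matvec_col_major(b_col_major, y_gapped, m)
z = out[:m]                             # contiguous output
print(z)
```

Because the output is contiguous again, it can feed a row-major step directly, so alternating the two procedures avoids any layout conversion between successive multiplies.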
In summary, experts spend substantial effort to find data layouts, and the key contribution in many FHE application papers is a new layout. Fhelipe is the first compiler that finds and systematically optimizes these layouts, matching or outperforming these manual implementations.

Prior FHE Compilers
Prior FHE compilers automate important aspects of FHE programming, but have key limitations. Table 1 summarizes the main differences among these compilers.
Tensor compilers: CHET [26] is a domain-specific compiler for neural networks. CHET abstracts the data layout of FHE programs, like Fhelipe. However, CHET is very different from Fhelipe: First, it provides a limited interface that supports only a few coarse-grained operations (e.g., convolutions and fully connected layers); by contrast, Fhelipe exposes a general tensor programming interface that enables a broad set of applications beyond the specific neural nets targeted by CHET. Second, CHET considers only row-major layouts; Fhelipe considers a far wider range of layouts, where dimensions can be in arbitrary orders and interleavings. Third, CHET compares layout choices by profiling, which limits it to evaluating only four layout combinations per program; instead, Fhelipe selects layouts analytically, without profiling, and systematically produces programs that combine hundreds of different layouts in linear time. As Sec. 8 shows, CHET's limited layouts cause large overheads: gmean 18.5× in neural networks and up to 7,600× in tensor applications.
Other tensor compilers improve some aspects of CHET, but share many of its limitations. HECO [79] is a more general compiler than CHET that works by automatically vectorizing scalar loop nests to FHE. But like CHET, HECO uses a fixed layout (column-major) for all tensors. AHEC [16] compiles a range of machine learning frameworks down to different hardware backends, but it also does not optimize layouts.
nGraph-HE2 [10] and SEALion [77] are also compilers for neural networks, but they use only one ciphertext slot per inference and rely on batching to use more slots. Batching many inferences together makes vectorization easy, but it is impractical: using all slots requires batching N = 32K inferences. Individual clients cannot provide this many inferences, and on large networks like ResNet-20, batched activations take 100s of GB of memory [52, 75].
Finally, HeLayers [6] is a compiler for neural networks that supports a wider range of layouts than row- or column-major. But these are only a small subset of Fhelipe's layouts, which limits packing. Moreover, HeLayers picks a single layout for the whole program, and uses profiling to select it. Thus, HeLayers has similar overheads to CHET: its evaluation reports similar performance on single inferences, and HeLayers outperforms CHET substantially only when batching is used [6].
Vector compilers: EVA [25], HECATE [55], and ELASM [54] abstract key details of CKKS, preventing several correctness and security bugs. Their main contributions are efficient techniques for inserting rescaling (managing noise by trimming coefficients, Sec. 2.1). However, these compilers expose a vector interface that leaves data layouts to programmers. Their techniques are orthogonal to our contributions, and in fact, Fhelipe adopts EVA's waterline rescaling.
Alchemy [24], E3 [20], Marble [81], and T2 [33] are also vector compilers, but they do not optimize rescaling.
Program synthesis: Porcupine [23] and Coyote [58] use program synthesis techniques to optimize tiny FHE kernels. Like Fhelipe, Coyote optimizes layouts, whereas Porcupine leaves them to programmers. These techniques work well on small programs, but are limited to programs with only tens of scalar operations: they are so expensive that they would take years to compile any real-world application. For example, Coyote [58] takes ≈10 s per scalar operation. Extrapolating linearly (generous, since its search algorithms are superlinear), ResNet-20 (120M operations) would take about 1.2 × 10⁹ s, i.e., roughly 38 years. By contrast, Fhelipe leverages scalable optimizations to compile programs with millions of operations in seconds (e.g., 15 s for ResNet-20).
Automatic bootstrapping: All compilers mentioned so far target small applications and do not perform bootstrapping. FHE-Booster [83] is recent work that automates bootstrapping. However, FHE-Booster uses score-based heuristics that achieve limited speedups and incur pathologies, as we show in Sec. 8. Moreover, these heuristics have superlinear runtime, and take many minutes or fail to complete in some circuits [82]. By contrast, Fhelipe uses heuristics to reduce the problem to dynamic programming, which it solves optimally in close to linear time.
DaCapo [19] is concurrent work that, like Fhelipe, automates bootstrapping by applying heuristics to reduce the search space and then optimizing in polynomial time.
Scalar FHE compilers: Several prior compilers target scalar FHE schemes [14, 15, 22, 32, 50, 80, 87], like TFHE. These compilers have different objectives from vector ones: scalar schemes encrypt one value per ciphertext, so packing is unnecessary, and they bootstrap after every operation. However, as Sec. 2.1 discussed, scalar schemes have much higher overheads than vector ones.
Fig. 4 shows an overview of Fhelipe. Fhelipe takes as input a program written in a simple tensor language (Sec. 4) that hides FHE details. Fhelipe first parses the program to produce a dataflow graph (DFG) of tensor operations, which is refined in successive passes. Then, Fhelipe assigns layouts to tensors (Sec. 5), inserting layout conversions when needed. Next, Fhelipe applies noise-management techniques: waterline rescaling first, and then automatic bootstrapping (Sec. 6).

Finally, Fhelipe lowers tensor operations to CKKS homomorphic operations on vector ciphertexts. The resulting circuit can be executed by a variety of backends: Fhelipe currently supports the Lattigo CPU FHE library [63] and the CraterLake FHE accelerator [75]. Adding other backends (e.g., other CPU [1, 3, 7, 34] and GPU [43] libraries) would be easy: for scale, the Lattigo backend is only 400 lines of code (2% of the codebase).

FHELIPE PROGRAMMING INTERFACE
Fhelipe's input language represents data as tensors: multidimensional arrays of fixed-point numbers. Table 2 details the language's native operations. This is a Python-embedded DSL, providing usual conveniences like functions and loops. Overall, this interface is similar to numpy [39] and other tensor languages, like those provided by PyTorch [68] and TensorFlow [4]. Listings 1 and 2 show two basic examples: matrix-vector multiply and convolution. Note that, while tensors have a logical shape, their ciphertext layout is left unspecified. For instance, Listing 1 can be synthesized using both the row-major and the column-major layouts from Sec. 2.2 (and many others).
Because Fhelipe enables composability, programmers can reuse procedures like these to build more complex ones. We implement a simple standard library that we reuse across applications. It includes common kernels (e.g., convolutional and fully connected layers) and non-linear functions (e.g., ReLU and sigmoid), which in FHE are approximated using polynomials.

AUTOMATIC DATA PACKING
Fhelipe's key contribution is a framework to represent, analyze, and choose tensor layouts that efficiently pack data into FHE's enormous vectors. FHE has unique restrictions and optimization goals that are not present in unencrypted computation. Prior work, including tensor compilers [4, 68], tensor optimizers [65, 72, 84, 89], and automatic vectorization techniques [17, 60, 70, 78], optimizes data layouts to use SIMD datapaths and systolic arrays well, reduce data transformations and shuffles, and tile to reduce data movement. By contrast, FHE requires optimizing layouts to pack much larger vectors and minimize multiplicative depth, while coping with operations that create large gaps and avoiding expensive layout conversions. New techniques are necessary to reason about these tradeoffs and choose appropriate layouts.
Fhelipe's layout framework combines four novel contributions. First, we introduce a flexible layout representation (Sec. 5.1) that supports arbitrary dimension orders, interleavings of dimensions, and gaps. This representation generalizes the wide range of packing choices proposed in prior FHE applications, enables new optimizations, and reduces data transformations. Second, we introduce a compaction procedure (Sec. 5.2) that leverages our flexible layouts to keep ciphertexts highly packed. Third, we present a layout assignment algorithm (Sec. 5.4) that operates analytically and without profiling, by reasoning about the work added by conversions induced by incompatible layouts. Fourth, novel FHE permutation algorithms (Sec. 5.5) reduce the cost of needed conversions.
These contributions open a wide range of layouts and enable the compiler to optimize them quickly. We showcase these new capabilities with two end-to-end examples (Sec. 5.6).

Flexible Layout Representation
To motivate the need for flexible layouts, consider again the matrix-vector multiply example in Sec. 2.2. If we restricted all tensors to a row-major layout, we would miss the efficient implementation that alternates row-major and column-major layouts. To enable this, layouts need two key ingredients. First, they need to support arbitrary dimension orders (e.g., row-major and column-major in this case). Second, they need to support gaps, runs of empty slots that arise during tensor operations like striding or summing. For instance, in Fig. 3b the m-element output vector y has a stride of m, with m − 1 gaps between each element, due to the summing of partial products. To avoid conversions, we must track and cope with these gaps.
Fhelipe's layout representation is even more flexible than just allowing arbitrary dimension orders and gaps: it allows arbitrary permutations of the bits of each index.
Layout definition: Consider a tensor with dimension indices i, j, k, ..., each with a different number of bits n_i, n_j, n_k, .... Let the string B = (i_{n_i−1}, ..., i_0, j_{n_j−1}, ..., j_0, k_{n_k−1}, ...) be the concatenation of the individual bits of all indices. Then, a layout of this tensor is any permutation of the elements of B, with gap bits (denoted g) interleaved arbitrarily.
Fig. 5 shows the flexibility of this representation: it allows performing many operations by simply changing the tensor layout (the permutation of index bits), without changing the underlying ciphertexts. Fig. 5 shows that transposing two dimensions simply swaps their layout bits; and shrinking a dimension, striding by a power of two, and sum-reducing simply introduce gap bits.
Additionally, allowing the bits of each dimension to be out of order reduces the costs of more complex transformations, like compaction, which we will see in detail later.
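To make the bit-permutation view concrete, the sketch below represents a layout as a list of index bits, most significant slot-address bit first, with `None` standing for a gap bit. This encoding and the `slot_of` helper are assumptions made for illustration, not Fhelipe's internal data structure. It shows that a transpose or a power-of-two stride changes only the layout description, while the slot that holds each stored element stays put.

```python
def slot_of(index, layout):
    """Map a tensor index to its slot address.

    `layout` lists address bits from most to least significant. Each
    entry is either (dimension name, bit position) or None for a gap
    bit, which is always 0 for valid elements.
    """
    slot = 0
    for entry in layout:
        slot <<= 1
        if entry is not None:
            dim, bit = entry
            slot |= (index[dim] >> bit) & 1
    return slot


# A 4x4 matrix: dimensions i and j have two bits each.
row_major = [("i", 1), ("i", 0), ("j", 1), ("j", 0)]

# Transposing swaps the roles of the two dimensions' bits.
col_major = [("j", 1), ("j", 0), ("i", 1), ("i", 0)]

# Striding j by 2 keeps elements whose low j bit was 0: the old high
# bit becomes the new j bit 0, and the old low bit becomes a gap.
strided_j = [("i", 1), ("i", 0), ("j", 0), None]

print(slot_of({"i": 2, "j": 1}, row_major))
print(slot_of({"i": 2, "j": 1}, col_major))
print(slot_of({"i": 2, "j": 1}, strided_j))
```

In the strided layout, element (i=2, j=1) of the new tensor is found in the slot that held element (i=2, j=2) of the original row-major tensor, so no ciphertext data has to move.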

Restrictions:
The key limitation of this layout format is that dimensions whose size is not a power of two need padding (e.g., a 3×3 matrix would have one element unused between rows). However, padding non-power-of-two dimensions is a good choice overall: it simplifies layout conversions, and it is a natural fit for FHE vectors, which are power-of-two sized for performance reasons.
Slot- and ciphertext-selecting bits: As ciphertexts have only N slots, larger tensors must be stored across multiple ciphertexts. By convention, the lowest-order log₂ N bits of the layout encode the slot of each element, and the remaining bits (the layout's length minus log₂ N) encode its specific ciphertext. We call these bits slot-selecting and ciphertext-selecting, respectively.
Since we can access ciphertexts individually (unlike slots within the ciphertext), ciphertext-selecting bits are more flexible than slot-selecting bits: they can be reordered for free and gaps can be discarded with no overhead.

Compaction
As we have seen, common operations like striding and summing introduce gaps in slot-selecting bits. Gaps cause low utilization, as they leave many ciphertext slots unused. With restrictive layouts (e.g., row- or column-major), eliminating these gaps would require an expensive format conversion (Sec. 2.2). But our flexible layouts make eliminating gaps cheap, by converting ciphertext-selecting bits into slot-selecting bits, replacing the gap bits. We call this process compaction.
Fig. 6 shows compaction at work. A 4×4×4 tensor that takes four 16-slot ciphertexts is strided in the j and k dimensions, creating a 4×2×2 tensor with two gap bits in its layout. Compaction merges these four ciphertexts into a single ciphertext, by filling gap bits with ciphertext-selecting bits. This produces a tensor with layout (j_0, i_1, k_0, i_0). Fig. 6 (right) shows that compaction is relatively cheap: ciphertexts have the same gap pattern, so they can be combined using only one rotate and add each, without consuming any levels. This also shows why we allow the bits of a dimension to be out of order: cheap compaction would have been impossible if the bits of i had to be contiguous.
Compaction's compute cost is quickly recouped: compacting by a factor of f takes f − 1 rotates and adds, and reduces downstream computation by a factor of f. Thus, compaction breaks even after a single rotate or multiply, and brings large savings when followed by more expensive operations like polynomials or bootstraps. As a result, compaction lets us eliminate all inefficiencies due to gaps: compaction is automatically performed on any tensor that spans multiple ciphertexts, and gaps are unavoidable for small tensors that already fit within a single ciphertext.
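The following sketch replays a compaction like the 4×4×4 example on plain vectors, under a few labeled assumptions: `np.roll` stands in for FHE rotates, the three dimensions are called i, j, and k, each of the four ciphertexts holds one slice along i in row-major order, and the strided dimensions are j and k. It is meant to show why ciphertexts with identical gap patterns merge with one rotate and one add each, not to reproduce Fhelipe's implementation.

```python
import numpy as np

# t[i, j, k]: a 4x4x4 tensor stored as four 16-slot "ciphertexts",
# one per value of i, with slot address j * 4 + k.
t = np.arange(1, 65).reshape(4, 4, 4)
cts = [t[i].reshape(-1).copy() for i in range(4)]

# Stride j and k by 2: keep only slots whose low j bit (address bit 2)
# and low k bit (address bit 0) are zero, and zero out the rest.
keep = np.array([1 if (s & 0b0101) == 0 else 0 for s in range(16)])
strided = [ct * keep for ct in cts]


def compact(strided_cts):
    """Merge four gapped ciphertexts into one fully packed ciphertext.

    The two bits of the ciphertext number i are moved into the two gap
    bits of the slot address: the high bit of i goes to address bit 2
    and the low bit of i goes to address bit 0.
    """
    packed = np.zeros_like(strided_cts[0])
    rotates = 0
    for i, ct in enumerate(strided_cts):
        offset = ((i >> 1) & 1) * 4 + (i & 1)
        if offset:
            ct = np.roll(ct, offset)    # one right-rotate per ciphertext
            rotates += 1
        packed = packed + ct            # one add per ciphertext
    return packed, rotates


packed, rotates = compact(strided)
print(packed, rotates)
```

Here four ciphertexts collapse into one using three rotates, and every later operation on the tensor touches a single ciphertext instead of four.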

Implementation of Tensor Operations
Fhelipe's layout representation makes it easy to perform operations efficiently on all tensor layouts:
• Reshapes (extend, shrink, stride, drop_dim, insert_dim, reorder_dim) only change the tensor's layout string, but do not modify the underlying data.
• replicate and sum perform a logarithmic sequence of rotates and adds. Both operations rotate by the powers of 2 corresponding to the bits of the replicated or summed dimension: in replicate, each rotate-add doubles the number of replicated copies; in sum, each rotate-add produces partial sums on subsets of double the size.
• shift, rotate, and layout conversions require reordering the elements of the underlying FHE vectors; Fhelipe implements them using a unified approach for permutations, described in Sec. 5.5.
• + and * perform elementwise adds or multiplies on the underlying FHE vectors. This requires both inputs to have the same layout, which is ensured by conversions inserted during Fhelipe's layout assignment pass (Sec. 5.4).
To simplify compaction, Fhelipe keeps gap elements set to 0. Thus, sequences of operations that introduce gaps (shrink, stride, sum) must mask out discarded elements. This is done by multiplying with an unencrypted vector of 0 and 1 values.

Analytical Layout Assignment
Fhelipe assigns tensor layouts analytically, without any profiling. It uses a forward pass that assigns initial layouts, enhanced with backtracking to reconsider and improve layout choices.
The forward pass traverses the dataflow graph in topological order, starting from program inputs. It assigns initial layouts using a simple procedure: (1) each program input is assigned row-major layout, (2) each unary operation consumes its input in its current layout, which determines the layout of its output, and (3) each operation that produces gaps (stride, shrink, sum) performs compaction (Sec. 5.2) to fill them when possible.
So far, this requires no layout conversions. However, binary operations (+ and *) need both inputs to have the same layout. So, when the pass reaches a binary operation with mismatched input layouts, it initiates backtracking to insert a layout conversion.
Backtracking independently considers converting each of the two inputs.For each, backtracking inserts a conversion at the input, and then attempts to hoist that conversion earlier in the program, where it may become cheaper or unnecessary.Each step of hoisting moves the conversion from the output of an operation to its inputs by deterministically finding input layouts that would produce the output in the desired layout.Hoisting proceeds greedily while the cost of the conversion decreases or stays the same (in terms of rotate groups, Sec.5.5).At the end, backtracking is left with two options for resolving the mismatch; it picks the cheaper one.
Backtracking works well because the cost of layout conversions varies drastically. In the best case, the conversion can be hoisted all the way to a program input, where it can be completely avoided by just changing the input's initial layout. But there are also other cases where the conversion can be made cheaper. For example, consider a matrix-vector multiply (as in Fig. 3) where the matrix and the vector have mismatched layouts: the matrix is in row-major format (as in Fig. 3b), but the vector is in column-major format (as in Fig. 3c). At the element-wise multiply, converting either input (the matrix or the replicated vector) is equally expensive. But hoisting the conversion before the vector's replication makes it much cheaper, as it requires permuting 128 times fewer elements.
Our implementation traverses only linear chains of operations (fan-in and fan-out of 1) to keep runtime linear. While more sophisticated implementations are possible (e.g., backtracking through the entire graph), we find that this suffices to find efficient layouts for all the applications we study.
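The greedy hoisting rule can be illustrated with a toy model. This is a minimal sketch under strong assumptions, not Fhelipe's implementation: each input of the mismatched binary operation is reached through a linear chain, and the conversion cost at every point of the chain is given up front as a made-up number (the real compiler derives costs, in rotate groups, from the layouts). The names `hoist` and `resolve_mismatch` are hypothetical.

```python
def hoist(costs):
    """Greedily hoist a conversion along a linear chain.

    costs[i] is the (hypothetical) cost of converting at point i of the chain,
    ordered from the earliest point to the binary operation's input (last entry).
    Hoisting continues while the cost decreases or stays the same.
    """
    pos = len(costs) - 1
    while pos > 0 and costs[pos - 1] <= costs[pos]:
        pos -= 1
    return pos, costs[pos]

def resolve_mismatch(costs_a, costs_b):
    """Consider converting each input independently and keep the cheaper option."""
    pos_a, cost_a = hoist(costs_a)
    pos_b, cost_b = hoist(costs_b)
    if cost_a <= cost_b:
        return ("a", pos_a, cost_a)
    return ("b", pos_b, cost_b)

# Made-up costs: converting the matrix costs the same everywhere on its chain;
# converting the vector is much cheaper before it is replicated.
matrix_chain = [16, 16, 16]
vector_chain = [4, 16]
choice = resolve_mismatch(matrix_chain, vector_chain)
```

In this example the conversion on the vector is hoisted to before its replication, mirroring the matrix-vector case discussed above. A chain that ends at a program input would carry cost 0 at its first point, since the input's initial layout can simply be changed.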

Permutations
Even with a good layout assignment, layout conversions are sometimes needed. Fhelipe implements layout conversions and tensor shifts using a unified framework for permuting vector elements.
Single-stage permutations: In principle, any permutation can be implemented as a dataflow graph with multiplicative depth 1. Let a rotate group be a subset of vector elements that the permutation rotates by the same amount. Then, we implement a single-stage permutation by (1) isolating each group into a separate vector via masking, (2) rotating each group vector individually, and (3) summing the rotated vectors. Unfortunately, permutations of an $N$-element vector can have $g = O(N)$ rotate groups, resulting in $O(N)$ vector operations taking $O(N^2)$ time.
Decomposed permutations: To reduce runtime, we decompose permutations into a sequence of stages. Each stage is a permutation with a small number of rotate groups. Picking the number of stages is non-trivial: adding stages reduces the number of operations but increases multiplicative depth, as each stage performs masking. Thus, using too many stages can hurt performance by forcing more frequent bootstrapping. Empirically, we find that limiting stages to $g_s$=16 groups balances work and depth, and works well in practice.
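The single-stage scheme (mask each rotate group, rotate it, sum the results) can be checked on plaintext vectors. The sketch below is an illustration only; `np.roll` stands in for an FHE rotate, and the function name `single_stage_permutation` is hypothetical.

```python
from collections import defaultdict

import numpy as np

def single_stage_permutation(v, perm):
    """Move the element in slot i to slot perm[i] using masks, rotates, and adds.

    Returns the permuted vector and the number of rotate groups used.
    """
    n = len(v)
    groups = defaultdict(list)  # rotation amount -> source slots sharing it
    for src, dst in enumerate(perm):
        groups[(dst - src) % n].append(src)

    out = np.zeros_like(v)
    for amount, slots in groups.items():
        mask = np.zeros_like(v)
        mask[slots] = 1                         # (1) isolate the group via masking
        out = out + np.roll(v * mask, amount)   # (2) rotate it, (3) sum
    return out, len(groups)

v = np.arange(10, 18)
reverse = [7 - i for i in range(8)]             # reverse an 8-element vector
permuted, num_groups = single_stage_permutation(v, reverse)
```

Reversing 8 elements needs 4 rotate groups in this model. Every element passes through exactly one mask multiply, so the multiplicative depth is 1, but the number of mask-rotate-add steps grows with the number of groups.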
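For a given number of stages $s$, the best general permutation algorithm requires $N^{2/s}$ groups per stage [34], where $N$ is the vector length. But this is far worse than the theoretical lower bound of $g^{1/s}$, where $g$ is the permutation's number of rotate groups, especially when $g \ll N$.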
To avoid this inefficiency, we propose decomposition algorithms, described below, that exploit the structure in the permutations induced by each operation.
Decomposing layout conversions: Layout conversions reorder the bits of the layout, with a conversion that moves $b$ bits requiring $g = 2^b$ groups. Fhelipe decomposes conversions so that each stage (except the last) moves $b_s = \log_2(g_s) = 4$ bits of the layout, where $g_s$ = 16 is the per-stage limit on groups. Each stage can be constructed greedily so that it reduces $b$ by at least $b_s - 1$, thus reducing $g$ by at least a factor of $g_s/2$. As a result, Fhelipe's layout conversions perform at most 2× more rotates than the theoretical lower bound.
Decomposing shifts: A tensor shift by $k$ moves index $i$ to index $\equiv i + k \pmod{2^n}$. Fhelipe decomposes shifts so that the first $j$ stages move $i$ to index $\equiv i + k \pmod{2^{m_j}}$: the first $j$ stages add the first $m_j$ bits of $i + k$. Fhelipe chooses $m_j$ greedily under the constraint that no stage exceeds $g_s$ groups.
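To see how moving layout bits translates into rotate groups, a layout conversion can be modeled as a permutation of the bits of each slot index. That model is an assumption made for this sketch (it ignores gaps, replication, and multi-ciphertext tensors), and the helper names are hypothetical. The code counts the distinct rotation amounts such a bit permutation induces.

```python
def permute_bits(index, bit_perm):
    """Move bit j of `index` to bit position bit_perm[j]."""
    out = 0
    for j, dst in enumerate(bit_perm):
        out |= ((index >> j) & 1) << dst
    return out

def rotate_groups(bit_perm):
    """Distinct rotation amounts needed when every slot index has its bits permuted."""
    n = 1 << len(bit_perm)
    return {(permute_bits(i, bit_perm) - i) % n for i in range(n)}

def moved_bits(bit_perm):
    """Number of bits that change position."""
    return sum(1 for j, dst in enumerate(bit_perm) if j != dst)

swap = [3, 1, 2, 0]   # swap bits 0 and 3 of a 4-bit index (16 slots)
cycle = [1, 2, 0, 3]  # cyclically move bits 0, 1, 2
swap_groups = rotate_groups(swap)
cycle_groups = rotate_groups(cycle)
```

In these two toy cases the group count stays at or below $2^b$ for $b$ moved bits: swapping 2 bits yields 3 groups, and cycling 3 bits yields 7. The exponential dependence on the number of moved bits is what makes it attractive to split a conversion into stages that each move only a few bits.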
We compare our tensor shift algorithm against the lower bound empirically (obtaining an analytical bound is hard because the behavior of shifts has a complex dependence on both the layout and the shift amount). For $N$=32K, Fhelipe's shift algorithm performs on average 1.68× (standard deviation 14.0%, max 3.10×) more rotates than the theoretical lower bound. We compute this by randomly sampling 200,000 layouts and simulating all possible shift amounts (results do not change after about 2,000 samples).

CNN layer with striding:
A convolutional neural network (CNN) layer processes an input activation tensor with $C$ channels of $H \times W$ elements, and produces an output activation tensor with $K$ channels of $H' \times W'$ elements. The convolution weights are $K \times C$ filters of $R \times S$ elements.
Bottleneck layers are a key part of CNNs: each $H' \times W'$ output channel is smaller than each $H \times W$ input channel, and the output has more channels than the input ($K > C$). Striding is a common way to achieve this: striding by $s$ discards all but every $s$-th row and column, so each output channel is $H/s \times W/s$.
Listing 3 shows the Fhelipe code for a CNN layer with striding, representative of ResNet. This code reuses conv2d from Listing 2. Fig. 7 shows Fhelipe's implementation when the input activation is 2 × 4 × 4, weights are 4 × 2 × 2 × 2, and the stride is 2, resulting in 4 × 2 × 2 output activations. For simplicity, each ciphertext holds 16 elements. The dataflow graph in Fig. 7 shows the tensor operations. Some operations list the corresponding Fhelipe code and ciphertext operations. The drawings on the right show the tensors, layouts, and ciphertext contents like in Fig. 6. Dashed lines separate graph regions with different layouts (labeled 1–5). We emphasize which layout bits change on the right.
The input tensor is in row-major layout and takes 2 ciphertexts (region 1). Convolution first shifts the input to produce 4 tensors, each aligned with an element of the 2 × 2 filters. This does not change layouts. Then, each tensor is replicated 4 times, once per output channel $k$. This adds $k_{1:0}$ as ciphertext-selecting bits, so each tensor now takes 8 ciphertexts (region 2). Next, the tensors are multiplied by the weights and added together; layouts remain the same. Last, the combined tensor is summed along its input-channel dimension $c$ (of size 2): as $c_0$ is ciphertext-selecting, this involves summing ciphertexts pair-wise and does not introduce gaps (region 3).
Striding is challenging in FHE, because it creates gaps. Here, striding by 2 discards every other element of $h$ and $w$, replacing their least significant layout bits $h_0$ and $w_0$ with gaps (region 4). Note how $h_1$ and $w_1$ in the input become $h_0$ and $w_0$ in the output. Fhelipe automatically fills these gaps through compaction (region 5): ciphertext-selecting bits $k_1$ and $k_0$ fill the gap bits, reducing the tensor from 4 to 1 ciphertext (just as in Fig. 6). Lee et al.'s multiplexed-convolutions layout is a specific implementation of this technique for ResNet [52]; however, Fhelipe generalizes this technique, and is the first to apply it automatically.
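The effect of striding and compaction on a layout can be traced on a list of layout-bit names. The sketch below is a toy model, not Fhelipe's representation: bits are plain strings, the names `k`, `h`, `w` for the output-channel, row, and column dimensions are assumed for illustration, and the rule for which ciphertext-selecting bit fills which gap (most significant first) is an assumption chosen to reproduce the example's final layout.

```python
def stride_by_2(slot_bits, dims):
    """Stride by 2 along each dimension in `dims`.

    The least significant bit of a strided dimension becomes a gap, and the
    dimension's remaining bits are renumbered down by one.
    """
    out = []
    for bit in slot_bits:
        if bit == "gap":
            out.append(bit)
            continue
        name, idx = bit[0], int(bit[1:])
        if name in dims:
            out.append("gap" if idx == 0 else f"{name}{idx - 1}")
        else:
            out.append(bit)
    return out

def compact(ct_bits, slot_bits):
    """Fill gap slot bits with ciphertext-selecting bits (assumed order: most significant first)."""
    remaining = list(ct_bits)
    out = []
    for bit in slot_bits:
        if bit == "gap" and remaining:
            out.append(remaining.pop(0))
        else:
            out.append(bit)
    return remaining, out

ct_bits = ["k1", "k0"]                 # 2 ciphertext-selecting bits -> 4 ciphertexts
slot_bits = ["h1", "h0", "w1", "w0"]   # 16 slots per ciphertext
strided = stride_by_2(slot_bits, dims={"h", "w"})
ct_after, slots_after = compact(ct_bits, strided)
num_ciphertexts = 2 ** len(ct_after)
```

After striding, half of the slot bits are gaps, so only a quarter of each ciphertext holds data. Compaction moves the two ciphertext-selecting bits into those gaps, which is how the tensor shrinks from 4 ciphertexts to 1.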
Finally, a non-linear activation function (ReLU) produces the output activations. In FHE, nonlinear functions are approximated with polynomials, which consist only of element-wise adds and multiplies. These element-wise operations do not change the layout. Fhelipe allows implementing any polynomial; prior work has proposed a range of accurate [53] and cheap [41,62,67] activation functions for FHE.
Due to compaction, the output layout ($h_0$, $k_1$, $w_0$, $k_0$) has the bits of dimension $k$ in non-consecutive slot-bits. Thus, the next layer must use that layout for its input. With Fhelipe, this happens automatically and without any conversions: the shifts in Fig. 7 induce different rotations (and masking if needed); Fhelipe automatically chooses the right format for filter weights; and the sum-reduction on the input-channel dimension is done with rotates and adds.
2. LogReg: Fig. 8 shows Fhelipe's layouts in logistic regression training, an end-to-end FHE application (Sec. 7). Here, we show full tensor sizes and ciphertext layouts instead of using reduced

Bootstrap resets a ciphertext back to the initial level $L_0$, allowing for arbitrarily deep FHE programs. Bootstraps are very expensive, so minimizing them is crucial for performance. We propose a practical algorithm for placing bootstraps automatically.
6.1 Placing Bootstraps Is Challenging
Fig. 10 analyzes a simplified ResNet block with the input starting at level 1. As the graph is 2 multiplies deep (Fig. 10), this forces a bootstrap. Then, it is best to bootstrap right after the batch normalization (Fig. 10b): this bootstraps only one ciphertext, and produces an output at a high level ($L_0 - 1$, one below the initial level $L_0$), which reduces the need for bootstraps in subsequent layers. Systematically finding the above bootstrap placement is not easy (in fact, optimal placement is NP-hard [9]). As a comparison point, consider lazy bootstrapping: bootstrapping ciphertexts right before they run out of levels. While simple, this approach performs poorly (Fig. 10c). First, it inserts 3× more bootstraps, due to bootstrapping during the convolution, where computation is wide. Second, it produces an output at level 0, which forces an immediate bootstrap before the subsequent multiply. As a result, Fhelipe outperforms lazy bootstrapping by gmean 3.5× (Sec. 8.2).

Base Algorithm
We propose a novel heuristic that lets us place bootstraps using dynamic programming in close to linear time.
We define the depth of a node as the maximum number of noise trims (multiplies) on a path from an input to the node. Depth partitions the nodes of the graph, as shown in Fig. 10a. Depth and level are closely related: if there were no need for bootstraps, nodes at depth $d$ would be at level $\ell = L_0 - d$, where $L_0$ is the initial level.
Fhelipe's heuristic is to make bootstrapping decisions at depth boundaries: bootstrap either all edges crossing a boundary, or none of them. This avoids one of the main pitfalls we saw in Fig. 10c: lazy bootstrapping bootstraps all but one of the edges crossing the 1–2 depth boundary (the residual connection), producing an output at level 0; had it bootstrapped all of them, the output would have been at level $L_0$=10.
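Depth as defined here is a longest-path computation over the dataflow graph. The sketch below assumes a hypothetical graph encoding (a dict from node name to its input nodes, listed in topological order) and assumes that a multiply counts toward its own depth; neither detail is taken from Fhelipe's implementation.

```python
def node_depths(graph, trims_noise):
    """Compute the depth of every node.

    graph: dict mapping node -> list of input nodes, in topological order
           (program inputs map to an empty list).
    trims_noise: set of nodes that trim noise (multiplies).
    """
    depth = {}
    for node, inputs in graph.items():
        base = max((depth[i] for i in inputs), default=0)
        depth[node] = base + (1 if node in trims_noise else 0)
    return depth

# Toy graph: two multiplies with a rotate and an add in between.
graph = {
    "x": [],
    "w": [],
    "mul1": ["x", "w"],
    "rot": ["mul1"],
    "add": ["mul1", "rot"],
    "mul2": ["add", "x"],
}
depths = node_depths(graph, trims_noise={"mul1", "mul2"})

L0 = 10  # assumed initial level, used only for illustration
levels_without_bootstraps = {node: L0 - d for node, d in depths.items()}
```

Nodes that share a depth form one partition of the graph, and the edges between consecutive partitions are the depth boundaries at which bootstrapping decisions are made.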
Then, Fhelipe uses dynamic programming to find the best boundaries to bootstrap at. Let (1) $T[d]$ be the minimum cost of computing all nodes up to depth $d$, (2) $B[d]$ the cost of bootstrapping all ciphertexts crossing boundary $d$, and (3) $C[d][\ell]$ the cost of computing all values at depth $d$ at level $\ell$. We compute $T[d]$ recursively by choosing the best option for the last bootstrapped boundary $d - k$: $T[d - k]$ guarantees that no node up to depth $d - k$ falls below level 0; bootstrapping the ciphertexts crossing boundary $d - k$ ($B[d - k]$), together with $k \le L_0$, guarantees the same for the nodes between depths $d - k$ and $d$.
For a program with maximum depth $D$, computing $T[D]$ takes $O(D \cdot L_0)$ time; since $L_0$ is a small constant (≈10), this is practically linear in the size of the program. We obtain $B[d]$ and $C[d][\ell]$ from CraterLake's per-operation costs [75]; other cost models can be used as well.
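The dynamic program can be sketched as follows. This is one possible reading of the description, with the recurrence and indexing stated as assumptions rather than taken from the paper: program inputs are treated as a free "bootstrap" at boundary 0, and after bootstrapping at boundary `start`, nodes at depth `start + i` are assumed to run at level `L0 - i`. The names `place_bootstraps`, `T`, `B`, and `C` are labels chosen for this sketch, and all costs in the example are made up.

```python
def place_bootstraps(C, B, L0):
    """Choose depth boundaries to bootstrap at, minimizing total cost.

    C[d][l]: cost of computing all depth-d nodes at level l (assumed given).
    B[d]:    cost of bootstrapping everything crossing the boundary after depth d;
             B[0] is 0 because program inputs start at level L0.
    Returns (minimum total cost, sorted list of bootstrapped boundaries).
    """
    D = len(C) - 1
    T = [float("inf")] * (D + 1)   # T[d]: min cost of computing up to depth d
    last = [None] * (D + 1)        # last bootstrapped boundary chosen for T[d]
    T[0] = C[0][L0]
    for d in range(1, D + 1):
        for k in range(1, min(d, L0) + 1):   # k <= L0 keeps levels non-negative
            start = d - k
            compute = sum(C[start + i][L0 - i] for i in range(1, k + 1))
            cost = T[start] + B[start] + compute
            if cost < T[d]:
                T[d], last[d] = cost, start
    boundaries, d = [], D
    while d > 0:
        d = last[d]
        if d > 0:
            boundaries.append(d)
    return T[D], sorted(boundaries)

# Toy instance: 5 depths, L0 = 2, unit compute cost, boundary 2 is cheap to bootstrap.
C = [[1, 1, 1] for _ in range(5)]
B = [0, 5, 1, 5, 5]
total_cost, chosen = place_bootstraps(C, B, L0=2)
```

In the toy instance the program is deeper than the available levels, so at least one bootstrap is needed, and the search picks the single cheap boundary. The inner `sum` makes this sketch slower than the bound quoted in the text; it is written for clarity, not to match that complexity.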

Additional Optimizations
Narrowing depth boundaries: Any node at a given depth that does not trim noise can be treated as being one depth deeper without violating correctness. For instance, the rotates in Fig. 10a can be treated as having depth 2 instead of 1. This reduces the cost of bootstrapping at the 1–2 depth boundary from 3 ciphertexts to 1. Fhelipe uses this to reduce each boundary's bootstrap cost by running a simple min-cut algorithm.
Omitting shortcut bootstraps: Not all edges crossing a boundary need to be bootstrapped. This is common for shortcut edges that skip over multiple depths, like the residual connection in Fig. 10. Specifically, shortcuts that go from a higher- to a lower-level node do not need to be bootstrapped. Fhelipe exploits this by greedily omitting such shortcuts from the boundary's bootstrap cost. To implement this correctly, Fhelipe tracks the mapping from depth to level for each dynamic-programming entry; using persistent append-only lists [64], this increases runtime only by a logarithmic factor.

METHODOLOGY

7.1 Benchmarks
We evaluate Fhelipe on a diverse set of challenging benchmarks. Table 3 summarizes their features.
Large FHE applications: Fhelipe seeks to compile large applications, which exacerbate programmability issues. But prior FHE compilers use simple benchmarks, like small kernels [23,58,79] with tens of scalar operations or shallow neural networks that require no bootstrapping [25,26]. To test Fhelipe's capabilities against well-optimized baselines, we use three large FHE programs that have been manually developed by FHE experts. Each program is the state-of-the-art in its domain, and is beyond the reach of prior FHE compilers:
(1) ResNet-20 is one of the most complex neural networks to be ported to FHE. Our manual baseline is Lee et al.'s implementation [52], which uses state-of-the-art layouts. ResNet is a deep convolutional network with a non-linear structure, including skip connections that complicate data layouts. It approximates ReLU activation functions with high-degree polynomials, which achieve high accuracy but are much costlier than the low-degree approximations used by simpler FHE networks [13,41]. As a result, ResNet-20 requires frequent bootstrapping. Each execution performs one inference using images from the CIFAR-10 dataset [48].
(2) RNN is an NLP benchmark that performs sentiment analysis using a Recurrent Neural Network [28]. Our manual baseline follows Podschwadt's algorithm [71], enhanced with the data layouts proposed by Samardzic et al. [75]. RNN processes a sequence of 200 word embeddings $x_t$, and incorporates each in its hidden state following $h_{t+1} = \sigma(W_{hh} h_t + W_{hx} x_t + b)$. $\sigma(\cdot)$ is a degree-3 approximation of $\tanh(\cdot)$, and $x_t$ and $h_t$ are both of dimension 128. The chain of $W_{hh} h_t$ matrix-vector multiplies has a similar structure to the workload in Sec. 2.2. We use the IMDB dataset [57].
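The recurrence with a polynomial activation can be written as a short plaintext reference. This is a sketch under assumptions: the degree-3 coefficients below come from a least-squares fit with `np.polyfit` over [-1, 1] and are not the coefficients used by the benchmark, the weights are random placeholders, and the symbol names follow the usual RNN convention.

```python
import numpy as np

def eval_poly(coeffs, x):
    """Horner evaluation using only element-wise multiplies and adds.

    coeffs are ordered from the highest-degree term to the constant term.
    """
    acc = np.full_like(x, coeffs[0])
    for c in coeffs[1:]:
        acc = acc * x + c
    return acc

def rnn_step(h, x, W_hh, W_hx, b, coeffs):
    """One hidden-state update with a polynomial in place of tanh."""
    return eval_poly(coeffs, W_hh @ h + W_hx @ x + b)

# Hypothetical degree-3 fit of tanh on [-1, 1].
grid = np.linspace(-1.0, 1.0, 201)
coeffs = np.polyfit(grid, np.tanh(grid), 3)
max_fit_error = float(np.max(np.abs(eval_poly(coeffs, grid) - np.tanh(grid))))

# Placeholder weights and inputs with the dimensions from the text (128).
rng = np.random.default_rng(0)
dim = 128
W_hh = 0.01 * rng.standard_normal((dim, dim))
W_hx = 0.01 * rng.standard_normal((dim, dim))
b = np.zeros(dim)
h = np.zeros(dim)
for _ in range(3):  # a few steps of the 200-step sequence
    h = rnn_step(h, rng.uniform(-1.0, 1.0, dim), W_hh, W_hx, b, coeffs)
```

Because the activation is a polynomial, each step consists only of matrix-vector multiplies and element-wise adds and multiplies, which are the operations available under FHE.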
(3) LogReg performs logistic regression to train a linear binary classifier. LogReg is one of the few FHE applications that trains an ML model, instead of performing inference. Our manual baseline is state-of-the-art HELR [37]. LogReg performs 32 iterations of Nesterov Accelerated Gradient Descent [73] with batch size 1024 and 197 features per sample; the sigmoid activation is approximated by a degree-7 polynomial. We use the MNIST dataset [49].
Table 3 shows that these applications perform tens to hundreds of millions of scalar operations (when unencrypted), and their multiplicative depth is in the hundreds, requiring frequent bootstraps.
Shallow neural network: Since prior compilers do not perform bootstrapping, we also use a shallow neural network that does not require it:
(4) LoLa-MNIST is a LeNet-style network from Low-Latency CryptoNets (LoLa-Dense) [13] that uses sophisticated layouts. It has unencrypted weights and uses the MNIST dataset [49].
Tensor kernels: Finally, we include three tensor kernels that would be hard to code manually in FHE. These kernels have no prior implementation, and showcase Fhelipe's generality:
(5) FFT computes the Fast Fourier Transform of a vector of 128K samples.
(6) TTM computes the third-order tensor-matrix product $A_{ijk} = \sum_l B_{ijl} C_{kl}$ [56]; all dimensions have size 64.
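For reference, the tensor-times-matrix kernel has a one-line plaintext form. The sketch assumes the usual tensor-algebra definition of TTM, in which the last mode of the third-order tensor is contracted with the matrix; the index names are illustrative, and the example uses small shapes rather than the size-64 dimensions of the benchmark.

```python
import numpy as np

def ttm(B, C):
    """Tensor-times-matrix: A[i, j, k] = sum over l of B[i, j, l] * C[k, l]."""
    return np.einsum("ijl,kl->ijk", B, C)

rng = np.random.default_rng(0)
B = rng.standard_normal((2, 3, 4))
C = rng.standard_normal((5, 4))
A = ttm(B, C)
```

In the tensor language described earlier, a contraction like this would be built from replicate, elementwise multiply, and sum, which is why it stresses layouts.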

Compared Systems
We compare Fhelipe against carefully ported manual baselines and CHET+, an extension of the CHET compiler that incorporates EVA [25] and other improvements. CHET+ is representative of state-of-the-art FHE compilers; in Sec. 8.2, we also compare with other relevant prior work, including FHE-Booster [83] and HeLayers [6].
Fhelipe: We implement Fhelipe in 21,000 lines of Python and C++ code. Fhelipe compiles all applications in under a minute using a single CPU thread (Table 3). Compilation time scales linearly with program size. Fhelipe automatically chooses the $N$ and $Q$ CKKS parameters to meet a user-provided security level (128-bit security by default) [8]. It uses $N$=32K as it is the smallest $N$ that allows for bootstrapping with 128-bit security. Then, benchmarks with bootstrapping (ResNet-20, RNN, LogReg, FFT) use the maximum ciphertext modulus bitwidth allowed by the security level (1,552 bits for 128-bit security), whereas benchmarks without bootstrapping use the minimum bitwidth sufficient to cover their multiplicative depth. LoLa-MNIST uses $N$=16K to match the manual implementation.
Fhelipe leaves to the user the choice of the application fixed-point scale. For the evaluation, applications with manual baselines use the scales selected by prior work (between 35 and 45 bits), whereas tensor kernels use 45-bit scales (Table 3).
Fhelipe uses Lattigo's state-of-the-art bootstrapping algorithm [2], which uses variable scales and consumes 742 bits of modulus per bootstrap. Fhelipe also uses multi-digit keyswitching (an important optimization [30,47,75]), which consumes 305 bits. This leaves $L_0 \approx 10$ levels of application computation for applications with bootstraps (the exact value depends on the application's scale).
Manual baselines: To perform a controlled comparison, we port manual baselines to a common framework (each baseline uses a different FHE library and bootstrap algorithm). We modify Fhelipe to expose a vector interface, disabling all layout passes and automatic bootstrapping. With our contributions removed, this compiler is essentially a reimplementation of EVA [25] that targets the same backends as Fhelipe. When the original baseline implementations use different bootstrapping algorithms that leave a different number of usable levels, we have to place bootstraps anew. We insert bootstraps manually at natural chokepoints, following the intuitions in the original papers. All of the manual reimplementations have better performance than reported in the original papers.
CHET+: CHET, a state-of-the-art tensor FHE compiler, cannot compile most of our benchmarks. First, CHET provides only a few coarse-grained kernels (e.g., matrix-vector multiply), and our benchmarks have operations that cannot be expressed with this limited interface (e.g., LogReg's gradient descent). Second, CHET does not support bootstrapping.
To compare Fhelipe with CHET's approach, we extend CHET to CHET+. CHET+ includes manual implementations of the additional kernels necessary for our applications, and automatic lazy bootstrapping at kernel boundaries (i.e., bootstrapping tensors between CHET's kernels right before they run out of noise budget).
We implement CHET+ by modifying Fhelipe: we group Fhelipe's fine-grained operations (e.g., replicate, *, reduce) into CHET kernels (e.g., matrix-vector multiply), and convert tensors to CHET's row-major layout after each kernel. We also disable optimizations that are unique to Fhelipe: repacking, replicating data inside of ciphertexts, and decomposing permutations into stages. This implements CHET's approach to layouts while allowing a controlled comparison. Like all other systems, CHET+ uses EVA waterline rescaling, so it supersedes the CHET+EVA combination in [25].

Platforms
We evaluate Fhelipe on CraterLake [75], a state-of-the-art FHE accelerator, and on a CPU. For the CPU results, we use the Lattigo state-of-the-art FHE library [2], and run experiments on a single core of a 3.5 GHz AMD Zen2 Threadripper PRO 3975WX CPU (Lattigo is single-threaded). For the CraterLake results, we target the CraterLake configuration in [75]. CraterLake uses double-prime rescaling, using two moduli per level, to support scales larger than its 28-bit wide datapath [45,75]. We use CraterLake's backend compiler and simulator, which the authors shared with us.

Fhelipe Achieves High Performance
Table 4 compares the performance of Fhelipe, state-of-the-art manual baselines, and CHET+ (Sec. 7) on CraterLake. On the four machine learning applications, Fhelipe outperforms the manual baselines by gmean 2.5× and CHET+ by 18.5×; on the three tensor kernels, Fhelipe outperforms CHET+ by up to 7,600×. Further, Fhelipe matches or outperforms the manual baselines across all benchmarks. To help us analyze the performance of these different implementations, Fig. 11 shows the percentage of execution that each application and system spends on bootstrap and non-bootstrap computation. On the large applications (ResNet-20, LogReg, RNN), the well-optimized implementations (Fhelipe and manual) are dominated by bootstrapping, taking up 88%–99% of total runtime.
Although CHET+ spends similar time on bootstrapping as Fhelipe (1×–1.3× across all applications), CHET+ performs much more non-bootstrap compute due to poorly utilizing ciphertext slots and frequently permuting ciphertext elements to keep tensors in its restrictive row-major layouts. On applications without bootstrapping (LoLa-MNIST, TTM, MTTKRP) these layout inefficiencies translate directly into CHET+'s large end-to-end slowdowns.
ResNet-20: Fhelipe matches the performance of Lee et al.'s heavily-optimized manual ResNet-20 [52]. This implementation relies on complex layouts to replicate data within ciphertexts and fill gaps after strided convolutions (as in Fig. 6); an earlier version from the same authors [53] lacked these optimizations and performed 10× worse. Yet, these optimizations are naturally captured in Fhelipe's layout representation, and Fhelipe performs them automatically.
CHET+ incurs 58× more non-bootstrap compute because (1) it does not replicate data within ciphertexts, and (2) it needs to reshuffle data after each strided convolution. Further, CHET+'s naïve bootstrapping places 1.3× more bootstraps.
RNN: RNN performs a sequence of matrix-vector multiplies, similar to the example in Sec. 2.2. Both Fhelipe and manual achieve good performance by alternating between row-major and column-major layouts; CHET+ is 5.1× slower because it uses only row-major layouts.
LogReg: Fhelipe outperforms manual LogReg by 12.3× mainly due to performing 14× fewer bootstraps. First, thanks to compaction, Fhelipe needs to bootstrap only 1 ciphertext per tensor instead of 7. As we saw in Sec. 5.6, these compactions are cheap and add no overheads. Second, Fhelipe bootstraps only 1 tensor per iteration instead of 2 due to bootstrapping at depth boundaries that avoid shortcut bootstraps (Sec. 6.3).
CHET+ also removes gaps eagerly, and so performs only 6% more bootstraps than Fhelipe. However, CHET+ is 32.5× slower overall because its limited row-major layouts cause poor utilization, as little as 1/256 of ciphertext slots; Fhelipe packs densely, as we saw in Fig. 8.
LoLa-MNIST: As LoLa-MNIST requires no bootstrapping, it stresses the efficiency of layouts more heavily than the large applications. Fhelipe outperforms manual by 3.2× due to manual missing opportunities for data packing and replication, and CHET+ by 322× due to CHET+ incurring an expensive layout conversion after the first strided convolution.
Note that LoLa-MNIST uses only non-power-of-2 tensor dimensions. This shows that the padding overheads incurred by Fhelipe's bit-permutation layout (Sec. 5.1) are far outweighed by the efficiency gains from the additional data packing opportunities that this representation provides.
Tensor kernels: As the tensor kernels have no prior manual implementations, we compare Fhelipe only to CHET+. FFT operates on one-dimensional vectors of fixed size and its dataflow graph is a simple sequence of fixed-width stages. This stresses neither layouts nor bootstrap placement, so CHET+ is only 1.1× slower than Fhelipe.
TTM and MTTKRP stress layouts the most, as they replicate and sum across multiple dimensions, and do not bootstrap (like LoLa-MNIST). CHET+ falters because it does not replicate data within ciphertexts, using as little as 1/512 of ciphertext slots, and because it extensively shuffles data to keep tensors in row-major layout. As a result, CHET+ is 7,600× slower on TTM and 5,600× on MTTKRP.
Layouts: To study the importance of using flexible layouts independently of CHET+, Table 6 compares with Row-Major, a version of Fhelipe that restricts all tensors to row-major layouts. Row-Major differs from CHET+ in that it keeps all other Fhelipe features (bootstrap placement, decomposed permutations, replicating in ciphertexts) and in that it converts to row-major after every Fhelipe operation, whereas CHET+ converts only after each of its coarse-grained kernels. Row-Major incurs gmean 22.8× slowdown. This is due to layout conversions, e.g., to remove gaps after each sum and stride. The slowdowns of Row-Major and CHET+ are highly correlated across applications. However, Row-Major outperforms CHET+ on TTM and MTTKRP due to using Fhelipe's efficient algorithm for decomposing layout conversions into stages (Sec. 5.5).
We have so far focused our layout comparisons on CHET, but HECO and HeLayers also select layouts automatically. HECO [79] uses only column-major layouts (and lacks other Fhelipe features, like our decomposed layout conversions), so it would suffer large overheads on these benchmarks.
HeLayers [6] supports a wider range of layouts than CHET or HECO.Since implementing its approach within Fhelipe would be hard, we compare using a single representative benchmark, ResNet-20.We manually apply HeLayers' layouts to ResNet-20, following the exact implementation of convolutions and striding in [6] and selecting the layout that maximizes performance.We found its performance to be 8.3× worse than Fhelipe's (on CraterLake).
This slowdown is due to gaps in HeLayers' layouts leading to 2×–16× more bootstraps per tensor. HeLayers' layouts can be viewed as a subset of Fhelipe's, where each dimension has a tiling specifying its slot-selecting bits. However, HeLayers forces the same tiling to be used through the entire computation. So gaps introduced by, for example, summing across input channels (Sec. 5.6) cannot be immediately filled, as the vacated slot-selecting bits are coupled to the summed dimension.
Fhelipe uses a fundamentally different approach from HeLayers: rather than fixing one layout and using profiling to find the best single layout, Fhelipe assigns different layouts freely throughout the computation, and implements each tensor operation for its specific layout.This lets Fhelipe simultaneously (1) keep data packed throughout by using compaction, and (2) avoid expensive layout conversions that add work and depth.In ResNet-20, the first effect is crucial, as keeping data packed drastically reduces bootstrapping.This highlights the advantage of our analytical approach.

Fhelipe Preserves Accuracy
The unencrypted versions of ResNet-20, RNN, and LogReg achieve accuracies of 91%, 78%, and 97%, which are typical for the datasets they use. Table 7 shows that the Fhelipe and manual versions match the accuracy of their unencrypted counterparts. We run enough samples to achieve 95% confidence intervals below ±1% [38]. This result is expected, since Fhelipe uses the same scales as the manual versions. Nonetheless, Fhelipe versions have a different computation graph, and this shows that our transformations are correct and do not impact accuracy.
Table 7 also shows the number of error-free mantissa bits measured against the unencrypted computation with 64-bit floating-point values. This conveys the maximum absolute error of the output (e.g., 13 error-free mantissa bits means absolute error < 2^−13 across all output slots). Fhelipe matches the manual versions on this finer-grained metric as well.
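The error-free-bits metric can be computed from the maximum absolute error over all output slots. The sketch below is one way to turn the definition in the text into code: it returns the largest integer b such that the maximum absolute error is strictly below 2^−b. The function name and the example values are hypothetical, and whether the paper rounds in exactly this way is an assumption.

```python
import math

import numpy as np

def error_free_bits(fhe_output, reference_output):
    """Largest b such that max |fhe - reference| < 2**-b over all slots."""
    diff = np.asarray(fhe_output, dtype=np.float64) - np.asarray(reference_output, dtype=np.float64)
    max_err = float(np.max(np.abs(diff)))
    if max_err == 0.0:
        return math.inf
    # b < -log2(max_err), so take the largest integer strictly below that value.
    return math.ceil(-math.log2(max_err)) - 1

reference = np.zeros(4)
noisy = np.array([0.0, 2.0 ** -13.5, -(2.0 ** -20), 0.0])
bits = error_free_bits(noisy, reference)
```

Because the metric depends on the worst slot, a single inaccurate output element lowers the reported number of bits even when all other slots are precise.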

CONCLUSION
FHE is enticing but very hard to program. We have presented Fhelipe, a novel FHE compiler that is the first to address FHE's key remaining programmability challenge, automating data layouts to use FHE's huge vector ciphertexts well. Fhelipe also automatically manages noise end-to-end by placing bootstraps efficiently. We show that these contributions enable programs written in a simple tensor language to match or outperform hand-optimized FHE circuits, widely outperform prior compiler techniques, and dramatically simplify programming.

Fig. 1. Overview of FHE execution. FHE computations are circuits that are hard to write manually. An FHE compiler automatically produces an FHE circuit from a high-level program.

Fig. 11. Breakdown of execution time between bootstrap and non-bootstrap computation across benchmarks.
A Tensor Compiler with Automatic Data Packing for Simple and Efficient Fully Homomorphic Encryption. Proc. ACM Program. Lang., Vol. 8, No. PLDI, Article 152. Publication date: June 2024.

Table 1. Prior FHE compilers have key limitations. By contrast, Fhelipe supports automatic layout assignment, automatic bootstrap placement, programs with millions of operations, and a general tensor-based interface.

Table 3. Characteristics of our benchmarks: fixed-point scale and total scalar operations; Fhelipe's multiplicative depth, compilation time, and lines of application code; lines of code in the manual implementation.

Table 4. Performance on CraterLake for Fhelipe, Manual, and CHET+ baselines. All systems use EVA waterline rescaling.

Table 5. Performance on CPU for Fhelipe, and speedups over Manual and CHET+. tout denotes runs that timed out after 5h.

Table 6. Fhelipe speedups over alternative bootstrap placement algorithms (Lazy and FHE-Booster); and over a version of Fhelipe that uses only row-major layouts.

Table 7. Prediction accuracy (with 95% confidence intervals) and number of error-free mantissa bits compared to unencrypted double-precision computation across all slots of the output (higher is better) for Fhelipe and manual baselines.