Towards Generic MPC Compilers via Variable Instruction Set Architectures (VISAs)

In MPC, we usually represent programs as circuits. This is a poor fit for programs that use complex control flow, as it is costly to compile control flow to circuits. This motivated prior work to emulate CPUs inside MPC. Emulated CPUs can run complex programs, but they introduce high overhead due to the need to evaluate not just the program, but also the machinery of the CPU, including fetching, decoding, and executing instructions, accessing RAM, etc. Thus, both circuits and CPU emulation seem a poor fit for general MPC. The former cannot scale to arbitrary programs; the latter incurs high per-operation overhead. We propose variable instruction set architectures (VISAs), an approach that inherits the best features of both circuits and CPU emulation. Unlike a CPU, a VISA machine repeatedly executes entire program fragments, not individual instructions. By considering larger building blocks, we avoid most of the machinery associated with CPU emulation: we directly handle each fragment as a circuit. We instantiated a VISA machine via garbled circuits (GC), yielding constant-round 2PC for arbitrary assembly programs. We use improved branching (Stacked Garbling, Heath and Kolesnikov, Crypto 2020) and recent Garbled RAM (GRAM) (Heath et al., Eurocrypt 2022). Composing these securely and efficiently is intricate, and is one of our main contributions. We implemented our approach and ran it on common programs, including Dijkstra's and Knuth-Morris-Pratt. Our 2PC VISA machine executes assembly instructions at 300Hz to 4000Hz, depending on the target program. We significantly outperform the state-of-the-art CPU-based approach (Wang et al., ESORICS 2016, whose tool we re-benchmarked on our setup). We run in constant rounds, use 6× less bandwidth, and run more than 40× faster on a low-latency network. With 50ms (resp. 100ms) latency, we are 898× (resp. 1585×) faster on the same setup.
While our focus is MPC, the VISA model also benefits CPU-emulation-based Zero-Knowledge proof compilers, such as ZEE and EZEE (Heath et al., Oakland'21 and Yang et al., EuroS&P'22).


INTRODUCTION
Secure multi-party computation (MPC) allows mutually untrusting parties to execute programs on their private inputs while revealing only the output. MPC has become relevant in academia and industry. It has been commercially deployed in online auctions, electronic voting, financial technology, and has found many use cases in medicine, privacy-preserving machine learning, and distributed databases.
Typically in MPC, we encode programs as circuits. While any bounded program can be compiled to a circuit, the compiled circuit is often much larger than the source program. Real world programs (1) access large arrays of data and (2) use complex control flow. Compiling these two program features often results in huge circuits, and MPC cost scales with the size of the circuit. If we wish to enable secure computation of real-world programs, we must circumvent the cost imposed by compiling these features to circuits.
While the issue of array access can be resolved via oblivious RAM (ORAM) [7] or garbled RAM (GRAM) [22], complex control flow has gone largely unaddressed.
Straight-line execution. Indeed, most existing MPC tools "solve" the control flow problem by disallowing complex control flow. Most existing MPC toolchains require that the programmer hand-annotate each loop with a hard-coded upper bound on the number of loop iterations [9]. With these annotations, the program becomes a simple straight-line program, compatible with the circuit model. A compiler can now unroll each loop precisely the specified number of times, then compile each iteration into gates.
This approach is problematic. At best, annotating programs is an annoyance. At worst, hard-coded loop bounds ruin performance. [Footnote 1: For performance, Dijkstra's algorithm may be implemented with a priority queue containing partial solutions sorted by distance from the start node. Standard Dijkstra is based on a simple array, as is also done in [28]. We use standard Dijkstra for illustration and direct performance comparison with [28].] ObliVM [21] showed that for Dijkstra's algorithm, if |V| and |E| are public, the straight-line approach can reclaim the loop asymptotics via loop coalescing. Using loop coalescing, we can flatten the nested loop on lines 13-33 into a single loop with an internal conditional. Then, the number of iterations of this top-level loop is a function of |V| and |E|, so it is possible to properly bound the loop. See further discussion in Section 3.
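Loop coalescing can be illustrated with a small cleartext sketch (our own toy example, not ObliVM's actual transformation): a nested loop is flattened into a single loop with an internal conditional, so one hard-coded bound (here n * m) covers the whole nest.

```python
def nested_sum(n, m):
    """Reference: a nested loop, which would need two loop bounds."""
    total = 0
    for i in range(n):
        for j in range(m):
            total += i * j
    return total

def coalesced_sum(n, m):
    """Coalesced: one loop with an internal conditional advancing (i, j)."""
    total = 0
    i = j = 0
    for _ in range(n * m):  # a single, easily bounded top-level loop
        total += i * j
        j += 1
        if j == m:          # internal conditional replaces the inner loop
            j = 0
            i += 1
    return total
```

Both functions compute the same value; the coalesced version exposes only one loop bound to the circuit compiler.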
While loop coalescing can solve this particular problem, it places a significant burden on the programmer: the programmer must now reason about and properly specify upper bounds on coalesced loops. This may be expensive if |V| and |E| are secret, such as if Dijkstra's is nested inside another data-dependent loop, requiring costly further coalescing or excessive padding. This syntactic transformation produces expensive code that is difficult to further optimize.
CPU emulation. CPU emulation correctly implements Dijkstra's asymptotics, but incurs significant concrete cost.
The state-of-the-art CPU emulator implements a sufficient subset of the MIPS instruction set [28] to handle Dijkstra's. This CPU stores the compiled assembly program, the register file, and the main memory in three separate RAMs. [28] implements RAM using either Circuit ORAM [26] or trivial linear scans, depending on the size of the needed array. Their CPU proceeds by continually fetching and executing instructions.
Storing the program in RAM and applying the fetch-and-execute paradigm discards all useful static information, some of which [28] manually reclaims by implementing various heuristics, such as periodic (rather than per-instruction) RAM access. Even applying this heuristic, their number of main memory accesses is suboptimal. Further, they must always access smaller memories to fetch instructions and to read/write registers. Their ALU decodes the instruction and conditionally executes the operation for each instruction type that is statically possible at a given step. As a result, each CPU step is a large circuit that often improves on the circuit-based computation only for problem instances where MPC is impractical.
Our approach, discussed next, systematically optimizes away many of the principal inefficiencies of [28] and results in significantly improved performance. For instance, for Dijkstra's with 100 nodes and 300 edges and when run on the same setup, our VISA machine uses 5.8× fewer RAM accesses, consumes 7.3× less bandwidth, and runs 44.9× faster. We are 1585× faster on a 100ms-latency network.
Our solution: VISA machines. The state of the art presents a dichotomy: CPU emulation or straight-line programs.
In this work, we suggest and explore a hybrid approach to handling arbitrary programs inside MPC. Our variable instruction set architecture machine, or VISA machine, handles programs with arbitrary control flow, but avoids most of the overhead of the CPU emulation approach. It uses the statically available context to optimize the scope (and hence the cost) of each execution step.
In short, a VISA machine is distinct from a CPU in that it does not repeatedly execute instructions, but rather repeatedly executes entire fragments of the source program. Each fragment is an arbitrarily long straight-line portion of the source program text. The basic advantage of this is that we can cheaply handle each fragment as a circuit. While we still need CPU-like machinery to coordinate the execution of the fragments and ensure privacy, the amount of needed machinery is substantially reduced.

Contribution
We propose variable instruction set architectures, a basic approach to evaluating arbitrary programs inside MPC. We believe that VISAs are the sensible approach to executing arbitrary programs in MPC. VISAs do not limit the programmer to straight-line programs, and they do not incur the high overhead of a basic CPU. A VISA adapts to the target program of interest, an appropriate choice for MPC where we generally assume that the parties agree on a program.
In more detail, we:
• Introduce and motivate the VISA model.
• Construct a complete VISA-based secure two-party computation (2PC) toolchain for assembly programs. Our toolchain is implemented via garbled circuits (GC).
• Resolve technical issues needed to combine core components of a GC-based VISA machine: GC conditional branching [10,12] and Garbled RAM [13].
• Formalize our instantiation as a garbling scheme [2] and prove the resulting formalism secure. Our garbling scheme securely evaluates arbitrary assembly programs written in our ISA. Using garbling schemes as the underlying mechanism has two key benefits.
- First, we dramatically decrease the number of communication rounds, resulting in orders of magnitude improvement (see Section 7.4.3). Prior work [17,28] used tens of rounds per CPU step, while we require one message plus an OT for the entire 2PC.
- Second, our technique can be elevated to the covert, PVC, and malicious models using standard techniques.
• Implement our VISA machine, including, significantly, the first implementation of Garbled RAM [13].
• Experimentally evaluate the performance of our toolchain. We ran our VISA machine on a number of assembly benchmarks, including Dijkstra's, Knuth-Morris-Pratt, and a private set intersection benchmark from [28]. Our results indicate significant improvement over the prior best approach to arbitrary assembly programs [28]: we run in constant rounds, use 4-7× less bandwidth, use 5-10× fewer RAM accesses, and run 40-70× faster (up to 1585× with 100ms latency), yielding a machine that executes assembly instructions at 300-4000Hz. We also experimentally show that our work, as expected, overtakes circuit-based 2PC (EMP [27]) even for small programs with non-trivial control flow.
• Plan to open source and maintain a cleaned version of our prototype toolchain.
• Note that while our focus is on MPC, the VISA model also directly applies to CPU-emulation-based Zero-Knowledge Proof (ZKP) compilers, such as ZEE and EZEE [16,29]. Indeed, they face similar problems of more efficient CPU design (e.g., fragmentation and stacking), ZK ORAM integration with branching, etc., and the VISA approach is similarly beneficial to ZKP compiler work. We leave specific instantiations of ZKP VISA as exciting future work.
Recent breakthrough GC and MPC improvements on free branching [10,12,14,15] and efficient GRAM [13] removed fundamental technical roadblocks needed to move away from straight-line circuit execution. We believe that our hybrid approach, contextual fragment-based execution engines, will underlie the next generation of 2PC and MPC toolchains. This paper initiates this direction and sets the stage for future cryptographic and interdisciplinary work that will likely involve programming language, static analysis, and compiler techniques, and that will interface with high-level programming languages.

OVERVIEW
We introduce our model at a high level and explain the fundamental benefits of our approach. We then introduce lower-level technical challenges and briefly outline our approach to solving them.
Our basic observation is that CPU emulation is a blunt generic mechanism: CPUs in cleartext machines are static devices that can execute each step of any program. But in MPC, the program is public, and there is no need to use a fixed generic set of instructions. Instead, we can derive our machine's 'instruction types', which we call fragments, from the target program itself.
Each fragment can be arbitrarily large and complex, so long as it does not contain data-dependent loops. We can generate custom circuitry tailored to each fragment, avoiding the need to mechanistically execute the fragment one instruction at a time. Thus, once our machine enters a fragment, we pay essentially no overhead to execute that fragment. In this sense, we obtain the benefit of straight-line execution.
At the same time, our machine dynamically dispatches over the fragments, so we can handle all possible execution paths. In this sense, we obtain the benefit of CPU emulation.
Our execution engine does not necessarily need to dynamically dispatch over each program fragment at each step. At each step, it is sufficient to only guarantee execution of fragments that may occur at this step. In many useful programs, this active set is much smaller and consists of cheaper fragments than the full set.
Program fragments are generated by a compiler. There are many choices for how to fragment a program, and good fragmentation is crucial to performance. We discuss related trade-offs (see Section 5.4).

Notation
Our execution engine repeatedly conditionally dispatches over varying sets of fragments chosen from the target program. We call the specification of a machine that operates this way a variable instruction set architecture (VISA). A VISA machine instantiates a VISA specification. Our VISA machine, which we call GAR, is implemented via GC; of course, one could implement a VISA machine from different primitives, such as a secret-sharing-based protocol and off-the-shelf ORAM.
At each step t, a VISA machine can execute any fragment in the active set of step t. We compose each fragment from many base instructions in the program text. Note we thus consider two kinds of instructions: base instructions are typical low-level assembly instructions, whereas fragments are the instructions of a VISA and are composed from multiple base instructions. Fragments are automatically chosen by a type of compiler that we call a fragmentation strategy; our GAR construction includes a built-in fragmentation strategy.
In the remainder of this section, we explain and motivate VISA machines in more detail. We explain our advantages by referring to Dijkstra's algorithm (Figure 1).

VISA Advantages
VISA machines do not repeatedly execute instructions, but rather repeatedly execute entire fragments of the source program. This leads to several important advantages:
Free register file. As each fragment is a straight-line piece of code, we do not need to dynamically store and access local variables from a register file. Instead, like the straight-line approach, a VISA machine routes arguments to operations directly and without cryptographic cost.
We still pay to route the content of the register file between fragments, but within a single fragment, the register file is free.
Example 2.1. Consider line 18 of Dijkstra's (Figure 1). Under CPU emulation, this simple assignment requires reading j from, and writing bestj to, the register file. In practice, these would be implemented by linear scans of a modest array. Linear scans are expensive. As a reference point, suppose the register file holds 16 32-bit registers. Using state-of-the-art GC, each linear scan of this register file costs ≈ 16KB of communication. In the CPU emulation approach, this cost is paid multiple times per CPU cycle. In our VISA machine, this overhead is erased: to handle line 18 the parties may simply agree to name certain wires in the fragment circuit bestj.
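The ≈ 16KB figure can be sanity-checked with back-of-the-envelope arithmetic (our own estimate, assuming standard half-gates parameters: with free XOR a 2:1 bit multiplexer costs one AND gate, and each AND gate costs two 16-byte ciphertexts at 128-bit labels):

```python
def linear_scan_bytes(num_regs=16, word_bits=32, ct_per_and=2, ct_bytes=16):
    """Estimate the GC communication of one linear scan of the register file."""
    muxes_2to1 = num_regs - 1           # a 16-to-1 word mux = 15 2:1 word muxes
    and_gates = muxes_2to1 * word_bits  # one AND gate per bit-level 2:1 mux
    return and_gates * ct_per_and * ct_bytes
```

With the default parameters this yields 15,360 bytes, i.e., roughly the ≈ 16KB cited above.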
No instruction memory. Programs execute fewer fragments than they do base instructions. Thus, when the VISA machine dynamically decides which fragment to execute next, the space of choices is smaller. This means that the VISA machine does not need to store fragments in an instruction memory. Instead, we conditionally dispatch over an integer that indicates which of the small number of statically known fragments should be executed next. This eliminates many usages of ORAM/GRAM.
Example 2.2. In our ISA, Dijkstra's has 56 instructions but only 7 fragments. (Our actual fragmentation is more nuanced; see Section 5.4.) At each step, we conditionally execute only those fragments that are possible. As a simple example, on the first cycle of Dijkstra's, our VISA machine unconditionally executes the fragment on lines 4-12, since this is statically the only fragment possible. We track the fragments that are possible at each step by tracing the target program's control flow graph.
Fewer conditional choices. Each fragment implements a larger portion of the overall execution than does each instruction. This is significant because there is overhead associated with conditionally executing code inside MPC, whether classically or by stacking [10]. Since we execute fewer fragments than CPU emulation executes instructions, we make fewer conditional decisions, and hence pay the overhead of conditional branching fewer times. With SGC, this advantage manifests in the fact that we need fewer SGC multiplexer gadgets [10,12]. Importantly, for small branches, these gadgets dominate the cost of SGC.
Fewer data RAM accesses. Since each fragment is static, we know precisely how many times each fragment must move data to/from main memory. This allows a VISA machine to access memory less often than a CPU, since in a CPU it is possible that each instruction is a memory access.
Example 2.4. Consider again line 18 of Dijkstra's. Under CPU emulation, the CPU cannot statically deduce that the current instruction is not a RAM access, so when emulating line 18, it must perform a RAM access. Our VISA machine eliminates this access.
The sum advantage of our approach as compared to CPU emulation is well illustrated by again considering line 18 of Dijkstra's. Under CPU emulation, this instruction will involve fetching and decoding the instruction, linearly scanning the register file multiple times, conditionally executing the various instruction types, and accessing main memory. Each of these actions is expensive. In our VISA-based approach, line 18 is free of cryptographic cost.

VISA Technical Challenges and Solutions
Our core contribution is the introduction of VISA-based MPC. Efficiently implementing an MPC VISA machine presents cryptographic and systems-technical challenges; we discuss the main challenges here.
Managing the active set. Inside a fragment, we have full static knowledge of the straight-line code, so we can directly and efficiently compile the code to a circuit. However, a VISA machine must conditionally execute fragments in the active set at each step.
The cost of this conditional dispatch is greatly improved thanks to the recent line of work on MPC conditional branching, in particular Stacked Garbling (SGC) [10,12]. By integrating SGC, we can conditionally dispatch over active set fragments with communication proportional to a single (largest) fragment. Although SGC improves communication, it still requires computation: for b fragments, the computational cost scales with O(b log b) [12]. Thus, we must not allow the active set to grow too large.
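The trade-off can be captured in a toy cost model (our illustrative simplification, not measured constants): classic garbling communicates material for every branch, SGC only for the largest branch, while SGC's computation grows roughly as b log b in the number of branches b.

```python
import math

def classic_comm(branch_sizes):
    """Classic garbling: communication covers every branch."""
    return sum(branch_sizes)

def sgc_comm(branch_sizes):
    """SGC: communication proportional to the single largest branch."""
    return max(branch_sizes)

def sgc_comp_units(b):
    """SGC: computation units grow as b * log2(b) in the branch count b."""
    return b * max(1, math.ceil(math.log2(b)))
```

For an active set of sizes [10, 20, 30], classic communication is 60 units while SGC pays only 30; the b log b computation term is why the active set must be kept small.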
Further, SGC-based conditional branching incurs communication cost that scales with the size of the conditional's interface, i.e., the number of input/output wires, with an additional factor dependent on the number of branches b. This cost imposes constraints on the efficiency of using small fragments, and impacts the utility of breaking down fragments, e.g., in alignment with RAM accesses.
In this work, we do not significantly optimize fragments, leaving it as crucial and significant future work. Our fragments are syntactically derived from the control flow structure of the assembly program. This choice is sufficient for modest programs. We envision that future work can use compiler techniques and static analysis to more intelligently select fragments. For example, a fragment can be split into pieces, or multiple fragments can be combined into one. We emphasize the complexity of this problem space: a good solution should simultaneously consider the size of each active set, the size of fragments, the number of RAM accesses, the per-fragment overhead, such as the size of the interface to SGC, etc.
Stacked Garbling with RAM access. Using SGC to conditionally evaluate fragments introduces a subtle technical challenge in handling RAM accesses within fragments. For multiple technical reasons, it is neither possible nor desirable to access RAM directly from inside an SGC conditional branch. This is primarily because GRAM and ORAM reveal random-looking access patterns to the parties. If an access comes from an inactive conditional SGC branch, then SGC's optimization will reveal information incompatible with the normal access pattern of the GRAM/ORAM. Thus, this use is insecure, as it allows the GC evaluator to identify the active branch in a conditional. See detailed discussion in Section 6. Other issues include the increased computational cost of processing GRAM's expensive access procedure in each branch. Similar concerns may apply to accessing other types of resources, such as stacks, queues, expensive procedure calls (e.g., non-black-box crypto primitives), or recent improved and unstackable GC techniques [11].
In Section 6, we design a novel mechanism for efficiently and securely handling RAM accesses from within SGC branches. In short, our mechanism allows us to cheaply escape the conditional branch, access the resource, and then re-enter that same branch. Each branch can access a resource multiple times. Our mechanism allows fragments to access RAM without paying high cost for SGC gadgets.
We also note the following lower-level contributions:
Entire system and security proof. We package our approach as a garbling scheme and prove it secure.
Implementation. Our system is a non-trivial systems-technical undertaking.

RELATED WORK
In our review of related work, we focus on prior general purpose MPC tools.
Straight-line execution tools. The vast majority of MPC tools use straight-line execution, e.g., [1,4,20,23,27,32]. These tools require that each program loop has a hard-coded upper bound. CBMC-GC goes one step further by trying to infer loop bounds automatically, but still ultimately models the program as a straight-line circuit [6].
Straight-line execution cannot suitably support arbitrary programs where the number of loop iterations depends on the data. We note that [9] is an excellent systematization of knowledge that explores the pros and cons of such tools.
CPU emulation tools. We consider two works that operated in the CPU emulation paradigm [17,28]. [17] used SPDZ to implement a CPU-emulation-based protocol for malicious adversaries. While their online efficiency is competitive with the total cost of [28], their offline efficiency is ≈ 100× slower. In our evaluation (Section 7), we accordingly focus our comparison on [28]. We described [28]'s approach in Section 1, and we compare to their performance in Section 7.
[28] uses Circuit ORAM [26], which could be modularly swapped for a different ORAM, such as [5], correspondingly affecting (improving) performance. We only compare to the existing system [28]. Constant-round complexity (and hence using EpiGRAM) is essential for CPU-emulation and VISA MPC due to the sequential nature of RAM accesses in these models. Interactive ORAMs incur latency cost proportional to the (large) number of steps of a typical program (cf. discussion in Section 7.4.3). Further, GRAM can be easily and cheaply upgraded to stronger security models, e.g., covert or malicious, using existing techniques. Such an upgrade for ORAM constructions, including [5], is a challenge.
We note that TinyGarble implemented a MIPS ALU but did not build on this to implement a working CPU emulation tool [24]. For example, they do not integrate RAM support into their prototype. Their main contributions are (1) better management of the plaintext function by avoiding unrolling it into a plaintext circuit, and (2) applying hardware synthesis tools to reduce the size of the MIPS CPU, improving over the naïve approach by up to 14.95%.
Loop coalescing. Loop coalescing is a compiler technique explored in the MPC context by [21] (and in the proof system context by [25]). The basic idea is to combine the bodies of loops into a single loop with an internal conditional. [21,25] show that this can improve MPC (resp. proof system) performance by reducing the number of hard-coded loop bounds in the program (cf. Section 1.1). The technique does not suggest (nor do [21,25] explore) further optimization, such as fragment design.
There are common characteristics of loop coalescing and VISA.Both techniques conditionally dispatch over program fragments.
Crucially, VISA approaches MPC optimization holistically, providing a clean abstraction and vocabulary for general optimization of oblivious programs (e.g., stacking, GRAM, our new gadgets, etc.) and for expressing optimization constraints. Indeed, VISA emphasizes fragment design as a crucial optimization problem. VISA also provides a convenient vocabulary for discussing low level details, such as the size of a register file and managing the active set. See further discussion in Sections 5.3 and 5.4. In contrast, coalescing is a source code transformation and is at the wrong level of abstraction for fragmentation and low-level optimization.

PRELIMINARIES
We implement our VISA machine using garbled circuits (GC). GC allows for powerful protocols that achieve secure computation in only a constant number of protocol rounds. We build on the half-gates GC technique [33], which requires that the parties communicate two ciphertexts per AND gate and zero ciphertexts per XOR gate [18].
We combine the basic [33] scheme with recent improvements in Garbled RAM [13] and with Stacked Garbling [10,12]. Garbled RAM is needed when accessing data from the VISA machine's main memory, and Stacked Garbling improves the communication consumption incurred when conditionally handling fragments.
We use these GC improvements heavily, and we overcome technical problems needed to compose them.

Garbled RAM
Compiling large arrays to Boolean circuits is infeasible. The problem is that on each array access, the circuit must touch each element of the array. Hence, on each access we pay cost proportional to the size of the array. Garbled RAM (GRAM) [22] equips GC with random-access arrays that incur only sublinear cost. GRAM preserves GC's important constant-round property.
A recent GRAM, called EpiGRAM, dramatically improved the concrete cost of the technique [13]. We implemented EpiGRAM, and we use it to instantiate our VISA machine's main memory.
Our formalism manipulates GRAM directly by using two gates provided by EpiGRAM:
• An ARRAY gate takes as input public natural numbers n and w. It outputs a zero-initialized size-n array of width-w elements. We initialize all of our arrays with width w = 32.

Stacked Garbling (SGC)
Until recent breakthrough work [10,12], GC techniques required communication proportional to the computed program, including inactive branches. SGC [10,12] achieves communication proportional to only the single longest execution path of the program. This improvement is a boon to our approach, because we repeatedly conditionally evaluate the target program's fragments. SGC greatly improves the communication cost of fragments (see Section 7).

Cryptographic Assumptions
Our garbling scheme (Section 6.2) is secure under a typical GC assumption: we assume that the function H is a circular correlation robust hash function [3,33].
As is standard in MPC (e.g., [8,28]), total runtime, i.e., the number of CPU emulation steps, is public. If desired, the steps can be padded.
We consider security in the presence of a semi-honest adversary. Since our construction is a garbling scheme, its security can be extended to the covert, publicly verifiable covert (PVC), and malicious models using standard techniques.

OUR VISA
The general idea of a VISA is agnostic of low-level details. Of course, it is interesting to instantiate and experiment with a specific architecture. We formalize our specific VISA here.
Our VISA is built on top of a base ISA. Our base ISA is indeed basic, providing primitive instructions that (1) perform algebraic operations, (2) achieve dynamic control flow, and (3) read/write main memory. We first formalize this base ISA. We choose a custom base ISA for simplicity of presentation and implementation; it may be desirable in future work to replace the base ISA with an off-the-shelf ISA, such as MIPS.
Once we establish the base ISA, we formalize our VISA, which essentially aggregates base instructions into fragments.

Base ISA
The base ISA specifies the instructions that can appear in our supported assembly programs. We emphasize that we do not execute these instructions one by one; rather, our VISA groups base instructions into fragments, and our VISA machine treats fragments as its atomic units of computation.
The base ISA formalizes both the syntax and the semantics of instructions. Our instructions each provide a simple mechanism for performing algebra, achieving control flow, or accessing memory. To define instruction semantics, we define an abstract machine that executes instructions. Our ISA simultaneously defines our instruction set and the abstract machine that runs them.

Each instruction type handles between zero and three arguments. In general, arguments refer to registers, but some arguments, denoted {•}, can also optionally be immediates (i.e., compile-time constants). val is a helper function that resolves an argument that can be either a register or an immediate. Unless the semantics otherwise mention an effect on the pc, each instruction also increments the pc. The symbol < with an overset u (resp. s) denotes a comparison where the arguments are treated as unsigned (resp. signed) integers.
Definition 5.1 (Base ISA). Our instruction set is formally defined in Figure 2. The semantics of instructions are defined by reference to an abstract machine with a program counter pc, a register file R, a main memory M, and a program P. pc is a 32-bit index that indicates which base instruction to execute next. R is a length-r array of 32-bit integers. M is a length-n array of 32-bit integers. P is an array of instructions. Both r and n are configurable parameters of the abstract machine. A machine is initialized with an arbitrary program. At initialization, pc, R, and M are zero initialized. At each step, the machine updates itself based on the semantics of instruction P[pc].
In our implementation, we instantiate a machine with a size-13 register file; we vary the size of RAM depending on the requirements of the executed program.
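The abstract machine's step semantics can be sketched as a small cleartext interpreter (illustrative only; the instruction names below are our own simplified stand-ins, not the paper's Figure 2 instruction set):

```python
def step(pc, R, M, P):
    """Execute instruction P[pc]; return the next pc.
    Unless the instruction sets pc itself, pc is incremented."""
    op, *args = P[pc]
    if op == "li":               # R[d] <- immediate
        d, imm = args
        R[d] = imm % 2**32
    elif op == "add":            # R[d] <- R[a] + R[b] (mod 2^32)
        d, a, b = args
        R[d] = (R[a] + R[b]) % 2**32
    elif op == "store":          # M[R[a]] <- R[b]
        a, b = args
        M[R[a]] = R[b]
    elif op == "bnez":           # if R[a] != 0, jump to target
        a, target = args
        if R[a] != 0:
            return target
    return pc + 1                # default: fall through

def run(P, steps, num_regs=13, mem_size=16):
    """Run P for a fixed number of steps; pc, R, M are zero-initialized."""
    pc, R, M = 0, [0] * num_regs, [0] * mem_size
    for _ in range(steps):
        pc = step(pc, R, M, P)
    return R, M
```

Note that, as in the paper's setting, the number of steps is a public parameter of `run`, not data-dependent.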
We emphasize that while both the register file and the memory are key-value data structures, our VISA machine handles them very differently. Our memory supports dynamic access and is implemented using Garbled RAM. On the other hand, our register file does not need to implement dynamic access: each usage of the register file is statically specified by an instruction, so each register is essentially just a named collection of 32 circuit wires. Inside a fragment, accessing the register file is free.

Fragments
As discussed and motivated in Section 2, batching multiple instructions by creating fragments resolves the bulk of the cost of the CPU emulation approach.

Definition 5.2 (Fragment).
A fragment is a straight-line sequence of base ISA instructions where only the final instruction may be a Control Flow instruction (cf. Figure 2). Definition 5.2 coincides with the notion of a program basic block. We still elect to use new terminology because the notion of a fragment can be (and, we expect, will be) generalized, for example by allowing intra-fragment control flow. The only limitation in extending the above definition is that a fragment should never contain a data-dependent loop, since this would break the straight-line nature of the fragment. For simplicity, we do not explore this direction here, but we believe that this can be exploited heavily in future work.
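The syntactic fragmentation of Definition 5.2 can be sketched as follows (an assumed simplification: the instruction names are hypothetical, and a full basic-block construction would also split fragments at branch targets):

```python
# Hypothetical control-flow instruction names, not the paper's Figure 2.
CONTROL_FLOW = {"jmp", "bnez", "beqz", "call", "ret"}

def fragment(program):
    """Split an instruction list into fragments: each control-flow
    instruction ends the current fragment, so only a fragment's final
    instruction may transfer control."""
    fragments, current = [], []
    for instr in program:
        current.append(instr)
        if instr[0] in CONTROL_FLOW:
            fragments.append(current)
            current = []
    if current:                  # trailing straight-line code
        fragments.append(current)
    return fragments
```

Each resulting fragment is straight-line and can be compiled directly to a circuit.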
We now define the syntax/semantics of our VISA.
Definition 5.3 (Our VISA). Like our base ISA, a VISA is a set of instructions together with the abstract machine that executes them. A VISA instruction is a fragment (Definition 5.2). The VISA abstract machine is identical to the base ISA machine, except that the program P consists of fragments, and at each step the machine executes the semantics of the current fragment P[pc].
Remark 1. A VISA program is thus viewed as including its corresponding variable instruction set. A VISA then specifies the interpretation of the program, and a VISA machine instantiates the (secure) execution of the program. While a full toolchain starts from programs written in a base ISA, the VISA definition concerns programs that have already been fragmented. In practice, the VISA machine toolchain generates the fragmentation and hence the program's instruction set.

While the above specification indicates an array lookup P[pc], our instantiation dispatches fragments via conditional branching. Note that to achieve the prescribed semantics, we do not need to conditionally dispatch over each fragment at each step. In general, not all fragments will be possible at a given step. We reduce the number of conditionally dispatched fragments by considering a control flow graph (CFG) representation of the target program. We maintain a set of pointers into the CFG that indicates the set of possible pc values. At each step, our VISA machine only dispatches over those fragments that are currently pointed to.

Memory Hierarchy
A VISA introduces the opportunity to distinguish three types of memory:
• Main memory. Most program state is stored in a large main memory that is accessed dynamically at high cost.
• Persistent registers. The local state of a VISA machine is held in persistent registers. Inside a fragment, these registers are free. However, to conditionally dispatch over fragments, this local state must be passed to each branch. SGC imposes cost for each bit that crosses the interface to/from the conditional. It is sensible to store frequently used data in persistent registers, but the number of these registers should be kept in check.
• Local registers. Since register access is free inside a fragment, a VISA program can introduce arbitrary numbers of local registers, allowing the fragment to store a large state without paying for it. At the exit of the fragment, the content of local registers is lost.
Allocating data to these levels of memory is a large and interesting optimization space. We use 13 persistent registers and a RAM of up to 2^13 32-bit words in our experiments.

Fragment Generation
As discussed in Section 2, the choice of strategy for breaking a program into fragments can dramatically affect performance. In this work, we align fragments with program basic blocks (i.e., each control flow instruction maps to a fragment), with one exception: we introduce extra fragments for RAM accesses such that each fragment has at most one RAM access. We found that this simple strategy reduces the overall number of RAM accesses, which we found to be the performance bottleneck.
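The strategy above can be sketched as a simple pass over the instruction stream: cut at each control-flow instruction (basic-block boundaries), and cut early whenever a second RAM access would enter the current fragment. This is our own illustrative sketch, with instruction kinds as placeholders.

```cpp
#include <vector>

enum class Kind { Alu, MemAccess, ControlFlow };  // illustrative tags

// Split a program into fragments: basic blocks, refined so that each
// fragment contains at most one RAM access.
std::vector<std::vector<Kind>> fragment(const std::vector<Kind>& prog) {
    std::vector<std::vector<Kind>> frags;
    std::vector<Kind> cur;
    int ram = 0;  // RAM accesses in the current fragment
    for (Kind k : prog) {
        if (k == Kind::MemAccess && ram == 1) {      // a 2nd access:
            frags.push_back(cur); cur.clear(); ram = 0;  // cut first
        }
        cur.push_back(k);
        if (k == Kind::MemAccess) ram = 1;
        if (k == Kind::ControlFlow) {                // basic-block boundary
            frags.push_back(cur); cur.clear(); ram = 0;
        }
    }
    if (!cur.empty()) frags.push_back(cur);
    return frags;
}
```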
Note that for simplicity of presentation, Figure 1 does not show the extra fragments resulting from RAM accesses. Our actual fragmentation has 14 fragments.
While we leave further in-depth exploration of intelligently selecting fragments as significant future work, we outline several guidelines for such strategies.We note that these guidelines sometimes contradict one another, as fragment optimization is a challenging problem.
Generate fragments such that each conditional dispatch is over fragments of similar size and with a similar number of RAM accesses. SGC, and other approaches to free branching in MPC [14, 15], achieves communication cost proportional to that of the single most expensive branch. To best take advantage of free branching, ensure that branches have similar cost. This can be achieved, e.g., by splitting large program basic blocks into more than one fragment and/or by merging multiple basic blocks into a single fragment.
RAM access is an expensive resource; an unbalanced allocation across dispatched fragments misses an opportunity to amortize accesses.
Prefer larger fragments. This reduces the number of VISA machine steps. Hence, larger fragments further reduce the amount of CPU-emulation-style machinery.
Compress the interface to each fragment. As explained in Section 5.3, we pay to transport the content of persistent registers into and out of branches. Using compiler techniques to reduce the number of needed persistent registers will reduce cost.
Prefer fragmentation that leads to smaller active sets. SGC computational and interface costs scale with the number of branches, so we should seek to reduce the number of branches per step (i.e., to shrink each active set). One way to achieve this guideline is by artificially introducing periodicity into a program's execution. For instance, we can split each loop into a number of fragments that is a power of two. Without periodicity, across consecutive loops the active set will tend to grow with each step until it includes every program fragment. Artificially introducing periodicity groups fragments into "congruence classes" and ensures that most fragments never coincide in the same active set. [28] considered a similar technique in their MIPS processor. Introducing periodicity for fragments also introduces further opportunities to align code and amortize cost.
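The effect of periodicity can be demonstrated with a small simulation over a toy CFG (all fragment ids are illustrative): a loop with two alternative bodies of lengths 2 and 3 lets active sets desynchronize and grow, while padding both paths to equal length keeps fragments in fixed congruence classes.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <vector>

using Cfg = std::map<int, std::vector<int>>;

// One machine step: the next active set is the union of successors.
std::set<int> step(const Cfg& cfg, const std::set<int>& active) {
    std::set<int> next;
    for (int f : active)
        for (int s : cfg.at(f)) next.insert(s);
    return next;
}

// Largest active set observed over n steps, starting from fragment 0.
size_t maxActive(const Cfg& cfg, int n) {
    std::set<int> a = {0};
    size_t mx = 1;
    for (int i = 0; i < n; ++i) { a = step(cfg, a); mx = std::max(mx, a.size()); }
    return mx;
}
```

With the unpadded loop {0→{1,2}, 1→0, 2→3, 3→0} the active set eventually contains all four fragments; padding the short path with a dummy fragment (1→5→0) keeps the maximum active set at two.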

GAR: OUR VISA MACHINE
This section introduces GAR (Garbled Assembly with RAM), our implementation of the VISA machine. GAR is formalized as a garbling scheme [2]. As already mentioned, GAR conditionally dispatches fragments using SGC and implements main memory via GRAM.
We first discuss technical issues, and our solutions, in combining our two main building blocks, SGC and GRAM. Then, in Section 6.2, we present the GAR scheme and state the main security theorem (proofs are presented in the Appendix).

SGC with GRAM
The incompatibility of SGC and GRAM. SGC is compatible with many, but not all, GC techniques. SGC requires that the string of material encoding each branch be indistinguishable from a uniform string. This restriction is needed to hide from the GC evaluator the identity of the conditional's active branch: if a branch is inactive, SGC arranges that the evaluator obtains uniform garbage material.
Unfortunately, GRAM's material is distinguishable from a uniform string. In short, GRAM one-by-one reveals to the evaluator RAM indices that are randomly generated without replacement [13]. These revealed indices are indistinguishable from a uniform permutation, but not from a uniform string. Thus it is not secure to use GRAM's ACCESS gate inside an SGC conditional.
SGC's uniform string requirement and GRAM's revealed uniform permutations seem somewhat inherent to the techniques, and it is not clear that we can revise these techniques to make them compatible with one another. Even if it were possible to make the two techniques compatible, it would not be desirable. SGC requires that each party garble each branch multiple times, introducing added computational cost. Since the GRAM access procedure is large, we would like to avoid repeatedly garbling it. It is more pragmatic to simply garble each access once, as we end up doing.
Our approach. One way we could handle RAM access in a VISA machine would be to place each RAM access instruction in its own single-instruction fragment. While correct and secure, this approach violates several of our guidelines for program fragmentation (Section 5.4) and is undesirable for a number of performance reasons. In particular, the resulting fragments are smaller, more numerous, and each RAM access services a smaller fragment. Ultimately, this discards many of the VISA's benefits.
A much better way is to temporarily escape a fragment just to perform the RAM access, then re-enter that fragment. This is the approach we take. We design a new scheme that allows us to temporarily escape an SGC conditional branch (i.e., a fragment), perform the access, then re-enter that same branch. Because we escape the SGC branch before accessing RAM, we avoid SGC's uniform string requirement. Thus, RAM access is simulatable. Crucially for performance, our gadgets escape, but do not fully exit, SGC, and they transfer across the SGC interface only those specific bits that are directly related to the RAM access. Thus, we do not, for example, pay to transfer the full register file on each RAM access.
Instrumenting GRAM access in SGC. SGC uses two garbled gadgets, the demux and the mux, to enter and exit a conditional, respectively. Each of these gadgets handles branch input/output wire-by-wire, where each wire is (indirectly) connected from the outside of the conditional to the internal circuit of each branch. We refer to each of these wire connections as a port of the demux/mux. There is one port per external wire.
Our observation is that, in contrast with standard SGC, the demux/mux need not be evaluated in one shot at the very beginning/end of the conditional. Instead, the GC evaluator can process ports of the gadgets in an arbitrary order, so long as data dependencies in the circuit are satisfied.
This in particular means that the evaluator can (1) process input to a branch by handling only some ports of the demux, (2) evaluate some gates in that branch, generating input to a RAM query, (3) feed the RAM query through ports in the mux to temporarily escape the branch, (4) execute the RAM access outside of SGC, in plain GC, (5) feed the RAM result through ports of the demux back into the branch, and (6) continue evaluation of the branch.
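To make the ordering concrete, the following toy cleartext sketch mirrors steps (1)-(6) for a hypothetical branch that computes M[x XOR 1] + y. Real GAR operates on garbled labels and gadget ports; here plain values stand in for labels, and the comments indicate only the *order* in which data crosses the conditional's interface.

```cpp
#include <cstdint>
#include <vector>

struct Ram { std::vector<uint32_t> M; };  // stands in for garbled RAM

// Toy branch body; each comment tags the corresponding step (1)-(6).
uint32_t evalBranch(Ram& ram, uint32_t x, uint32_t y) {
    // (1) x and y enter the branch via some demux ports
    uint32_t idx = x ^ 1;      // (2) evaluate some gates, forming the query
    uint32_t q = idx;          // (3) the query leaves via mux ports
    uint32_t r = ram.M[q];     // (4) RAM access outside SGC, in plain GC
    uint32_t back = r;         // (5) the result re-enters via demux ports
    return back + y;           // (6) continue evaluating the branch
}
```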
Interestingly, the GC generator's order of building the corresponding GC material is different. Because each branch must be generated from a seed (this is a key trick behind SGC's improvement), the generator garbles each branch all at once, before any RAM accesses are handled. As part of doing so, he assigns uniformly random GC labels to the branch side of each demux port. Only once each branch is fully generated does he generate GC for the RAM access(es). Labels of these GCs match the labels of the ports of the SGC conditional. Finally, he generates the GC material for the demux and mux.
Our modification to SGC still uses the main ideas of Stacked Garbling [10]: our GC generator garbles each branch starting from a distinct PRG seed and then stacks the material together using XOR. Our GC evaluator can decrypt the seed for each inactive branch and hence can reconstruct their garblings, unstack the material for the active branch, and evaluate. I.e., our scheme retains the important communication advantage of SGC.
Next, we formalize our full garbling scheme GAR, which includes the above trick.

Our Scheme: Formalization and Theorems
We formalize our VISA machine as a garbling scheme [2]. SGC [12] and GRAM [13] are also formalized as garbling schemes; our scheme reorganizes and adjusts their procedures, making them compatible with each other and with our VISA (Section 5).
At a high level, our scheme should be understood as a new SGC scheme equipped with black-box GRAM.As an aside, it is possible to replace black-box GRAM with other garbled resources, for example a stack or queue [31].
Program description. A garbling scheme securely handles any program from some specified language. Our goal is to support programs expressed in our base ISA (Figure 2). At the lowest level, we have primitive support for AND gates [33], XOR gates [18], SWITCH statements [12], and ARRAY and ACCESS gates [13]. The semantics of XOR and AND gates are natural; ARRAY and ACCESS gate semantics are specified in Section 4.1. A SWITCH executes only the indicated branch and outputs the result. We group instructions from our base ISA, then compile these to our low-level primitives. Thus, our formal garbling scheme consists of three major steps:
• Compile the base ISA program to a VISA program. Our scheme first groups base ISA instructions into fragments using the strategy described in Section 5.4.
• Compile the VISA program to primitives. We compile each fragment to primitive operations using standard techniques. Each base instruction has a corresponding straight-line circuit, and our scheme stitches together the circuits of the instructions in the fragment. To conditionally dispatch, the scheme wraps the fragment circuits in a SWITCH.
• Evaluate primitives via GC. The most interesting step is the evaluation of primitives, which is explained below.
Note that the first two steps of our handling are quite modular. It is easy to replace the ISA-to-VISA compiler with one that, for example, more intelligently selects fragments. Similarly, we could replace the compiler from fragments to circuits with more sophisticated techniques. From here, our scheme focuses on the handling of primitives, which is its crypto-technical component.

Definition 6.1 (Primitive Circuit Program).
A primitive circuit program is a circuit consisting of AND gates, XOR gates, ARRAY gates, ACCESS gates, and SWITCH statements. ARRAY and ACCESS gates are defined in Section 4.1. A SWITCH statement is recursively parameterized over b primitive circuit programs (for some b) and ⌈log₂ b⌉ wires that indicate which branch to execute. Note that an ACCESS gate is allowed inside a SWITCH statement.
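As a rough illustration, Definition 6.1 can be viewed as a small recursive syntax tree. The node names below are ours (not the paper's formal syntax); the helper shows one use of such a representation, e.g. checking the one-RAM-access-per-fragment invariant of Section 5.4.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Circuit;
using CircuitPtr = std::shared_ptr<Circuit>;

// Illustrative AST for primitive circuit programs (Definition 6.1).
struct Circuit {
    enum Kind { And, Xor, Array, Access, Switch, Seq } kind;
    std::vector<size_t> wires;         // wire ids for gates (unused here)
    std::vector<CircuitPtr> children;  // branches (Switch) / parts (Seq)
};

// Count ACCESS gates anywhere in a program, including inside SWITCHes.
size_t countAccesses(const CircuitPtr& c) {
    size_t n = (c->kind == Circuit::Access) ? 1 : 0;
    for (auto& ch : c->children) n += countAccesses(ch);
    return n;
}
```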
Our GAR scheme handles arbitrary assembly programs by appropriately implementing the above circuit primitives. We note that our assembly compiler generates restricted classes of circuit programs, and we need not handle them in the full generality of Definition 6.1. For example, the resulting primitive program will not feature nested conditionals, and the ARRAY gate will be used exactly once to initialize main memory MEM at the start of the program. Furthermore, each ACCESS gate will be parameterized by the specific array MEM. Looking ahead, our formalism will handle only the relevant special forms of primitive circuit programs.
We are now ready to present our main construction, the GAR (Garbled Assembly with RAM) garbling scheme [2].

Construction 1 (GAR). GAR consists of three components:
• A fragmentation strategy that specifies how to convert a base ISA program into a VISA program. GAR uses the strategy discussed in Section 5.4; we do not formally specify further.
• A compiler that transforms a VISA program (Definition 5.3) into a primitive circuit program (Definition 6.1); because each fragment has no data-dependent control flow, compiling each fragment to a primitive circuit program is straightforward, and we do not specify further.
• A garbling scheme that securely executes primitive circuit programs (Definition 6.1).
The GAR garbling scheme is the tuple of procedures (GAR.ev, GAR.Ev, GAR.Gb, GAR.En, GAR.De). Note, GAR's functionality goes beyond simply instantiating a VISA machine; in particular, GAR fragments programs written in the base ISA and generates VISA programs. We could have treated this functionality separately as part of our toolchain.
We formally present the procedures of the GAR garbling scheme in Appendix A (Figures 10 to 13). Here, we review them at a high level.
GAR.ev. This procedure defines the semantics of primitives (Definition 6.1). The semantics of AND and XOR gates are natural. ARRAY and ACCESS gate semantics are specified in Section 4.1. A SWITCH executes only the indicated branch and outputs the result.

GAR.Ev. This procedure specifies the GC evaluator's handling. In short, the handling of primitives is inherited from prior work [13, 33]. The exception is our new SWITCH primitive, which supports RAM ACCESS inside its branches. We discussed our method for handling ACCESS gates from within a SWITCH in Section 6.1.

GAR.Gb. This procedure specifies the GC generator's handling. Again, the handling of primitives is inherited from prior work [13, 33]. See Section 6.1 for the handling of our SWITCH primitive.
GAR.En. This procedure specifies how cleartext GC inputs are mapped to GC labels. The procedure is standard: on each input wire, a zero maps to one label and a one maps to a different label.
GAR.De. This procedure specifies how output GC labels are mapped to cleartext outputs. The procedure is standard.
GAR meets the standard garbling scheme definitions of correctness, authenticity, obliviousness, and privacy [2]. Meeting these is sufficient to instantiate 2PC/MPC protocols. We state the definitions and prove that GAR meets them in the full version of this paper. Theorems in the full version imply the following:

Theorem 6.2 (Main). Assuming a circular correlation robust hash function, GAR's garbling scheme is correct, authentic, oblivious, and private.

EVALUATION

Implementation and Testing Environment
We implemented GAR and used it to instantiate a semi-honest 2PC protocol in ≈5200 lines of C/C++. We instantiated Oblivious Transfer and network I/O using the EMP Toolkit [27]. We ran our experiments on two m6i.16xlarge machines in the same region of an Amazon EC2 cluster. One machine ran the GC generator and the other ran the GC evaluator. We also ran [28]'s implementation on the exact same setup to establish our baseline.
We configured both systems with the same inputs and with RAM of the same size. GAR handles the program written in our assembly language. [28] takes a MIPS binary compiled by an off-the-shelf compiler, thus placing them at somewhat of a disadvantage.

Benchmarks and Metrics
As explained in Section 1 and further demonstrated in Section 7.4.5, straight-line execution is not feasible for programs with complex control flow. Accordingly, we focus comparison on [28]'s CPU emulation approach. We demonstrate significant improvement on three programs.
• Private Set Intersection (PSI): Two parties each hold a sorted integer array and wish to compute the number of common elements.While fast tailored PSI protocols exist, we use this benchmark for direct comparison with prior work [28].
Figure 3: Comparison of our GAR system with [28]. We run PSI and Dijkstra's for a range of input sizes. We ran both GAR and [28] on the same hardware setup. Our approach substantially improves wall-clock time, communication consumption, and RAM usage. Our count of [28]'s RAM accesses does not include instruction fetching; we list those separately in Figure 5. (The table reports, per program and memory size, the time in seconds, communication in GB, and number of RAM accesses for [28], for ours, and the improvement factor.)

• Dijkstra's shortest path (Dijkstra): One party holds a directed graph while the other holds a pair of source and destination nodes (this is the setting of [28]; other input configurations, e.g., all inputs secret-shared, incur no extra cost). The parties wish to compute the shortest path between the two nodes. This benchmark was used by [28] and [21].
• Knuth-Morris-Pratt string search (KMP): One party inputs a pattern string and the other inputs a search string. They wish to compute the number of occurrences of the pattern in the search string. This benchmark was suggested by [20].

We note that our reported runtimes for [28] are in some cases slower than what was reported in [28] itself. We believe this is due to the fact that program runtime is variable and depends on the program input. Crucially, we ran our GAR system on the same input as [28], and thus our reported numbers are directly comparable.
Assembly code for each benchmark is included in the full version. We report the following metrics:
• Wall-clock time: Wall-clock time includes the time needed for the GC generator to garble, for network transmission, and for the GC evaluator to evaluate. Figure 6 provides a breakdown of these three components.
• Communication: Both [28] and GAR communicate through one TCP/IP connection. We directly measure communication from the TCP port and report the amount of transmitted data.
• # RAM accesses: RAM accesses are the most expensive operation in our approach. Recall that [28] uses ORAM while GAR uses GRAM. We report the number of times we and [28] access RAM. Recall that [28] uses RAM to fetch instructions; we do not. Figure 3 does not include [28]'s RAM accesses to fetch instructions; we list them separately in Figure 5.
PSI. We ran the PSI benchmark on three different pairs of input arrays: two 64-element arrays, two 256-element arrays, and two 1024-element arrays. The PSI program primarily consists of a loop that compares a single element from each array. Our VISA approach captures PSI's loop in a single fragment. This results in simple control flow and high performance. In total, we use only three fragments: one that initializes state before the loop, one that implements the body of the loop, and one that handles the end of the program. Because the loop is captured by one fragment, our approach uses precisely the number of RAM accesses prescribed by the program's execution path.
GAR is 44-70× faster than [28] on each input, uses 4-6× less bandwidth, and uses ≈10× fewer RAM accesses.

Dijkstra's. We ran Dijkstra's (Figure 1) on graphs with 40, 60, 80, and 100 nodes. For a graph with n nodes, we set the number of edges |E| = 3n. The sparse graph is stored as an adjacency list. Our program is split into 14 fragments.
For each input, GAR is 42-45× faster than the baseline [28], uses 6-7× less bandwidth, and uses 5-6× fewer RAM accesses. Our measurements also illustrate that we reduce the number of execution steps by an order of magnitude. Our Hz rate counts base instructions per second.

Performance Breakdown and Discussion
Breakdown of Wall-Clock Time. Figure 6 breaks down the wall-clock time for each of our runs of PSI and Dijkstra. No one cost clearly stands out as the bottleneck. We note that we did not stream the GC from the generator to the evaluator; we expect that proper streaming would allow us to overlap garbling and evaluation with transmission, essentially eliminating the separate costs of garbling and evaluation. We also include detailed GAR costs in Figure 4.

SGC Savings.
Recall that we use SGC to stack GC material from fragments in the active set. In our Dijkstra experiments, we observed that SGC improved communication by roughly 3×, excluding the cost of GRAM access. We expect that this improvement will become more significant for larger and more complex programs, where the total number of fragments and likely the size of the active set will be larger.

Communication Rounds and Latency Impact
GAR is implemented via a garbling scheme, and our instantiation in the semi-honest model only requires performing (parallel) OTs and sending a single message from the generator to the evaluator. In contrast, [28]'s ORAM-based CPU uses multiple rounds of communication per RAM access.
This distinction is not significant in the ultra-low-latency setup we have explored so far, but even modest latency harshly penalizes multi-round approaches. We evaluate the impact by executing one program (PSI-256) with various latencies. We used the Linux traffic control tool tc to configure a network with 2Gbps bandwidth and 0/50/100ms latencies. Figure 8 tabulates wall-clock time performance. (In this experiment we run both parties on a single machine, so measurements in Figure 8 are not identical to our other experiments.) With higher latency, GAR's execution speed is almost unchanged, but [28] becomes significantly slower; GAR's advantage grows from 27× on a 0ms-latency network to 898× (resp. 1585×) with 50ms (resp. 100ms) latency.
Active Set Sizes. GAR (and VISA) performance declines as the sizes of active sets (i.e., the sets of fragments that can possibly be executed in the corresponding step) increase. Let F be the total number of fragments. In our experiments we observe that the active set size starts at 1 and quickly grows to F − 1 as execution proceeds: F = 4 (resp. 15 and 11) for PSI (resp. Dijkstra and KMP). We view active set optimization as crucial future work.

Comparison with Straight-Line Circuit Evaluation
Finally, we illustrate the advantage of the VISA approach over straight-line circuit evaluation by comparing with the semi-honest 2PC of the widely adopted EMP Toolkit [27]. We implemented the Knuth-Morris-Pratt (KMP) string-searching algorithm in both GAR and EMP. (We do not include [28]'s performance, because their repository did not include this benchmark.) Figure 9 tabulates the results, and Figure 4 presents a fine-grained analysis of GAR's performance. KMP searches for occurrences of a length-m pattern held by one party in a length-n string held by the other and outputs the number of occurrences. An important feature of KMP is its O(n + m) time complexity, rather than the naive O(n · m). Circuits are not a suitable representation: KMP contains an inner loop that must be pessimistically unrolled a total of O(n · m) times when in fact only O(n + m) total iterations are needed.
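The complexity gap above can be checked directly on a standard cleartext KMP implementation (this is textbook KMP, not the paper's assembly version); the iteration counter confirms that total loop work stays within O(n + m), whereas a circuit must budget for the O(n · m) worst case of the unrolled inner loop.

```cpp
#include <string>
#include <vector>

struct KmpResult { int occurrences; int iterations; };

// Count (possibly overlapping) occurrences of pat in text, tracking
// the total number of loop iterations to illustrate the O(n + m) bound.
KmpResult kmpCount(const std::string& text, const std::string& pat) {
    int m = (int)pat.size(), n = (int)text.size();
    std::vector<int> fail(m, 0);
    int iters = 0;
    for (int i = 1, k = 0; i < m; ++i) {        // build failure table
        while (k > 0 && pat[i] != pat[k]) { k = fail[k - 1]; ++iters; }
        if (pat[i] == pat[k]) ++k;
        fail[i] = k; ++iters;
    }
    int occ = 0;
    for (int i = 0, k = 0; i < n; ++i) {        // scan the text
        while (k > 0 && text[i] != pat[k]) { k = fail[k - 1]; ++iters; }
        if (text[i] == pat[k]) ++k;
        ++iters;
        if (k == m) { ++occ; k = fail[k - 1]; } // match; allow overlap
    }
    return {occ, iters};
}
```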

A FULL GARBLING SCHEME
We provide the formal procedures for GAR. Figure 12 lists the scheme procedures (i.e., Construction 1) of GAR. Figure 13 explains how we handle ACCESS gates internal to SGC branches. Figures 10 and 11 are unrolled modifications of the COND gate procedures from LogStack.
In Definition 6.1, we define primitive programs as having explicit AND gates, XOR gates, etc. For brevity, and to closely match the procedures of [12], which are conceptually quite similar, we use slightly different syntax in our figures. I.e., a netlist is a sequence of AND and XOR gates. Netlists are handled via the [33] garbling scheme. 'Cond' statements correspond to the SWITCH keyword. 'Seq' statements denote two circuit components that are run in sequence. We emphasize that this language-level difference does not change the meaning of primitive circuit programs.

Figure 12: GAR's garbling scheme. The included algorithms are typical except for the handling of conditionals. Ev and Gb delegate the core of our approach to EvCond (Figure 11) and GbCond (Figure 10). En and De are not listed as they are standard.

Figure 1 :
Figure 1: Dijkstra's algorithm written in C. Each vertical line on the left denotes a contiguous string of instructions that are grouped into a fragment. I.e., this program has seven fragments.

Figure 2 :
Figure 2: Our base ISA. Each instruction type handles between zero and three arguments. In general, arguments refer to registers, but some arguments, denoted {•}, can also optionally be immediates (i.e., compile-time constants). val is a helper function that resolves an argument that can be either a register or an immediate. Unless the semantics otherwise mention an effect on the pc, each instruction also increments the pc. The symbol < with an overset u (resp. s) denotes a comparison where the arguments are treated as unsigned (resp. signed) integers.

Figure 4 :
Figure 4: GAR's evaluation on KMP with different inputs and GRAM sizes.

Figure 10: The algorithm for garbling a conditional with b branches, where each branch has at most q RAM accesses. Main memory is a length-2^n GRAM with 32-bit entries. GbCond follows the structure of LogStack's procedure of the same name. Our colored boxes highlight differences as compared to LogStack, and the green box highlights the most important modification.

Figure 13: Variants of Gb and Ev. These variants are called inside GbCond and EvCond. We use f* to denote the variant of a function f in which the underlying calls to Gb (resp. Ev) are replaced by Gb* (resp. Ev*).

• An ACCESS gate performs an array access. The gate accepts as input (1) an array A, (2) log₂ n bits that together encode an array index i (where n is the array length), (3) a bit rw that indicates whether this is a read or a write, and (4) a w-bit value v that indicates what to store in the array if this is a write. As output, the gate yields (1) A[i] and (2) the updated array, where the content of index i has been replaced by v iff rw = 1.