Abstract
We present a High Level Synthesis compiler that automatically obtains a multi-chip accelerator system from a single-threaded sequential C/C++ application. Invoking the multi-chip accelerator is functionally identical to invoking the single-threaded sequential code it is compiled from. Therefore, software development for using the multi-chip accelerator hardware is simplified, while the multi-chip accelerator can exhibit extremely high parallelism. We have implemented, tested, and verified our push-button system design model on multiple field-programmable gate arrays (FPGAs) of the Amazon Web Services EC2 F1 instances platform, using, as an example, a sequential-natured DES key search application that does not have any DOALL loops and that tries each candidate key in order and stops as soon as a correct key is found. An 8-FPGA accelerator produced by our compiler achieves 44,600 times better performance than an x86 Xeon CPU executing the sequential single-threaded C program the accelerator was compiled from. New features of our compiler system include: an ability to parallelize outer loops with loop-carried control dependences, an ability to pipeline an outer loop without fully unrolling its inner loops, and fully automated deployment, execution, and termination of multi-FPGA application-specific accelerators in the AWS cloud, without requiring any manual steps.
1 INTRODUCTION
The sequential programming abstraction is known to increase programmer productivity and (through behavioral or high-level synthesis (HLS)) hardware designer productivity as well, as compared to parallel programming and traditional parallel hardware design, e.g., because the sequential programming abstraction is deterministic and is free from difficult, hard-to-debug race conditions, and because the sequential programming abstraction relieves the programmer from the extra burden of proving that a parallel version of the program is equivalent to a sequential specification. Today, the sequential programming abstraction is mature: The C language and, in particular, the C++ language allow a programmer to express a sequential algorithm at a fairly high level, which is then reliably compiled by mature optimizing compilers into the very efficient sequential object code of an industry standard processor.
The sequential programming abstraction has often been associated with the Von Neumann computational model, which processes data sequentially, one word at a time: in particular, John Backus in his 1977 Turing award lecture remarked that the Von Neumann computational model has imposed “an intellectual bottleneck that has kept us tied to word-at-a-time thinking” [17]. New parallel programming models have been proposed throughout the years (e.g., [12, 36, 37, 38, 39]) to replace the sequential abstraction model and to overcome this word-at-a-time thinking. However, we believe that the Von Neumann computational model does not fairly reflect the inherent parallelism within the sequential abstraction: with sufficient hardware resources, parallelism in the sequential abstraction can indeed be very large, even in sequential-natured applications (see, for instance, [44, 83]). Instructions of a sequential program do not need to be executed one after the other; they merely need to appear as if they were executed one after the other. Indeed, only the instructions in the execution trace of a sequential program that depend on each other need be executed sequentially: independent instructions can be executed in parallel. Parallel execution of a sequential program can be completed in a time period almost as short as the length of the critical path in the execution trace, i.e., almost in optimal time, except in some known corner cases [79]. The authors believe that parallel hardware accelerator design with sequential code involves understanding the concept of the minimum initiation interval (MII) of an outermost loop of a program (the MII is the minimum number of cycles between the start of iteration \( n \) and the start of iteration \( n+1 \)), seeing the maximum parallelism that already exists in the sequential code when sufficient resources are provided, and then, if needed, improving the sequential code to reduce this MII further, to obtain an even higher performance hardware accelerator.
Hence, sequential code is merely an abstraction, which is useful because of its simplicity: It is not the same as the Von Neumann computational model. In the present work, we are in fact proposing the inherent parallelism in sequential code as a parallel programming model for creating application-specific hardware accelerator systems. We will describe some of our new High Level Synthesis compiler contributions that actually approach parallelism limits in sequential code.
We have created a high level synthesis compiler system in which sequential code, without any parallelization directives, serves as the full-system description of a high-performance parallel hardware design. Since only the dependent operations in an execution trace of a program need to be executed sequentially, and since all other operations in the execution trace can be executed in parallel, sequential code already has plenty of parallelism, as mentioned above. For this reason, our compiler does not require constructs outside of ordinary sequential code semantics to extract parallelism, unlike some existing HLS tools that extend C/C++ with various explicit parallelism constructs expressed through pragmas that are not part of sequential C/C++ (e.g., a “function level pipelining” pragma, or a “dataflow” pragma in Vivado(TM) [91]). Our compiler, therefore, offers an important productivity advantage by creating multi-chip application-specific hardware accelerators merely from sequential software programs. Such a multi-chip application-specific hardware accelerator would be difficult to create by existing methods.
Software development can be a large cost factor in new hardware development projects. For example, the development of new assemblers, compilers, APIs, and kernel drivers is usually required to use the new hardware. But the multi-field-programmable gate array (FPGA) hardware accelerator produced by our compiler is functionally identical to the sequential single-threaded C/C++ function (including its hierarchy of sub-functions) the accelerator was compiled from. Thus, invoking the multi-FPGA hardware accelerator is simply equivalent to invoking, in software, the sequential single-threaded C/C++ function the accelerator was compiled from. This feature of our compiler system can reduce the software development costs of deploying new multi-chip hardware accelerators inside or outside the cloud.
Improvements in processor performance have been maintained for many years thanks to Moore’s law [78] and Dennard scaling [35]. However, after reaching the “voltage scaling wall” in semiconductor and chip manufacturing technologies, Dennard scaling has broken down [19], meaning that it is no longer possible to maintain constant power density between chip generations to attain performance improvements in processors. Note that increasing the number of cores in multi-core processors does not always translate into performance improvements in a user’s application, even for server workloads that contain large-scale parallelism [49].
As one means to remedy the stalled upward trend in the performance of general purpose processors, we believe in designing application-specific hardware architectures that aim at achieving the best performance and power efficiency in the user’s application itself (as opposed to in the circuits of a processor, which interprets the software instructions implementing the user’s application). We believe that designing application-specific hardware is preferable to waiting for performance improvements in general purpose processors through advancements in semiconductor technology [34, 46]. The combination of high performance FPGAs and HLS compilers can even today make it practical to generate dedicated specialized hardware for each different problem. As for ASIC design, “semi-reconfigurable ASICs” can be created by an HLS compiler while still maintaining the performance and power efficiency benefits of an ASIC, while reducing Non-Recurring Expenses through an ASIC “union chip” that can take on different identities depending on the configuration, as we have proposed in [42].
Without implying any lack of generality in our approach, we have selected a highly sequential version of the Data Encryption Standard (DES) key search algorithm (DES cracking algorithm) as a running example and case study in this article, as given in Algorithm 1 (“find the first correct key, trying all candidate keys in strictly sequential order”, where a “correct” key is one that decrypts the given ciphertext and obtains the given plaintext), to demonstrate the high performance achieved by our HLS compiler, which takes the optimized sequential x86 assembler code produced by the gcc compiler as input (including source level information made available by a
While DES has little security value today, DES cracking remains an excellent example of an HLS compiler test program that still has high computational requirements. Note that while DES key search is usually implemented with an embarrassingly parallel problem specification [29], Algorithm 1 is not embarrassingly parallel in the traditional sense, since the loops of Algorithm 1 have a conditional, data-dependent exit on line 19 and are therefore not DOALL loops. The conditional, data-dependent exit from the loops of Algorithm 1 requires that candidate keys be tried in strict sequential order and furthermore resists easy automatic parallelization on a SIMD or multicore machine. While DOALL loops are easy to handle in a parallelizing compiler, Algorithm 1 is an automatic parallelization challenge, which is representative of general sequential single-threaded algorithms where strict sequential order matters (e.g., searching a key in a linked list of linked lists of key-value pairs as in the
2 CONTRIBUTIONS
Our compilation techniques aim at reaching the theoretical performance limit of the sequential application being compiled, namely, an execution time equal to as little as the critical path length of the sequential application’s execution trace, where possible. Our compiler creates a complete parallel hardware accelerator system (and not just a component of a larger system) specified by the input sequential program, where the sequential program serves as the full-system hardware description of the parallel hardware accelerator.
In view of the DES cracking application given in Algorithm 1 (“find the first key that decrypts the given ciphertext and obtains the given plaintext, trying all candidate keys in strictly sequential order”) used in this article as an example, our compiler’s differences from existing HLS techniques include:
– An ability to parallelize outer loops with loop-carried control dependences;
– An ability to pipeline an outer loop without fully unrolling its inner loops;
– Multi-chip system design;
– Super Logic Region-aware placement by auto-pipelining; and
– Productive deployment of multi-chip application-specific hardware accelerators in the cloud.
We will summarize some new features of our high-level synthesis compiler for creating multi-FPGA systems below.

2.1 Ability to Parallelize Outer Loops with Loop-carried Control Dependences
Within existing parallel execution programming platforms such as OpenMP [32] (for multi-core processors), MPI [25] (for distributed parallel processors), or NVIDIA Cuda [73] (for General-Purpose Computation on Graphics Processing Unit (GPGPU) machines with many parallel threads), outer or inner loops can be efficiently parallelized only if there are no loop-carried dependences [5, 6, 7], i.e., all iterations of the loop can be executed simultaneously without fear of giving an incorrect result. Synchronization is expensive on existing architectures [30, 31, 55, 80, 96].
HLS, since it creates a customized, application-specific hardware design, is not limited by the multi-core, distributed computing, or GPGPU architecture paradigms, and has an unbounded ability to overcome limitations of existing architectures and to create the right parallel hardware for a particular application; i.e., with HLS, “the sky’s the limit” in terms of hardware acceleration innovation. But, while innermost loops with loop-carried dependences have been parallelized in previous compiler literature on Instruction Level Parallelism as well as on HLS, using software pipelining [4, 33, 43, 45, 64, 68, 75, 88] (also present with some limitations in today’s commercial HLS platforms [91]), existing HLS techniques have not been able to parallelize outer loops with loop-carried dependences.
As an example of how our compilation techniques parallelize outer loops with loop-carried control dependences, in the DES cracking problem as stated in this document in Algorithm 1, there is a loop-carried control dependence in the outer loop (in the sequential view, iteration p of the outer loop is executed, and if a key is found, the outer loop ends before starting the next outer loop iteration p + 1). In our solution, many iterations p + 1, p + 2\(, \ldots \) of the outer loop are started speculatively, while iteration p of the outer loop is still executing, and are discarded as wasted work if iteration p of the outer loop exits, thereby overcoming the sequential-natured execution ostensibly required by loop-carried control dependences. The reader is referred to [42] for further details of our compilation techniques for overcoming loop-carried control and data dependences in outer loops and related synchronization requirements.
2.2 Ability to Pipeline an Outer Loop without Fully Unrolling Its Inner Loops
Existing commercial HLS tools (e.g., Vivado [91]) are unable to pipeline an outer loop without fully unrolling all its inner loops first. But fully unrolling inner loops leads to wasteful use of the hardware resources, on the order of (number of iterations
Unlike previous HLS techniques, our compilation techniques can pipeline an outer loop (as exemplified by the DES outer loop) without fully unrolling its inner loops. Pipelining an outer loop is similar to pipelining an inner loop; it means: starting iteration \( n+1 \), \( n+2, \ldots \) of the outer loop, before iteration \( n \) of the outer loop is finished, when dependences and resources permit. Our technique of pipelining an outer loop, called hierarchical software pipelining [42], is achieved, in summary, by pipelining an inner loop, creating many copies of the pipelined inner loop, and adding a result reordering circuit so that the results of the inner loops are observed in sequential order by the outer loop, i.e., the collection of inner loops is made to behave like a simple pipelined multiplier that, instead of taking a pair of multiplication operands every cycle, takes the inputs of an entire loop invocation of an inner loop every cycle, and delivers an entire inner loop invocation result every cycle after the pipeline fill time has elapsed, when dependences and resources permit (the entire loop invocation of an inner loop means: all code from the first instruction of the first iteration to the last instruction of the last iteration of the inner loop; it does not mean a single iteration of the inner loop). It is possible to achieve 1 cycle per iteration in the outer loop with sufficient duplication of the inner loops. By applying the technique to outer-outer, outer-outer-outer\(, \ldots \) loops recursively, even the outermost loop in a loop nest can be executed at a rate of one cycle per iteration if dependences permit and sufficient multi-chip hardware resources are provided. E.g., consider adding a further outer loop

to the
Liu et al. in [67] proposed software pipelining of an outer loop without fully unrolling its inner loops, but were unaware of our earlier work on this topic in [42]. Furthermore, [67] did not propose a method for handling loop-carried dependences in an outer loop, or for scalable partitioning of a resulting large design into multiple chips. This reference will be discussed further in the related work section.
2.3 Multi-chip Design
While there have been manually designed multi-chip FPGA hardware accelerator designs, as will be explained in the related work section, existing HLS tools [21, 28, 59, 60, 72, 92] are unable to automatically create a multi-chip application-specific hardware accelerator system from sequential code.
Our compilation approach is able to automatically partition a large hardware accelerator design into multiple chips as exemplified by the DES cracking application of the present paper (see Figures 1 and 2). Our compiler uses I/O controllers to address design partitioning and chip-to-chip communication. In summary, an outgoing message from a message source component in a message source FPGA chip is routed within an on-chip partial network to reach the I/O controller of the message source FPGA chip. The message is then converted to a standard UDP message format and sent out of the FPGA and out of the network interface of the AWS EC2 F1 machine instance through an efficient Data Plane Development Kit (DPDK) software “poll-mode driver”, which operates without invoking the kernel (see the virtual ethernet and the
Fig. 1. Block diagram of flat, non-partitioned DES cracking hardware accelerator design and its legend for the blocks.
Fig. 2. Block diagram of partitioned DES cracking hardware accelerator (with eight FPGAs, 16 inner loops per FPGA) hardware design. Only the first partition has top task adapter and the outer loop finite state machine.
2.4 Compiler-level Super Logic Region Crossing with Auto-pipelining
Xilinx FPGAs comprise multiple “Super Logic Regions” (SLRs) that arise from 2.5D stacking technology, where there is a significant added delay when a signal must traverse more than one SLR. Since the highest performance FPGA designs will not fit into a single SLR, the timing closure problem arising from crossing SLRs must be addressed.
The auto-pipelining feature provided by the Vivado tool inserts additional pipeline registers automatically during the placement phase to help achieve timing closure (see pp. 79-84 of [93]). A hardware component using the auto-pipelined register chain feature is normally functionally equivalent to a FIFO, but one whose sender (input) side may be very far away from its receiver (output) side, in a different SLR. But since the number of register stages the Vivado placer will add to the auto-pipelined register chain is unpredictable at an early compilation stage, one must use an extra internal FIFO with enough elements on the receiver side of the component to ensure that there will not be any buffer overruns, as the sender side sends new data items without being aware of buffer overrun hazards on the receiver side. As our contribution, we have added a very light-weight and energy-efficient credit-based flow control to the sender (input) side of our component, in the form of a small counter predicting the remaining internal FIFO buffer elements on the receiver side, so that a much smaller internal FIFO (with only 2 or 3 elements) can be used on the receiver side instead of an internal FIFO large enough to cover the worst-case traffic. This simplification allows our compiler to use many duplicates of our new auto-pipelined FIFO component pervasively, with little penalty in latency or chip resources, in every design part where SLR crossing can potentially occur, without having to know exactly where the SLR crossings will occur. Then, during the placement phase, Vivado adds more registers on a long wire and fewer registers (possibly only one) on a short wire. Our new auto-pipelined FIFO helped us meet a 250 MHz timing goal by merely using Vivado’s automatic place-and-route with a design spanning multiple SLRs.
2.5 Productive Deployment of Multi-chip Application-specific Hardware Accelerators in the Cloud
Large companies such as Microsoft and Google have already deployed multi-chip FPGA-based and ASIC-based hardware accelerators in the cloud. However, only a few companies (AWS [9, 10, 11] and Alibaba Cloud [2, 3]) allow the user to directly program FPGAs in the cloud at the Register Transfer Level. But using FPGAs in the cloud is not easy. For example, even for an expert engineer, designing and deploying just one AWS EC2 F1 FPGA instance in the cloud requires many manual steps using the AWS console interface and manual editing of Tcl scripts as needed. The requirement for taking such manual steps currently hampers human engineering productivity. Our HLS compiler helps to overcome this productivity barrier by means of fully automated deployment, execution, and termination of multi-FPGA application-specific accelerators in the AWS cloud, without requiring any manual steps.
After this summary, our HLS compiler’s new features will be elaborated in further detail in the sections below.
3 MULTI-FPGA SYSTEM ARCHITECTURE
Our general system architecture is depicted in Figure 3, which shows our cloud-level hyper-scale vision for application-specific multi-FPGA architectures based on AWS f1.2xlarge instances. The AWS EC2 F1 instances are located in the same “availability zone” and the same “placement group” (for achieving lower latency communication), and are connected to each other by 10 Gbps Ethernet links over an AWS “virtual private cloud” network belonging to the user. We developed a UDP packet-based interface between F1 instances for instance-to-instance communication.
Fig. 3. Multi-FPGA System Architecture based on 8 AWS F1.2xlarge Instances with mapping of compiler generated partitions for each FPGA.
3.1 AWS EC2 F1 Instance FPGA Architecture
AWS EC2 F1 is an FPGA-based AWS cloud instance family, which consists of three types of instances: f1.2xlarge, f1.4xlarge, and f1.16xlarge. A general block diagram of an instance is shown in Figure 4. The AWS f1.2xlarge instance contains an Intel Xeon E5-2686 v4 x86 CPU and a Xilinx Virtex UltraScale+ VU9P FPGA device (xcvu9p-flgb2104-2-i). The on-board x86 Xeon CPU and the FPGA are connected via a PCI Express x16 Gen3 mode bus. Table 1 shows the available resources on the AWS EC2 F1. We used AWS f1.2xlarge instances for our multi-FPGA framework.
Fig. 4. General Architecture of AWS F1.2xlarge Instance.
There are some works in the open literature that use an AWS EC2 F1 cloud instance [18, 23, 24, 40, 56, 84, 85], but they mainly target a single FPGA device.
In the AWS EC2 F1 family of instances, the FPGAs are not able to directly access the virtual private cloud network. The on-board x86-based processor in an instance can access the virtual private cloud network through an Elastic Network Interface (ENI); thus, an FPGA can only talk to the virtual private cloud network indirectly, through the on-board x86 processor’s ENI.
AWS provides a “Shell” hardware design for this family of instances, which includes necessary hardware interfaces including PCIe and DDR4 to help with the development of hardware applications on FPGAs. AWS provides an AWS EC2 FPGA Development Kit [9], which contains a hardware development kit (HDK) and a software development kit (SDK). The HDK and SDK, in turn, contain tools for programming an FPGA on the AWS EC2 F1 platform.
The VU9P FPGA device has three FPGA dies (super logic regions, SLRs) called SLR0, SLR1, and SLR2. The AWS Shell is placed in some portions of SLR0 and SLR1. It is important to consider the placement of the I/O controller, which is responsible for communication among different F1 instances. The I/O controller is implemented as a custom logic part that has two interfaces: one for communicating with compiler-generated custom logic and one for talking to the AWS Shell through a “Streaming Data Engine” IP provided by AWS. In order to achieve timing closure of the design at relatively high frequencies, placement of the I/O controller in the right SLR is important.
3.2 Networking
The partitioned Verilog register-transfer-level (RTL) hardware generated by our compiler is not technology-specific, and can be easily ported to any multi-chip platform interconnected by a scalable network (e.g., k-ary n-cube, fat tree) using either FPGA or ASIC chips, each having its own off-chip DRAM. Our compiler has the capability to create parts of a k-ary n-cube network (e.g., hypercube) directly within each hardware partition, so that each serial network link of a chip can be connected point-to-point to a serial network link of another chip with a copper or optical cable, implementing a k-ary n-cube interconnection without requiring external network hardware. A cluster of chips interconnected by an external fat tree switch is also a good target hardware platform for our HLS compiler. For optimizing total cost of ownership for frequently used applications, the NRE costs of an ASIC design can be reduced by using a “chip-unioning” technique [42], so that a single ASIC can implement more than one hardware partition, depending on configuration parameters supplied during power-up. In the present section, we will describe our engineering effort to port our multi-chip hardware accelerator framework to the AWS EC2 F1 instances platform, where scalable communication across FPGAs must be done over an AWS Virtual Private Cloud network and where communication must also include an efficient packet forwarding software layer, such as DPDK, since there is no direct connection from an FPGA to a network interface leading to the Virtual Private Cloud network.
Ideally, an FPGA or ASIC chip that is a part of an application-specific hardware accelerator must connect to a network interface directly, as some researchers have already suggested (e.g., [57]). However, in the AWS EC2 F1 instances platform, the network interface cannot be directly accessed from an FPGA. Instead, AWS has developed a sample FPGA design called the Streaming Data Engine, “
As part of the present project, we have modified
Note that network bandwidth across AWS EC2 F1 FPGA instances that are connected via Ethernet over the virtual private cloud can be a bottleneck in terms of performance for networking applications. The f1.2xlarge instance has up to 10Gb network bandwidth. Packet size plays a significant role in achieving available bandwidth and packets per second (PPS) performance on the AWS F1 instances (see [23]).
We are employing UDP-based (User Datagram Protocol) communication among the different f1.2xlarge instances that contain compiler-generated accelerator partitions. UDP is simple and fast, but it provides unreliable data transmission. However, in our multi-FPGA experiments with f1.2xlarge instances, we have not yet encountered any packet drops, perhaps due to the currently small number of FPGAs being used. We do plan to address the unreliability of UDP communication in our ongoing work by adding reliability features on top of the UDP protocol. There are many works that propose reliability without sacrificing throughput [52, 54, 65, 70, 86]. We are evaluating the alternatives suitable for an AWS Virtual Private Cloud network.
3.2.1 SDE-based Streaming Accelerator Interface.
The Streaming Data Engine (SDE) is a hardware IP module that is provided in the AWS-FPGA library [9]. The SDE provides high-performance streaming connectivity among FPGAs in different F1 instances, and also to the user’s software application running on the host CPU.
We used the AWS
Figure 5 shows the interface of the SDE module with the shell and the custom logic, where the latter includes one partition of our compiler-generated accelerator. The SDE talks to the shell via memory-mapped AXI4 PCIS (PCI slave) and PCIM (PCI master) interfaces. The SDE uses the PCIS AXI4 interface to obtain descriptors written by the DPDK software, which contain information about the data transfer being performed, e.g., the destination physical addresses for the data and the number of bytes for the data transfer. Data are transferred between the custom logic and the memory of the on-board x86 processor through the PCIM AXI4 interface.
Fig. 5. General block diagram per AWS EC2 F1 FPGA instance.
SDE provides two AXI stream interfaces (H2C and C2H) to the custom logic to be able to send and receive data in streaming fashion.
Each accelerator FPGA partition has two global streaming interfaces, A2H and H2A, and they are connected to the SDE streaming interfaces as shown in Figure 5.
3.2.2 Chip to Chip Communication on AWS Through DPDK.
Chip to chip communication is a bottleneck in high-performance multi-chip implementations due to physical constraints. The most efficient way to interconnect chips is to wire them directly through very fast communication media. In the current version of the FPGA instances of the AWS cloud, FPGAs cannot directly access a virtual cloud network within the data center, but they are able to connect to the virtual cloud network through the network interface card of the on-board x86 CPU. In software, DPDK (1) achieves fast packet processing when communicating and dispatching tasks among several AWS EC2 F1 instances that are parts of our FPGA-based cloud hardware accelerator, and (2) allows us to bypass the operating system (OS) kernel, in order to achieve fast data transfer between the ENI connected to the on-board x86 processor and the FPGA custom logic.
In DPDK, the environmental abstraction layer (EAL) enables the application to gain access to low-level hardware resources and memory space. An EAL thread (lcore), which is a Linux pthread, executes the tasks issued by the
Eliminating network noise packets: Even if the FPGAs communicate with each other on a separate, dedicated subnet, in the AWS virtual cloud environment many unrelated Ethernet packets, not generated by our accelerator hardware and not useful for acceleration, arrive at an EC2 F1 instance, such as Address Resolution Protocol (ARP) packets. We have modified DPDK code to efficiently check for any packets not generated by our compiler-generated hardware. These incoming packets are dropped using only a few x86 instructions. They are not delivered to the FPGA; thus, we eliminate the network noise packets.
While using any software for network communication is not ideal, we used the DPDK software specifically to overcome a limitation of the current AWS EC2 F1 platform, since FPGAs cannot communicate with a network interface directly: in a future, for example ASIC-based, implementation of a cloud hardware accelerator generated by our compiler, each chip will be able to connect to a plurality of high-speed network interfaces directly.
4 MULTI-FPGA DESIGN FOR DES KEY SEARCH
In this section, we will explain how our HLS compiler extracts parallelism from sequential code for multi-FPGA system design for the DES key search application, how it generates the full system hardware with all the necessary communication components, and how it automatically builds an FPGA-based multi-instance cloud accelerator system in the AWS cloud. We will first start with an explanation of the DES computation using its high-level sequential description.
4.1 DES Algorithm (High-level Sequential Description)
We have implemented the original National Institute of Standards and Technology Data Encryption Standard (DES) specification FIPS PUB 46-3 [74] (reaffirmed on October 25, 1999, withdrawn on May 19, 2005 in favor of AES) verbatim, without any optimizations, in C.

Algorithm 2 shows the decryption procedure of DES, where the symbol \( || \) means concatenation and the symbol & denotes the logical AND operation. We provide this high-level algorithm of DES in order to explain the computation. Our main contribution is that our high-level compiler takes the same high-level description (written in C) and generates a very high-performance, distributed multi-FPGA hardware architecture, which has the potential to outperform all previous manually designed DES hardware architecture results.
Note that C does not have any native operations to represent hardware bit manipulation operations such as bit permutation. Such operations must be implemented with and, or, xor, shift, or other operations in C. Our compiler recognizes sequences of and, or, xor, shift, and other operations that are equivalent to a rearrangement of bits (or negations of bits) belonging to one or more registers, and implements such sequences of operations with a single Verilog concatenation of bits (or negations of bits) where possible. Our compiler also implements bit-width compression optimizations in registers and network payloads and attempts to avoid creating any flip-flops or wires for constant bits, redundant bits (which are copies of another bit), or dead bits. As a result, where possible, the compiler infers hardware registers and network payloads of a size smaller than the standard C data type sizes of 8, 16, 32, or 64 bits.
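As an illustration of this optimization (an example of our own, not taken from the paper's DES code), the following C function rearranges bits only through shift/mask/or operations, so a bit-level analysis can prove it is a pure permutation and emit it as the single Verilog concatenation `{x[3:0], x[7:4]}`:

```c
#include <stdint.h>
#include <assert.h>

/* Nibble-swap written as ordinary C shift/mask/or operations.
 * A bit-level analysis can prove this is a pure rearrangement of the
 * 8 input bits, so it can be emitted as the single Verilog
 * concatenation {x[3:0], x[7:4]} with no logic gates at all. */
static uint8_t nibble_swap(uint8_t x) {
    return (uint8_t)(((x & 0x0Fu) << 4) | ((x & 0xF0u) >> 4));
}
```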
Algorithm 3 shows the Feistel f function of DES. After the input to the f function is expanded from 32 bits to 48 bits, it is XORed with the provided subkey. An S-box operation is then applied to the result. In order to generate addresses for the S-box tables of DES, there is a special encoding (line 4 in Algorithm 3).
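For reference, this special encoding is the standard S-box addressing of FIPS PUB 46-3: for a 6-bit group \( b_5 b_4 b_3 b_2 b_1 b_0 \), the outer bits \( b_5 b_0 \) select the row and the middle bits \( b_4 b_3 b_2 b_1 \) select the column. The function name below is ours, not the paper's:

```c
#include <assert.h>

/* S-box address formation per FIPS PUB 46-3: for a 6-bit group
 * b5 b4 b3 b2 b1 b0, the row is the 2-bit value b5b0 and the column
 * is the 4-bit value b4b3b2b1; each S-box table is then indexed as
 * row*16 + column. */
static unsigned sbox_index(unsigned six_bits) {
    unsigned row = ((six_bits >> 4) & 0x2u) | (six_bits & 0x1u); /* b5,b0  */
    unsigned col = (six_bits >> 1) & 0xFu;                       /* b4..b1 */
    return row * 16u + col;
}
```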

Loop-independent dependences in the i-loop in Algorithm 3, e.g., between lines 3 and 4 and also between lines 4 and 5, can be implemented as parallel hardware [20] (or pipelined hardware, even if there is a true dependence on a variable or a memory location), but it is challenging for a compiler to automatically understand the immense parallelism in the given sequential code and generate parallel and pipelined hardware execution units without any rewriting of the code. The sequential description of DES actually has a lot of parallelism. A skillful hardware engineer is able to design a parallel hardware architecture for DES, but at the cost of a long design cycle; full system verification also requires very careful analysis and a long verification effort. In contrast, in the present study, a user of our proposed framework can automatically create a multi-FPGA accelerator system from a simple sequential C description without rewriting the C code, without writing any Register Transfer Level code, and without having any cloud-level expertise. How our HLS compiler extracts the parallelism and generates the hardware is explained in the following section.
4.2 Compiler Overview and Parallelization
Our compiler maps a sequential program such as the DES cracker whose code is shown in Algorithm 1 to a parallel hardware accelerator such as the one in Figure 1. The hardware components are highly pipelined FSMs (shown as rectangles). Very efficient, lightweight packet switching networks (shown as ovals) are used for communication between hardware components. The hardware follows the loop hierarchy of the program, where a (possibly duplicated) inner loop is considered a child of its outer loop, and an outer loop is considered the parent of its inner loop copies. In Figure 1, one can see the outer loop of the DES cracker application and its duplicated inner loops. However, because of the need to duplicate inner loops of an outer loop to achieve parallelism, the initial design will normally not fit in a single FPGA. The number of inner loop copies per outer loop can be estimated by the compiler or can be specified by the user for additional control.
Types of packet switching networks include incomplete butterfly multi-stage networks (“incomplete” meaning that the number of input ports and/or output ports of the network need not be a power of two), load balancing “task networks” and linear array networks.
We will provide a simplified high-level summary, illustrated in Figure 6, of our HLS compiler, which demonstrates one possible way to implement our compilation approach; there are many other ways. Our HLS compiler uses the object code and assembly code outputs of an unmodified gcc/g++ compiler as inputs and operates as a linker that has hardware acceleration capabilities. All phases of our HLS compiler are encapsulated in a gcc/g++ compatible command line tool. A compiler-generated executable that invokes a hardware accelerator also behaves like an ordinary, gcc/g++ generated executable.
Fig. 6. Overview of the proposed framework for multi-FPGA hardware design synthesis from a sequential C/C++ program.
The compiler phases comprise the following:
① The gcc/g++ compiler is invoked from within our HLS compiler and converts the user’s C/C++ program into an x86 assembly language file. The assembly language file also contains debugging data produced by a -g flag, for providing a degree of source-level information to our HLS compiler. The user also indicates the function(s) to be accelerated. Optimizations and scheduling are performed only on the functions to be accelerated.
② An intermediate code translator translates the x86 assembly language file to unoptimized intermediate code, consisting of RISC primitives that implement each x86 instruction by following the x86 architecture specification verbatim.
③ An optimizer applies standard and x86-specific optimizations to the unoptimized intermediate code and obtains clean, RISC-like optimized intermediate code.
④ A scheduler applies hierarchical software pipelining to the optimized intermediate code, as if targeting an extremely wide-issue architecture, as explained below. As a result, scheduled, software-pipelined program regions are obtained.
⑤ An FSM generator converts each scheduled, software-pipelined program region into an FSM in Verilog. Columns 44–52 of [42] provide an example with more details of this transformation.
⑥ A design integrator combines the FSMs obtained from the user’s C/C++ code with Verilog modules picked from the library of the compiler, to create a complete non-partitioned flat accelerator design (as if the target chip had infinite area) by wiring together the FSMs, application-specific on-chip and cross-chip networks, application-specific on-chip memories, floating point units, top task adaptor, auto-pipelined interfaces, response re-ordering units, and so on, selected from the library. The final duplication counts of loop FSMs, for achieving performance through hierarchical software pipelining, are also decided at this stage.
⑦ A partitioner then partitions the flat accelerator design into multiple chips and creates I/O controllers in each chip for cross-chip communication; the result is a partitioned accelerator design.
⑧ An executable packer then combines the un-accelerated part of the software, an executable manager program, and the locations of the files in AWS S3 storage that will contain the FPGA image identifiers when they become ready, and creates an executable. When the executable is started, the manager program is invoked first to do housekeeping actions such as verifying the readiness of FPGA images and initializing FPGA instances, before starting the un-accelerated part of the software. When the user’s software application attempts to execute the C/C++ function intended for hardware acceleration, communication messages are exchanged between the accelerator and the software application, to realize the hardware acceleration and to maintain memory coherence between the software application and the accelerator. The executable packer stage also creates a stand-alone Verilog tarball from the partitioned accelerator design and starts Vivado processing at AWS.
⑨ An AWS-FPGA image creator accepts the Verilog tarball and passes it on to FPGA Developer AMI instances on AWS, which then, in parallel, execute scripts that go through all the steps to convert the Verilog files to a Vivado design checkpoint, which is, in turn, submitted to AWS, which, in turn, delivers the FPGA image for each partition. When all the FPGA images are ready and the required FPGA instances are up, the previously created executable will run with hardware acceleration. Prior to this point, the executable will still run, but in software only.
For verification of a multi-chip design, the executable packer also has the option to enable multi-chip Verilog simulation in software. The Verilator tool is used to create a Verilog simulation executable for each hardware partition. The executable packer packs together the manager program, the un-accelerated part of the software application, and the Verilog simulators for each hardware partition. When such an executable is started, the manager program first distributes the executables to multiple servers, before starting the un-accelerated part of the software application. Then, when the software application attempts to execute the C/C++ function intended for hardware acceleration, simulated multi-chip hardware acceleration is realized through message exchanges among the software application and the Verilog simulators.
Parallelization consists of applying software pipelining, which is mainly based on enhanced pipeline scheduling [71], recursively in a bottom up manner following the loop hierarchy in reverse post-order enumeration. First, an inner loop is pipelined. Then, by duplicating the inner loop and adding a result-reordering hardware circuit, the collection of inner loops is made to look like a simple pipelined multiplier or simple store queue (in case the inner loop is executed only for side effects), that delivers an inner loop invocation every cycle after its pipeline is full, when dependences and resources permit. After inner loops of an outer loop are thus converted to simple pipelined units, the outer loop is software pipelined, wherein the inner loops appear like simple instructions with appropriate dependences and latencies, as observed from the outer loop. Because inner loops are made to appear as simple instructions, the usual software pipelining algorithm is applied to the outer loop as well. The algorithm continues recursively in this manner with outer-outer loops, outer-outer-outer loops, and so on, potentially creating a massive amount of additional hardware at each loop level. The outermost program, not being a loop, is not pipelined, it is merely scheduled.
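The bottom-up order of this recursion can be sketched structurally as follows. The `loop_node` type and function names are our own illustration; the real scheduler of course does far more than set a flag:

```c
#include <stddef.h>
#include <assert.h>

/* Structural sketch of the bottom-up hierarchical software pipelining
 * order described above: inner loops are pipelined (and duplicated)
 * first; then each outer loop is pipelined, treating its inner-loop
 * copies as simple pipelined units with latencies. */
typedef struct loop_node {
    struct loop_node *children;   /* inner loops of this loop */
    size_t            n_children;
    int               pipelined;  /* set once this loop level is done */
} loop_node;

static void pipeline_hierarchy(loop_node *loop) {
    /* Recurse into inner loops first (bottom-up over the hierarchy). */
    for (size_t i = 0; i < loop->n_children; i++)
        pipeline_hierarchy(&loop->children[i]);
    /* At this point every inner loop looks like a simple pipelined
     * instruction, so software pipelining applies to this loop too. */
    loop->pipelined = 1;
}
```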
Because our compiler performs hierarchical software pipelining in the presence of memory dependences and in the presence of arbitrary control flow, and since it creates synchronization circuits tailored to the application, it constitutes a general algorithm that can accept any single-threaded program as input.
To understand why reordering of the results of copies of inner loops is necessary, before an outer loop observes these results, again consider the sequential DES cracking algorithm specified in this paper. In general, inner loops of an outer loop can complete their work and return their result after an unpredictable delay. For example, while the inner loop of iteration p of the outer loop of the DES cracker application is continuing, the inner loop of iteration \( p+1 \) may be speculatively started, and the inner loop of iteration \( p+1 \) may find a correct key immediately, and may finish immediately. The outer loop must confirm that the inner loop of its iteration p will not find a correct key, before looking at the result of the inner loop of its iteration \( p+1 \), to exactly replicate the sequential algorithm specification while achieving high parallelism. When it is guaranteed that a parallel hardware accelerator implements a sequential algorithm exactly, the authors believe that the result is conceptual simplicity, which, in turn, improves the designer’s productivity. A reordering hardware unit can be implemented, for example, by a special FIFO-like queue, where elements are read and removed sequentially from the front of the queue (blocking when the desired next element is not yet present in the front of the queue), and where elements are written with random access anywhere within the special FIFO-like queue.
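One software sketch of such a FIFO-like reordering unit follows (our own illustration, with an assumed capacity; the hardware would block instead of returning false, and flow control would keep writers from running more than the capacity ahead):

```c
#include <stdbool.h>
#include <assert.h>

#define RQ_SLOTS 8u   /* capacity; assumed value for illustration */

/* Results arrive out of order, tagged with their iteration sequence
 * number, and are written with random access anywhere in the queue;
 * the consumer reads strictly in sequence-number order, and a read
 * fails while the next-needed element has not arrived yet. */
typedef struct {
    int      data[RQ_SLOTS];
    bool     valid[RQ_SLOTS];
    unsigned head;               /* next sequence number to deliver */
} reorder_q;

static void rq_write(reorder_q *q, unsigned seq, int result) {
    q->data[seq % RQ_SLOTS]  = result;   /* random-access write */
    q->valid[seq % RQ_SLOTS] = true;
}

static bool rq_try_read(reorder_q *q, int *out) {
    unsigned slot = q->head % RQ_SLOTS;
    if (!q->valid[slot])         /* next in-order result not here yet */
        return false;
    *out = q->data[slot];
    q->valid[slot] = false;
    q->head++;
    return true;
}
```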
4.3 Partitioning
Once the flat, non-partitioned design of the hardware accelerator is obtained, it will normally not fit in a single chip, and will need to be partitioned into multiple chips. The hardware components of the flat hardware accelerator design are partitioned into parts that can each fit into a single chip.
Figures 1 and 2 illustrate the partitioning of the example DES cracker application. Figure 1 depicts the flat non-partitioned design containing all 128 inner loops. Figure 2(a) is partition 1, containing inner loops 0–15. Figure 2(b) represents partitions 2–8 of the hardware accelerator, each containing 16 inner loops, numbered 16–31, 32–47, …, 112–127, respectively.
Note that there can be a plurality of networks in the flat, non-partitioned hardware accelerator design. When two hardware components x and y connected to a network n in the original flat hardware accelerator are assigned to different chips \( \mathsf {A} \) and \( \mathsf {B} \), respectively, after partitioning, the first component x in chip \( \mathsf {A} \) can still send a message to the second component y in chip \( \mathsf {B} \), without being aware that the design is partitioned into multiple chips. The partitioning phase is depicted in Figure 7 and is achieved as follows:
Fig. 7. An illustration of our partitioning.
– We will call the destination port numbers of the original network n of the flat hardware accelerator the virtual destination port numbers.
– The message source component x in chip \( \mathsf {A} \) sends a message to destination component y in chip \( \mathsf {B} \) over network n, not being aware that the design is partitioned, using virtual destination port number y in the message from x to y.
– We instantiate I/O controllers in each chip to manage cross-chip communication over a scalable cross-chip network. As the cross-chip network in this study, we used a virtual private cloud network in the AWS EC2 cloud platform, which allows fast communication between instances in the same “availability zone” and “placement group” [8], minimizing communication delays. The part of network n that remains locally in chip \( \mathsf {A} \) routes the message from x to y to the I/O controller of chip \( \mathsf {A} \), since the destination component y is not in chip \( \mathsf {A} \). This is done by a small combinational logic or ROM routing table that maps the virtual destination port y to the physical destination port of the local partial n network connected to the local I/O controller. The routing table is normally accessed in a single cycle and can be further optimized if the number of components on the original network n is a power of two and the components are distributed evenly to chips. The physical destination port routing bits are prepended to the message from x to y, to guide the message through the local partial network n. These physical routing bits are discarded when the local physical destination port (in this case connected to the I/O controller of chip \( \mathsf {A} \)) is reached.
– The I/O controller of chip \( \mathsf {A} \) then converts the payload size of the message from x to y to a standard “flit” size (512 bits), and adds a header that indicates chip \( \mathsf {B} \) as the destination chip, as well as the network number indicating which network this message should enter in the destination chip \( \mathsf {B} \) after reaching it; this will be network n.
– After the message from x to y arrives at the I/O controller of chip \( \mathsf {B} \) over a scalable chip-to-chip network (in this case an AWS virtual cloud network), its header is used to make the message enter the correct local partial network n in chip \( \mathsf {B} \). The header of the message is deleted, and the payload size of the message from x to y is then converted back to its original size.
– The message from x to y then enters the local partial network n of chip \( \mathsf {B} \) to go from the I/O controller of chip \( \mathsf {B} \) to the component y in chip \( \mathsf {B} \). For this purpose, another small lookup table is used to map the virtual destination port y to the physical destination port of the local n network connected to the y component. The physical destination port number is prepended to the message as it enters the local partial n network at the I/O controller of chip \( \mathsf {B} \) and is used to guide the message within the local partial n network, so it goes to the destination component y. When the destination component y is reached, the physical destination port routing bits are discarded.
– Thus, components x and y need not even be aware that the hardware accelerator is partitioned and that they are on different chips. This approach avoids design changes inside the already-created components that could otherwise be necessary for creating a partitioned hardware accelerator after the flat hardware accelerator is created.
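For the optimized power-of-two case mentioned above, the virtual-to-physical mapping reduces to a division and a remainder by the per-chip component count, i.e., to shifts and masks. The following sketch uses counts matching the 128-inner-loop, 8-chip DES example; the `route()` helper and the I/O-controller port convention are our own modeling assumptions:

```c
#include <assert.h>

/* TOTAL_PORTS components on the original flat network, distributed
 * evenly over CHIPS chips, both powers of two, so / and % become
 * shift and mask in hardware.  (The general case uses a small ROM
 * routing table instead.) */
#define TOTAL_PORTS 128u
#define CHIPS         8u
#define PER_CHIP   (TOTAL_PORTS / CHIPS)   /* 16 components per chip */

/* Which chip hosts virtual destination port y. */
static unsigned dest_chip(unsigned y)  { return y / PER_CHIP; }

/* Physical port of y on its own chip's local partial network. */
static unsigned local_port(unsigned y) { return y % PER_CHIP; }

/* Route decision made inside chip `here`: a message for y goes either
 * to a local physical port, or to the I/O controller port (modeled
 * here as PER_CHIP, one past the local component ports). */
static unsigned route(unsigned here, unsigned y) {
    return (dest_chip(y) == here) ? local_port(y) : PER_CHIP;
}
```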
4.4 Automatically Generated Inner-Loop FSMs
Our compiler automatically generates a finite state machine (FSM) for each loop copy from the given sequential C/C++ code (in this case, DES
4.5 I/O Controller
The compiler-generated Verilog code of the I/O controller in the first FPGA chip, which performs chip-to-chip and chip-to-host communication, is partially given in Listing 2. For each FPGA chip, there is one specialized I/O controller unit. Only the first FPGA has an interface with the host CPU, through the host I/O module, for chip-to-host communication. The I/O controller that resides in the first FPGA chip is responsible for forwarding incoming packets from the host to the destination FPGA chip and outgoing packets from the FPGA chips to the host unit. The I/O controllers instantiated in the other FPGA chips do not have a host interface. Each I/O controller sends and receives UDP packets; thus, all the information necessary for creating a UDP packet is embedded into the I/O controller, e.g., the MAC addresses of the instances, and the UDP source and destination port numbers.
Loop FSM units in the chip can communicate with the I/O controller through an on-chip packet switching network (or directly, through a network elision optimization [42], in case the packet switching network has only one input and one output port), in order to send a packet to the external interface (out of the chip). The I/O controller accepts data from the loop FSMs through loop slave interfaces. The width of the received data might be different than the external I/O bus width (see, for instance, 263-bit
If a packet is received from the outside of the chip through the response channel of the external I/O master interface that is being forwarded into a component (e.g., loop FSM) inside the chip, the previous operations are performed in reverse. First, the data are passed through the
4.6 The Second Enhanced Network Interface
All AWS EC2 instances of interest for FPGA acceleration already have a first Elastic Network Interface (ENI), which can, for example, be used for connecting via normal remote
The second network interface is brought up after an instance is started and reaches the “system-status-OK” state. The second network interface is brought down before stopping or terminating an EC2 F1 instance.
The second network interface does not need any IPV4 public address. It is used only for FPGA to FPGA communication. Other communication, such as
4.7 Implementation of Compiler-level SLR Crossing with Auto-pipelining
The AWS
The auto-pipelining feature provided by the Vivado tool inserts additional pipeline registers automatically during the placement phase to help achieve timing closure (see pp. 79–84 of [93]). In our hardware accelerator model, our HLS compiler uses this feature to let the Vivado tool add pipeline registers where necessary, in order to meet the 250MHz frequency target. The compiler-generated Verilog-level design spans multiple SLR regions, and there is communication between components in different SLR regions; hence, it is challenging to meet the 250MHz frequency target if SLR crossings are not optimized. Auto-pipelinable registers (i.e., with “autopipeline” attributes indicated in the RTL) are therefore inserted by our compiler into our hardware accelerator designs, for example, in the I/O controller unit, in linear array network stage units, and in pass-through units, which are used for receiving messages coming in to a loop FSM machine.
Figure 8 shows how we enable auto-pipelined registers with a simple credit-based flow control between the input and output streaming interfaces of the I/O controller unit, linear array networks, and pass-through units. The Vivado tool decides on the number of pipeline register stages that are needed, during the placement of our design, to achieve timing closure. The compiler-generated Verilog code carries the necessary auto-pipeline attribute information.
Fig. 8. Auto-pipelining register insertion (via Vivado design tool) with our credit-based flow control.
Our contribution in inserting auto-pipelinable registers includes a credit-based flow controller design implemented via a small counter (
The auto-pipelined unit will be ready to receive data from the input side (will assert input \( \texttt {m_ready}==1 \)) if and only if:
The logic responsible for changing the
– The counter is initialized to the number of FIFO elements at power-up time.
– The counter is decremented when:
– The counter is incremented when:
– Otherwise, the counter value is unchanged.
For a slightly better timing margin,
Using the circuit described above and also in Figure 8, a flow control is enabled by simply decrementing or incrementing the
Compare the simple credit-based flow control that we developed (from a source unit to a sink unit in a single chip, with a fixed delay determined after the Vivado placement phase) with well-known credit-based flow control methods used in networks (where the sender and receiver are often far away from each other), such as the N23 method summarized in Kung et al. [63]. In our Register Transfer Level solution, the sender (input) side does not need any handshaking and/or packets from the receiver (output) side indicating how many empty buffer elements are left in the receiver buffer. Since, in our proposal, the flow control is done in the same circuitry with the same clock, even if the sender and receiver sides are in different SLRs, it is accomplished by simple time-delayed synchronous and continuous control signals arriving from the receiver side (
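The counter behavior described above can be sketched cycle by cycle in C. The signal names and the binding of the initial credit count to the auto-pipeline limit are our own illustrative choices, not the paper's RTL:

```c
#include <stdbool.h>
#include <assert.h>

#define FIFO_DEPTH 12u   /* matches the auto-pipeline limit of 12 */

/* Cycle-level sketch of the credit-based flow control of Figure 8:
 * the sender may inject a word into the auto-pipelined path only
 * while credits remain; a credit is consumed on each send and
 * returned each time the receiver FIFO pops an element, so the
 * receiver FIFO can never overflow regardless of how many pipeline
 * stages Vivado inserts on the path. */
typedef struct { unsigned credits; } flow_ctl;

static void fc_init(flow_ctl *fc)          { fc->credits = FIFO_DEPTH; }
static bool fc_m_ready(const flow_ctl *fc) { return fc->credits > 0; }

/* One clock edge: `sent` = a word entered the pipeline this cycle,
 * `popped` = the receiver consumed a word this cycle. */
static void fc_step(flow_ctl *fc, bool sent, bool popped) {
    if (sent && !popped)      fc->credits--;  /* one more word in flight */
    else if (popped && !sent) fc->credits++;  /* one buffer slot freed   */
    /* sent && popped, or neither: credit count unchanged */
}
```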
Figure 9 illustrates the FPGA layout of the placed and routed first partition (see the block diagram of the first partition in Figure 2). The first partition is special compared to the other FPGA chip partitions, since it has the top task adaptor and the outer-loop finite state machine. Inner-loop FSMs have a rectangular shape in the layout and are labeled with different colors. The AWS shell logic is colored orange in the layout and spans SLR0 and SLR1. The rest of the SLR0 and SLR1 resources and the complete SLR2 resources can be utilized for user custom logic. As shown, the DES key search hardware accelerator spans three SLRs, and no single inner-loop FSM spans multiple SLRs. For communication between inner-loop designs, auto-pipelining is enabled to achieve the high frequency target.
Fig. 9. Screenshot from Vivado design tool of placed and routed design (8 FPGAs, 16 inner loops per FPGA) for the first partition (depicted in Figure 2(a)) with host unit, top task adapter, and task network (colored with red and it has logic in all the three SLRs). I/O controller is colored with yellow. The inner loops (rectangular blocks) are colored with different colors for better visualization. AWS Shell logic (orange) spans SLR0 and SLR1.
The number of register stages that are inserted by Vivado for each auto-pipelining module becomes known only after the floorplanning step during the placement phase; hence, the number of registers remains unknown before this step. We looked at each placed and routed partition (chip) to see the number of registers that were added into the potentially long paths where our compiler instantiated auto-pipelining modules in the RTL (a potentially long path is one that may traverse a longer distance or may cross multiple SLRs). Since we have set the auto-pipeline limit to 12 in this case, for each path where we use an auto-pipelining module, a varying number of register stages, from 1 to 6, is inserted by Vivado. Figure 10 shows two different paths where auto-pipelining is applied: one is a longer path (six registers were added by the Vivado tool) and the other is a shorter path (only one register was added by the Vivado tool). We infer that Vivado adds more register stages when the path is long (see Figure 10(b)) and fewer register stages (possibly one) when the path is short (see Figure 10(a)). Having more register stages in the longer paths is acceptable in our design approach, since our components communicate not with ordinary wires but with messages going through light-weight packet switching networks, and since the function of our compiler-generated design is unaffected by message latencies. The added latencies also do not hurt the achievable theoretical performance. Therefore, the auto-pipelining technique, in conjunction with our credit-based flow control enhancement, solved the timing issues, enabling our compiler-generated design to reach 250MHz for high performance, despite the inevitable SLR crossings.
Fig. 10. FPGA layout of two different auto-pipelined paths. Yellow-colored paths illustrate two different auto-pipelined paths with different amounts of register stages.
4.8 Maintaining gcc Compatibility in the Presence of Extremely Long FPGA Image Generation Times
Generating AWS-FPGA images from the Verilog files using Vivado can take many hours and, therefore, cannot become part of a normal gcc-like compiler flow. Since, in our case, the hardware accelerator (executing on a cluster of FPGAs and communicating with the un-accelerated part of the software) is functionally equivalent to the original sequential code fragment it was compiled from, the hardware accelerator is analogous to the next tier of compiler optimization after the highest compiler optimization tier in a just-in-time compilation system [16]. While we are not currently determining hot application executables and hot functions within these executables dynamically by profiling (a user must currently supply this information), executables created by our gcc-compatible HLS compiler have the property that they can initially be run in software, while Vivado processing continues in the background. The software application is then seamlessly replaced by its hardware-accelerated version as soon as its required FPGA images all become ready, asynchronously, after many hours of Vivado processing. Below we explain how we have prototyped this seamless replacement. Our vision is a future operating system where large parts of frequently executed user applications, and even large parts of the OS kernel, can be compiled into multi-chip hardware accelerators over time, without disruptions to the users.
For this reason, when starting a job to create an FPGA image from a set of Verilog files:
The net result is that we have created a gcc-compatible compiler, which keeps working seamlessly even when the FPGA images are not ready, because of the extremely long Vivado processing time. When the FPGA images are finally ready, the executable is elevated to the next compiler optimization tier, but its function stays the same, just as in a Hot Spot Just in Time compiler.
5 EXPERIMENTAL ANALYSIS
Our HLS compiler is able to take a sequential DES key search code as input and automatically generate a DES key search hardware accelerator with different design parameters, e.g., the number of FPGAs and the number of inner loops per FPGA. Three design configurations are compared here: (1) eight FPGAs with eight inner loops per FPGA, (2) two FPGAs with 16 inner loops per FPGA, and (3) eight FPGAs with 16 inner loops per FPGA, all running at 250 MHz.
5.1 Results and Evaluation
Preliminary experimental results for DES key search hardware accelerators in the cloud automatically compiled for multiple FPGAs are presented in Table 2.
†Execution time on the x86 CPU becomes very long beyond this point, since the key space is too large for a sequential execution to complete the search in a reasonable time; beyond this point, the CPU experiment takes hours or days to complete.
Table 2. Performance Results of DES Key Search on x86 CPU and our Multi-FPGA System Hardware Accelerator Automatically Created within AWS Cloud Instances in a Push-button Way
In particular, referring to line 4 of this table, an application-specific hardware accelerator design consisting of eight FPGAs, each containing 16 inner loops (128 total inner loops), was compiled from the sequential single-threaded DES key search C program of Algorithm 1, and achieved a frequency of 250MHz, and a performance that is \( (8796 \text{ sec})/(0.197 \text{ sec}) \approx \mathbf {44,\!600} \) times faster than an AWS EC2 m5.8xlarge Xeon x86 machine running the original sequential single-threaded C program the hardware accelerator was compiled from.
The algorithm, Algorithm 1, is in fact a brute-force exhaustive search of the DES key space, where the search stops immediately after a correct key is found, as we have summarized previously:
Given a plaintext and ciphertext pair and an initial 56-bit key to start the search from, try all candidate keys starting from the initial key in strict sequential order and stop immediately when a key that decrypts the ciphertext correctly obtaining the plaintext is found.
The sequential-natured “find the first correct key” feature makes the algorithm more difficult to parallelize with traditional means as previously discussed. If the first key so found was not correct (was a false positive), the accelerated procedure can be called again to continue where it left off. On average, the correct key will be found after searching half the key space, where the whole key space consists of \( 2^{56} \) or about \( 72 \times 10^{15} \) candidate keys.
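Spelled out in 64-bit arithmetic, the key-space figures quoted above check out as follows (a trivial verification, using our own helper names):

```c
#include <stdint.h>
#include <assert.h>

/* The DES key space holds 2^56 candidate 56-bit keys; on average the
 * sequential search of Algorithm 1 examines half of them before a
 * correct key is found. */
static uint64_t des_key_space(void)       { return 1ULL << 56; }
static uint64_t expected_keys_tried(void) { return des_key_space() / 2; }
```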
Algorithm 1 has a two-level nested loop: an outer loop beginning on line 15 and an inner loop beginning on line 17. As also mentioned in earlier sections of this article, our compiler generates FSM units for the loops and duplicates the inner-loop FSMs according to the duplication factor, which is defined by the user or estimated by the compiler. In our case, an outer loop dispatches blocks of keys to a multitude of inner loop iterations, and reorders the results from the inner loops (each returning either “key not found in block of potential keys” or the first working key in the block of potential keys) with extra reordering hardware. Before knowing whether the correct key is within block n, the outer loop speculatively dispatches blocks \( n+1, n+2, \ldots, n+(\text{number of inner loop FSMs})+\delta \), so that all inner loop FSMs are likely kept busy, where \( \delta \) is an extra speculation amount that increases the chances that all inner loops will be kept busy.
Each inner loop FSM takes a block of potential keys as input and processes the corresponding block by delivering a decryption every cycle once its pipeline is filled. It takes 38 cycles to fill its pipeline. Once the pipelines of the inner-loop FSMs are full, all the operations are executed concurrently, and one new decryption result is delivered in every cycle.
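A back-of-envelope throughput model follows from this description: with K inner-loop FSMs at frequency f, steady-state throughput is K·f keys per second once the 38-cycle pipelines are full. The helpers below encode this simple model; they are our own sketch and ignore dispatch and network overheads, which are not quantified here:

```c
#include <stdint.h>
#include <assert.h>

enum { FILL_CYCLES = 38 };  /* pipeline fill latency from the text */

/* Steady-state keys tried per second with k inner-loop FSMs, each
 * delivering one decryption per cycle at freq_hz. */
static uint64_t keys_per_second(uint64_t k, uint64_t freq_hz) {
    return k * freq_hz;
}

/* Idealized cycles to process n_keys with k inner loops: pipeline
 * fill plus one key per loop per cycle (ceiling division). */
static uint64_t cycles_for(uint64_t n_keys, uint64_t k) {
    return FILL_CYCLES + (n_keys + k - 1) / k;
}
```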
In the experiments below, time spent in function
In our experiments, we call the DES cracking application from the following terminal command running on the Host machine in the cloud:

The x86 processor being compared against the FPGA hardware accelerator is an m5.8xlarge instance on AWS, “Intel Xeon Platinum \( 8,\!000 \) series processor (Skylake-SP or Cascade Lake) with a sustained all core Turbo CPU clock speed of up to 3.1 GHz”. The original C code is compiled with the following:

for experiments using the x86 CPU alone. Note that because none of the loops of the sequential DES key search algorithm are DOALL loops, due to the data-dependent, conditional exit from the loops on line 19 of Algorithm 1, the x86 CPU baseline implementation is not multi-threaded. Our hardware accelerator was compiled from Algorithm 1 as written, including the data-dependent, conditional exit from its loops. If multi-threading is desired in the x86 CPU, the code must be rewritten so that the data-dependent, conditional exit is removed, and an embarrassingly parallel algorithm specification for DES key search must be used (“try all candidate keys and save the correct ones”). Thus, the performance comparison of our hardware accelerator to the necessarily sequential, single-threaded Algorithm 1, from which the hardware was compiled, is a logical comparison.
It is also important to compare our hardware accelerator compiled from Algorithm 1 as written, to an embarrassingly parallel OpenMP implementation on the same 32-thread m5.8xlarge x86 machine, in terms of candidate keys tried per second. We will provide this comparison in Table 4. In short, as compared to the sequential single-threaded Algorithm 1 as written, the embarrassingly parallel OpenMP implementation examines 15.8x more candidate keys per second, but performs more work on average.
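The embarrassingly parallel rewrite can be sketched as follows (with a toy stand-in for DES decryption and illustrative names, not the paper's actual code): removing the data-dependent exit makes the loop DOALL, so an OpenMP pragma applies, but every candidate key is now examined:

```c
#include <stdint.h>

/* Toy stand-in for DES decryption (illustration only). */
static uint64_t toy_decrypt(uint64_t ct, uint64_t key) { return ct ^ key; }

/* Embarrassingly parallel rewrite: the data-dependent early exit is
 * removed, every candidate key is tried, and all matches are counted.
 * The loop is now DOALL and OpenMP-parallelizable, but it performs
 * more work on average than the early-exit sequential version. */
uint64_t count_matches(uint64_t pt, uint64_t ct,
                       uint64_t start_key, uint64_t n_keys) {
    uint64_t matches = 0;
    #pragma omp parallel for reduction(+:matches)
    for (uint64_t i = 0; i < n_keys; i++) {
        if (toy_decrypt(ct, start_key + i) == pt)
            matches++;          /* save the match; never break out */
    }
    return matches;
}
```

Compiled without OpenMP support, the pragma is simply ignored and the function runs sequentially, which illustrates that the rewrite changes the work performed, not just the schedule.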
To be able to compare the FPGA-based hardware accelerator’s execution time to that of an x86 CPU implementation of the same sequential code, the run time needs to be shortened. For the present experiment, a random 64 bit plaintext
The Register Transfer Level accelerator designs were also tested for correctness with random inputs.
Table 2 contains eight lines, and each line corresponds to a specific key space. After the fourth line, the key space becomes so large that execution on the x86 CPU takes a very long time (finding the key in a reasonable time is not possible). In the first line of this table, the starting candidate key is
In the fourth-line experiment of Table 2 with the eight-FPGA, 16-inner-loops-per-chip design, about 5.016 billion keys (see column 3) are searched in 0.197 seconds, which means a performance of 25.5 billion keys per second is achieved with eight FPGAs and 16 inner loops per chip. The performance improvement vs. the x86 CPU running the sequential single-threaded C program the accelerator was compiled from is about \( 44,\!600 \)x.
In the fourth-line experiment of Table 2 with the eight-FPGA, eight-inner-loops-per-chip design, about 5.016 billion keys are searched in 0.360 seconds, which means a performance of 13.9 billion keys per second is achieved with eight FPGAs and eight inner loops per chip. The performance improvement vs. the x86 CPU is about \( 24,\!400 \)x.
In the fourth-line experiment of Table 2 with the two-FPGA, 16-inner-loops-per-chip design, about 5.016 billion keys are searched in 0.629 seconds, which means a performance of 8.0 billion keys per second is achieved with two FPGAs and 16 inner loops per chip. The performance improvement vs. the x86 CPU is about \( 13,\!900 \)x.
Note that when dependences permit, our HLS compiler places no limit on the number of FPGAs when creating an FPGA-based hardware accelerator from the sequential high-level description. The number of inner-loop copies per chip is constrained by the amount of logic resources available in the target FPGA/ASIC device. But if there are enough hardware logic resources to instantiate many inner-loop copies and dependences permit, our compiler is able to efficiently utilize these resources even if they are in different SLR regions, in different chips, or in different instances.
Our HLS compiler automatically synthesizes, places, and routes the design for the AWS EC2 F1 instances FPGA targets, using the AWS infrastructure in the cloud, and, therefore, can also collect area and timing information for each FPGA chip. Table 3 summarizes our place-and-route results measured with Vivado 2020.2. In our ongoing work, we are optimizing component placements in the FPGA devices to achieve higher frequency targets, as we will mention in the following section.
Table 3. Place-and-Route Results for the First Partition (FPGA 1) of DES Key Search (with 8 FPGAs, 16 Inner Loops Per FPGA Experiment) Running at 250MHz on AWS f1.2xlarge Instance (xcvu9p-flgb2104-2-i)
5.2 Discussion and Ongoing Work
By building upon our partitioning techniques for multiple chips, we are considering compiler directed placement of hardware components into SLRs and regions smaller than SLRs as ongoing work, which can potentially improve the frequency of the hardware accelerator design.
The eight-FPGA design with 16 inner loops per FPGA delivers only 25.5 billion keys per second, but its peak performance at 250 MHz should be \( \begin{equation*} (250 \cdot 10^{6} \cdot 8 \cdot 16) \approx 32 \text{ billion keys per second}, \end{equation*} \) even with the wasted work due to speculation. Similarly, the eight-FPGA design with eight inner loops per chip delivers only 13.9 billion keys per second, while its peak performance should be 16 billion keys per second. The inner loops are possibly not getting enough work; we are investigating the reasons. On the other hand, the two-FPGA, 16-inner-loops-per-chip design does achieve its peak performance of about 8 billion keys per second.
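The peak figures above follow from one decryption result per cycle per inner-loop FSM; a trivial C helper (illustrative only, with names of our choosing) reproduces them:

```c
/* Peak throughput: one decryption result per cycle per inner-loop FSM,
 * so peak keys/second = clock * FPGAs * inner-loop copies per FPGA. */
double peak_keys_per_sec(double clock_hz,
                         unsigned num_fpgas, unsigned loops_per_fpga) {
    return clock_hz * num_fpgas * loops_per_fpga;
}
```

For the three designs in the text, this gives 32, 16, and 8 billion keys per second, respectively.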
The main goal of the present work is to showcase our HLS compilation technology on a computation intensive problem expressed using a sequential abstraction. However, actual DES key search can also be done with increased speed using our technology. The average running time in seconds for a DES cracker that stops as soon as a correct key is found can be simply calculated as \( \begin{equation*} \frac{(2^{56})/2}{(\texttt {number of chips}) \cdot (\texttt {keys per second per chip})}. \end{equation*} \) As only a guess, if 32 inner loops can be placed in each FPGA chip by various improvements, each FPGA will search very nearly 8 billion keys per second. Then, with \( 1,\!024 \) FPGAs on AWS, the full 56 bit DES cracking problem can be solved in: \( \begin{equation*} \frac{(2^{56})/2}{(1024)\cdot (8 \cdot 10^{9})} = 4,\!398 \text{ seconds,} \end{equation*} \) or 1 hour and 13 minutes on average, which will beat the performance of all existing DES cracking implementations, at a cost of about $\( 2,\!061 \), assuming on-demand pricing on AWS at $1.65 per FPGA per hour (billed by the second). When the user is billed by the second as in the AWS platform, and when performance increases linearly with the number of FPGAs, increasing the number of FPGAs reduces the solution time for a problem but does not significantly change the cost. Increasing the number of FPGAs to \( 2,\!048 \), \( 4,\!096, \) and \( 8,\!192 \) FPGAs, will reduce hardware accelerator completion time by \( 1/2 \) (36.6 minutes), by \( 1/4 \) (18.3 minutes) and by \( 1/8 \) (9.2 minutes), respectively, while the cost stays about the same.
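The average-time and cost arithmetic above can be captured in two small C helpers (function names are ours; the rate argument corresponds to the quoted on-demand price of $1.65 per FPGA per hour, billed by the second):

```c
/* Average time to find the key when the search stops at the first hit:
 * half of the 2^56 key space divided by the aggregate throughput. */
double avg_crack_seconds(unsigned num_chips, double keys_per_sec_per_chip) {
    return (double)(1ULL << 55) / (num_chips * keys_per_sec_per_chip);
}

/* On-demand cost at a given hourly rate per FPGA. Doubling the chip
 * count halves the completion time but leaves the cost about the same. */
double crack_cost_dollars(unsigned num_chips, double keys_per_sec_per_chip,
                          double rate_per_fpga_hour) {
    return avg_crack_seconds(num_chips, keys_per_sec_per_chip)
           / 3600.0 * num_chips * rate_per_fpga_hour;
}
```

For 1,024 FPGAs at 8 billion keys per second each, `avg_crack_seconds` gives about 4,398 seconds, matching the calculation in the text, and the cost is on the order of $2,061 at $1.65 per FPGA-hour.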
In the present article, we showed that a straightforward sequential C implementation of the DES key search algorithm is one and the same as a highly parallel application-specific hardware accelerator for performing the DES key search, exactly as defined in the sequential C implementation. Also note that the on-demand availability of multiple high-performance FPGA chips in the cloud that are billed by the second makes nearly unbounded resources available to the hardware designer on a reasonable budget (in fact, the flat, non-partitioned hardware accelerator shown in Figure 1 in this article is similar to a chip of unbounded size, which is then automatically partitioned so that each partition fits in a single chip). Such large resources were not easily available before for hardware designers. Also note that while a straightforward sequential code may correspond to one kind of application-specific hardware accelerator, a hardware designer with a clear understanding of how sequential code is mapped to highly parallel hardware can find ways to recode the sequential algorithm in an alternative way, e.g., to reduce the critical path or to improve hardware utilization. The resulting hardware can be a different and better application-specific hardware accelerator. The fact that a highly parallel application-specific hardware accelerator design has been shown to be one and the same as a sequential C/C++ function is, we believe, a harbinger of exciting future possibilities toward more productive hardware design.
6 RELATED WORK
In this section, we will summarize the previous work related to our research.
Modern High-Level Synthesis Tools: Even for a single application, FPGA programming for achieving high performance can be a very challenging and time-consuming process with Register Transfer Level design. For example, manually designing tuned hardware architectures for cryptographic algorithms that, like the DES algorithm, mainly consist of bit-level manipulations requires extensive Register Transfer Level design effort (finding the best scheduling, creating the optimum datapath architecture for each algorithm, designing the controller circuit, etc.) [13, 14, 15, 76, 77]. To overcome this design-productivity barrier, HLS tools have been developed [21, 28, 59, 60, 72, 92]. These tools enable hardware and software designers to generate RTL hardware architectures from a function written in the C/C++ high-level language; however, they require many manual steps to create a full system together with I/O controllers for cross-chip communication over a scalable network, specialized memory hierarchies, and specialized networks between the hardware units of the accelerator for achieving high performance.
In our approach [42], our compiler generates all the necessary hardware components at Verilog-level for a full system including I/O controllers for cross-chip communication, task networks for efficient dispatch of hardware tasks, accelerated execution units, specialized memory hierarchies, and so on. The hardware accelerator generated by our compiler contains many pipelined finite-state machines that synchronize with each other using specialized synchronization circuits, and jointly perform the function described by the sequential C/C++ code in parallel. These compute engines are interconnected via on-chip and off-chip networks with compiler-level customization for each specific application that is accelerated. As another new feature, our HLS compiler is able to pipeline outer loops with loop-carried control dependences, which has been further elaborated in the “Contributions” section earlier in this article.
Liu et al. in [67] propose ElasticFlow, a method for software pipelining of an outer loop without fully unrolling its inner loops, but were unaware of our earlier work in [42] on the same topic. To pipeline an outer loop containing an inner loop, [67] proposes, for example, in Figure 3 of [67], replacing the inner loop by:
– a distributor network for distributing inner loop tasks (this was disclosed at least as a “task network” and an “incomplete butterfly network” in column 17, lines 36–67, and Figures 7–9 of [42]),
– multiple copies of the inner loop, where the copies operate in parallel (this was disclosed at least as “hierarchical software pipelining” in Figure 15 and column 25, line 45 – column 26, line 13 of [42]), and
– a reordering collector network for collecting results of the inner loop copies and putting them back in order (this was disclosed at least in section “How to receive responses out of order” starting at column 60, line 28 of [42]).
The ElasticFlow article further suggests improving hardware utilization by having a hardware functional unit implement more than one function, e.g., in Figure 5(d) and (e) of the ElasticFlow article, named the mLPA Architecture (this was disclosed at least in Figures 58 and 59 and the section “Primitive structural transformations for sharing resources among thread units” starting on column 90, line 55 of [42]).
Furthermore, the ElasticFlow article does not propose a method for handling loop-carried dependences in an outer loop, or for scalable partitioning of a resulting large design into multiple chips.
Dai et al. [33] allow MAYBE dependences between memory instructions to be optimistically ignored (achieved by carefully coding each memory operation as an access to a different array, as appropriate, in the input C/C++ code fed into a commercial HLS compiler), and suggest constructing a customized, application-specific memory with a high number of virtual ports, which takes on the responsibility of detecting data speculation errors among memory accesses at run time. This article also implements a data speculation error recovery mechanism in cooperation with the pipelined FSM, by sending a replay signal to the pipelined FSM in the midst of pipeline execution when a data speculation error is detected, e.g., when a logically earlier store in iteration \( n \) is determined to store into the same location as a logically later load in iteration \( n+k \), \( k \gt 0 \), that was already executed with data speculation and has already loaded the wrong, stale value of the memory location. An implication of the recovery mechanism is an instant reverse execution of iterations \( n+k, n+k+1, \ldots \) in the FSM to return to exactly the cycle where the incorrect data-speculative load of iteration \( n+k \) was executed, with minimal disruption to a standard software-pipelined schedule. Iterations \( n, n+1, \ldots, n+k-1 \) can continue unimpeded. This article’s approach is more resource efficient than the earlier, general-purpose Multiscalar Architecture “Address Resolution Buffer” work [51] (for avoiding an associative search among many addresses, also see [61]), because of the application-specific customization during HLS. The article’s approach is also potentially less complex than software pipelining of a loop with conditional branches (implied by the load/store address comparisons in this problem), which can lead to a Minimum Initiation Interval that dynamically varies at run time [45].
However, a further implication of the method proposed in the article is that not only must the incorrect data-speculative load of iteration \( n+k \) be re-executed, but all already-executed operations that depended on this load must also be re-executed with their original operands (some of these operands may be corrected after re-executing the incorrect load). This further implies undoing changes to memories and registers to return to the exact cycle of the incorrect data-speculative load of iteration \( n+k \). Also, any harmful side effects that depend on an incorrect data-speculative load (e.g., sending a network packet that has a side effect like printing a check) must be prevented. Also, a second store occurring later in iteration \( n \) can later be found to overlap with a different incorrect data-speculative load that occurred even earlier in iteration \( n+k \), leading to a second replay/reverse execution of iterations \( n+k, n+k+1, \ldots \) emanating from the data speculation error detected by the second store in iteration \( n \). But this interesting article feels incomplete to the reader, in the sense that these implications and their solutions are not discussed at all, and no HLS compiler algorithm is given; only hardware solutions for particular examples are given. The squash and recovery (e.g., reverse execution) issues for handling data speculation in outer loops have not yet been addressed in Dai et al., since this article relies on a standard software-pipelined schedule produced by a commercial HLS tool, and commercial HLS tools do not currently implement outer loop pipelining without full unrolling of inner loops.
Push-button High-Level Hardware System Compilation: There are a small number of existing push-button compilation systems from a sequential C program to create an application-specific hardware in the open literature. Zhang et al. [94] present a push-button compilation from a C program into a full system hardware design on FPGA, with the help of pragma directives. Compared to the study by Zhang et al., our method (1) does not utilize any parallelization directives and relies solely on the inherent parallelism within the input sequential single-threaded code, (2) aims at creating a massively parallel multi-FPGA system with a minimum latency when sufficient resources are available, and (3) also achieves high-frequency thanks to our proposed RTL-level inter-SLR pipelined communication mechanism. Zhang et al. mainly utilize the polyhedral model to optimize the loops and propose a task-level polyhedral model. According to their example model [94, Figure 5], using pragmas, an innermost loop may be removed from the polyhedral model in order to reduce complexity and also gain freedom from the burden of meeting the requirements of using the polyhedral model. But unlike the polyhedral model, our compiler does not impose any requirement at all on the input program. In particular, using affine array subscript expressions or affine loop bounds are not required. For example, code involving linked list traversals (e.g., searching a key in a linked list of linked lists of key-value pairs as in the
But polyhedral loop transformations can potentially make radical changes in a sequential program without altering the program’s function, in the case where the user does not care about precise exceptions. Polyhedral loop transformations may in principle be performed on the input sequential code of our compiler, before parallelization with our hierarchical software pipelining begins, to provide an additional benefit. One such transformation could be the fusion of nested loops [95], which can reduce the total latency of a hierarchically software-pipelined program.
Cong et al. [26, 27] propose an automated FPGA compilation solution for an application-specific hardware design with the help of pragma directives, depending on the program to be accelerated. Cong et al. have used the Merlin compiler and have mapped the accelerators to a rack-level multi-chip system. Their Merlin compiler requires a user/digital designer to specify how the user’s C/C++ program is to be parallelized, using OpenMP-like pragma directives such as parallel and pipeline. In Cong et al. the baseline overall framework is an existing parallel software framework, namely, Apache Spark (see Figure 3 in Cong et al.). The Merlin compiler takes a user’s C/C++ code as input and produces optimized OpenCL code, which is deployed on an FPGA connected to each node within the parallel software framework, using the OpenCL implementation available on the target platform. There is no communication among the FPGAs in different nodes. Our compiler approach differs from Cong et al., since it relies not on OpenCL and not on any parallel software platform such as Spark, but on the inherent maximum parallelism in the sequential single-threaded input code, and does not require parallelization directives on the part of the user. Our compiler creates a multi-chip FPGA hardware accelerator directly from sequential code, where the FPGAs communicate and synchronize among themselves to jointly perform the acceleration.
SLR Crossing in an FPGA Chip: Another problem is that today’s HLS tools suffer from timing issues due to generated long circuit paths, especially across multiple SLRs, that prevent reaching high-frequency designs. A very recent study to address this problem is the AutoBridge system proposed by Guo et al. [53]. They propose a methodology that couples coarse-grained floorplanning with pipelining in order to meet timing closure, especially when there are long wires crossing the SLRs in the latest modern FPGAs, e.g., the FPGA devices in the AWS cloud. Their method assumes the HLS functions are written in a dataflow programming model; they divide the FPGA device into a set of regions and assign each HLS function to one region. When inter-region communication is needed, Guo et al. pipeline this communication during HLS compilation. The extension of their method to non-dataflow programs is given as a discussion, although the designs appear to depend on the dataflow programming model, whose insensitivity to message latencies, typically small-sized filter functions, and point-to-point FIFO communication channels seem essential to the success of the AutoBridge floorplanning methodology. Unlike our work, AutoBridge does not optimize its FIFOs with credit-based flow control. Our approach of placing a small auto-pipelined FIFO with credit-based flow control in each on-chip network communication channel where an SLR crossing is possible has the advantage that it is simple and works with Vivado’s existing automatic placement phase. Note that our credit-based auto-pipelined long-distance FIFO units can be used to reduce resources in any design that pervasively uses FIFO communication channels between components. Our proposed method is furthermore not restricted to any programming model, such as dataflow.
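A minimal software model of the credit-based flow control idea (all names here are illustrative, not the compiler's actual RTL): the sender starts with one credit per FIFO slot, spends a credit on each send, and regains it when the receiver pops an entry, so the FIFO can never overflow regardless of how many pipeline stages the SLR-crossing wires add:

```c
#include <stdbool.h>

/* Producer-side credit counter and consumer-side occupancy for one
 * SLR-crossing FIFO channel (hypothetical names, for illustration). */
typedef struct {
    int credits;    /* sends the producer may still issue */
    int occupancy;  /* entries currently buffered at the consumer */
} credit_link;

void link_init(credit_link *l, int fifo_depth) {
    l->credits = fifo_depth;   /* one credit per FIFO slot */
    l->occupancy = 0;
}

/* Producer: a send consumes a credit; with no credits left it stalls,
 * so the FIFO cannot overflow however long the wire pipeline is. */
bool link_send(credit_link *l) {
    if (l->credits == 0) return false;   /* back-pressure */
    l->credits--;
    l->occupancy++;
    return true;
}

/* Consumer: popping an entry returns a credit to the producer. */
bool link_recv(credit_link *l) {
    if (l->occupancy == 0) return false;
    l->occupancy--;
    l->credits++;
    return true;
}
```

Because correctness depends only on the credit count, the credit-return path itself can be pipelined arbitrarily, which is why such a link tolerates the extra register stages inserted at an SLR boundary.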
Existing Multi-FPGA Architectures: Caulfield et al. [22] present a reconfigurable cloud architecture for datacenter applications. Their proposed architecture has FPGA devices between network switches and the servers to tailor the hardware architecture to the selected workload. Their architecture has direct FPGA-to-FPGA communication for better latency. It provides (1) local compute acceleration, (2) network acceleration, and (3) global application acceleration. Caulfield et al. [22] demonstrated the performance of their Configurable Cloud architecture for Web search ranking and high-speed (network line rate) encryption workloads. According to their results, they offload encryption/decryption processes into FPGA devices to reduce the burden on the CPU cores. For example, without FPGAs, 5 or 15 cores may be required just for cryptography, depending on the encryption procedure. But when using FPGAs to accelerate the cryptography tasks, the CPU cores thus unburdened can now become free to do other work and can generate revenue.
There is a recent study [82] that accelerates database management systems (DBMSs) in the AWS F1 cloud on a single FPGA card using the Vivado HLS tool. Sun et al. [82] also discuss potential directions using multiple FPGAs in the cloud for DBMS acceleration. Jianga et al. [62] propose a method for designing real-time AI applications on a platform with a CPU and one FPGA. They also discuss a multi-FPGA design for their method as a future study.
However, previous work has not taken a full-system design perspective, including higher design productivity and multi-FPGA system compilation. Our proposed high-level synthesis compiler addresses this full-system design problem efficiently, by automatically compiling sequential code into an application-specific hardware accelerator system targeting a multi-FPGA cloud.
There are also earlier studies about multi-FPGA design, but they are mainly implemented on local FPGAs (see for instance [87]).
Previous Fastest DES Cracker Machines: In 1998, the Electronic Frontier Foundation (EFF) DES Cracker machine [50], which contains custom ASIC chips, found the 56-bit key in 56 hours. The machine contains 29 boards, where each board has 64 chips. Each chip has 24 non-pipelined search units, and each search unit completes one decryption in 16 cycles running at 40 MHz. Hence, each search unit could examine 2.5 million keys per second. The cost of the project was around $\( 250,\!000 \).
As emphasized by the EFF team, “The right way to crack DES is with special-purpose hardware. A custom-designed chip, even with a slow clock, can easily outperform even the fastest general-purpose computer” [50]. However, to get the highest performance with a specialized hardware architecture for each application, Register Transfer Level hardware design complexity is a barrier. In our study, we address this problem and demonstrate that high performance is achieved using a high-level compiler that generates a full system architecture from merely sequential code. Our present proposal to create, deploy, and terminate a virtual FPGA-based hardware accelerator in the cloud on demand, paying only during the actual use of the hardware accelerator, is also very cost effective for short-running applications as compared to an ASIC-based hardware accelerator, given the high cost of ASIC chip design and manufacturing.
But the fact that FPGA-based hardware accelerators rented by the second in the cloud are cost-effective for short-running applications, does not diminish the total cost of ownership (TCO) advantages of ASIC-based hardware accelerators in the cloud, for long-running applications [69]. Because of ASIC chip foundry features for optimizing total ASIC design cost at low and medium volume, such as multi-project wafer and multi-layer mask, and because the software investment for deploying a multi-chip hardware accelerator created by our HLS compiler is lower than normal, an ASIC-based hardware accelerator can indeed become cost-effective for important applications that must be repeatedly executed [47, 48, 90] and will achieve better performance, better power efficiency and lower total cost of ownership. ASIC-based hardware accelerators can also realize a higher performance scalable chip-to-chip network as compared to, e.g., an AWS virtual private cloud.
In 2019, Sugier [81] showed that 40 different keys per cycle can be searched with 40 pipelined parallel decoding modules (P16 version) running at 186 MHz on a single Xilinx 7S100 low-cost FPGA device. The proposed P16 architecture is approximately 150 times faster than one custom ASIC chip proposed in [50]. However, the technologies used in these two chips are different.
One of the modern DES cracking services is crack.sh [29], an online service for commercial DES cracking. It is a manually tuned hardware implementation of DES key search. crack.sh promises that the searched key will be found within at most 26 hours using 48 Virtex-6 FPGA devices. Each FPGA contains 40 fully pipelined DES cores that run at 400 MHz, meaning that the system searches \( 1,\!920 \) different keys per clock cycle.
It is interesting to note that in the crack.sh website, the designers also estimate that \( 1,\!800 \) Graphics processing unit (GPU) devices would be needed to perform the same DES key search within 26 hours with GPU computing. Note that since FPGAs are power-efficient [89], an FPGA-based design is also energy efficient as compared to a GPU-based solution running the same algorithm.
For reference, we include the performance of the sequential single-threaded Algorithm 1 on an m5.8xlarge AWS x86 CPU in the line marked “m5.8xlarge Xeon, single-threaded”. Note that current automatic parallelizers are not able to automatically convert the sequential single-threaded Algorithm 1 into an embarrassingly parallel algorithm, because Algorithm 1 has a data-dependent, conditional exit on line 19 from both of its loops. But a programmer can rewrite Algorithm 1 in an embarrassingly parallel way without the data-dependent conditional exit, although the embarrassingly parallel version is not equivalent to the single-threaded version: assuming the correct solution key is random, Algorithm 1 tries half of the candidate keys on average (performing half the work of the embarrassingly parallel version) and examines the candidate keys in strict sequential order, whereas the embarrassingly parallel version tries all candidate keys and returns all potentially correct keys. We have indeed rewritten Algorithm 1 in this embarrassingly parallel way, with nested DOALL loops, have used OpenMP to parallelize it, and have measured its performance on the same m5.8xlarge AWS machine instance (which has 32 threads or vCPUs); this appears as the line “m5.8xlarge Xeon, embarrassingly parallel”. The recoded OpenMP implementation achieved 15.8x the keys/second performance of the sequential single-threaded algorithm on the m5.8xlarge x86 machine. Note that the results in this table do not consider the “less work” advantage of our parallelized single-threaded Algorithm 1. But it can be seen that an FPGA or a GPU has a performance advantage over a CPU for the DES key search application.
GPU-based commodity computing machines are powerful for single-instruction multiple data (SIMD) type of applications since they process a high number of threads concurrently within their large number of hardware execution units. There are some studies that use GPUs for a DES cracking application. Ahmadzadeh et al. [1] propose a single instruction multiple thread architecture for DES cracking application on GPUs. Instead of conventional bit swapping during a bit permutation, entire registers are swapped in their method. By enabling register swap and shared memory implementation of the DES algorithm on GPUs, in each iteration, each thread examines a set of 32 keys. They achieved approximately \( 6.5\times 10^9 \) keys per second with only one GPU device. They also showed linear speed-up when two GPUs are used and accomplished \( 13\times 10^9 \) keys per second performance since in their model, there is no dependency between keys.
However, an individual encryption or decryption algorithm of [1] is no longer the original DES algorithm; it is a different algorithm that takes a (plaintext, ciphertext) pair and a vector of 32 56 bit candidate keys (stored in transposed bit matrix form inside 56 32 bit registers) and returns a vector of 32 results, indicating which, if any, of the 32 candidate keys were correct. The algorithm also relies on the key evaluations being independent: It does not comply with any requirement to stop and return the correct key as soon as a correct key is discovered, trying all candidate keys in strict sequential order. Thus, the starting point of [1] is a different algorithm as compared to other DES key search studies discussed here.
Table 4 shows the keys-per-second performance of different DES cracker machines in the open literature, of our x86 CPU experiments, and also of our main multi-FPGA result, automatically compiled from a high-level description. As one can see from Table 4, crack.sh [29] performs the best when keys per second is considered. However, it uses 48 chips cooperating to solve the problem, more chips than we use in our present study. All of the open-literature implementations are manually designed or tuned for their target platforms or architectures. By contrast, our method presents an automatic translation from a simple high-level C/C++ programming language description to a cloud-level multi-instance, multi-FPGA implementation of DES.
| | Year | Target platform | Number of chips | Number of search units per chip | Keys per second in total |
|---|---|---|---|---|---|
| EFF DES Cracker machine [50] | 1998 | ASIC | 64 | 24 search units | 3.8 billion |
| Sugier [81], P16 | 2019 | FPGA | 1 | 40 pipelined decrypters | 7.45 billion |
| crack.sh [29] | 2021 | FPGA | 48 | 40 fully pipelined DES cores | 768 billion |
| Ahmadzadeh et al. [1] | 2018 | GPU | 4 | 128 threads | 26 billion |
| m5.8xlarge Xeon, single-threaded (sequential software experiment) | 2021 | CPU | 1 | 32 threads (1 used) | 0.0005706 billion |
| m5.8xlarge Xeon, embarrassingly parallel (OpenMP parallel software experiment) | 2021 | CPU | 1 | 32 threads (all used) | 0.009015 billion |
| This work (automatic multi-FPGA compilation from sequential description) | 2021 | FPGA | 8 | 16 inner loop search engines | 25.6 billion |
Table 4. Candidate Keys Per Second Performance of DES Cracker Machines
In Table 4, we report two CPU-based experimental results for the DES key search application. The first is a single-threaded implementation; we include it to highlight the performance gain, since our compiler accepts the same sequential C description. We also provide a multi-core implementation of DES targeting the 32-core m5.8xlarge AWS machine instance, in order to utilize all available processor cores. The hardware system our compiler automatically generates for AWS FPGA instances performs far better than both the single-core and the multi-core CPU implementations.
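A quick back-of-the-envelope check of the Table 4 figures. The constants below are copied from the table; the 200 MHz per-engine rate in the last assertion is not stated in the table but is the arithmetic implication of 25.6 billion keys/s across 8 chips of 16 engines each, under the assumption that every engine retires one key per cycle.

```cpp
// Sanity-checking the speedups implied by Table 4 (all rates in keys/second).
#include <cassert>

const double kSeqCpu = 0.0005706e9;  // m5.8xlarge Xeon, single-threaded
const double kParCpu = 0.009015e9;   // m5.8xlarge Xeon, 32 cores (OpenMP)
const double kFpga8  = 25.6e9;       // this work: 8 FPGAs x 16 engines each

// Ratio of a faster rate to a slower one.
double speedup(double fast, double slow) { return fast / slow; }
```

The first ratio (about 44,860x over the single-threaded CPU) is consistent with the roughly 44,600x figure quoted in the abstract; against the 32-core OpenMP run the accelerator is still roughly 2,800x faster.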
The current literature on DES cracking mostly optimizes the workload manually for the target platform, whereas our method solves most of the low-level implementation issues automatically. The DES algorithm is just an example for our proposed multi-FPGA compiler framework. Our work targets an FPGA platform, but it can easily be configured for other hardware platforms. The main difference between our work and the literature is that our target hardware accelerator is automatically compiled from a sequential, non-optimized C/C++ program. A second difference is that our implementation, being a parallelization of the single-threaded Algorithm 1, does less work on average than an embarrassingly parallel implementation.
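The "less work on average" claim can be made concrete with a small model. The two functions below are our own illustration, not the paper's implementation: one counts the keys examined by a parallelization that preserves the sequential early exit, the other the keys examined by an embarrassingly parallel sweep with no termination signal.

```cpp
// Work model: keys examined by early-exit vs. exhaustive parallel search.
#include <cassert>
#include <cstdint>

// A parallelization that preserves the sequential program's early-exit
// semantics halts all engines once the key at position `hit` is found,
// so keys 0..hit are tried: hit+1 keys, or (N+1)/2 on average when the
// correct key is uniformly distributed over a space of N keys.
uint64_t work_early_exit(uint64_t hit) { return hit + 1; }

// An embarrassingly parallel search whose workers each exhaust a static
// partition with no stop signal always sweeps the whole space of N keys.
uint64_t work_no_termination(uint64_t space) { return space; }
```

Averaged over all possible key positions, the early-exit parallelization thus does about half the work of the exhaustive sweep, while remaining functionally identical to the sequential program.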
7 CONCLUSION
We presented an application-specific, high-performance approach for multi-FPGA accelerator system design starting from sequential code. We implemented, tested, and verified our push-button system design model on FPGA-based AWS EC2 F1 instances and demonstrated its viability.
We believe that at least one surprising and non-obvious feature of this work is that an entire highly parallel FPGA hardware accelerator system has been fully described by ordinary sequential code, without any parallelization directives. We believe that sequential code can offer a more productive way to design future multi-chip FPGA-based or ASIC-based application-specific hardware accelerator systems.
APPENDIX
A AUTOMATED DEPLOYMENT AND TERMINATION OF FPGA-BASED HARDWARE ACCELERATORS IN THE CLOUD
Developing and deploying even a single FPGA accelerator in the AWS cloud currently takes a long time and requires many manual steps, each of which can go wrong during development or deployment. Thus, it is essential to reliably automate the deployment and termination of a multi-chip FPGA accelerator in the cloud. In addition to our HLS compiler, we have created the following commands, which use the AWS-CLI primitives under the covers and can significantly increase a user's productivity in deploying multi-chip FPGA accelerators:
– …: This command creates an AWS virtual private cloud complete with two subnets (the first subnet intended for normal access, including …
– …: This command destroys a previously created virtual private cloud for running FPGA-based hardware accelerators. No instances should be running on the virtual private cloud.
– …: This command allocates second network interfaces for …
– …: All running machines of this FPGA cluster are terminated.
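As a hedged sketch of the kind of lowering such commands perform: the `aws ec2` subcommands below are real AWS-CLI primitives, but the wrapper function names and the `$VPC_ID` placeholder are our own illustration, not the paper's actual scripts.

```cpp
// Illustrative lowering of high-level cluster commands onto AWS-CLI primitives.
#include <cassert>
#include <string>
#include <vector>

// Create a VPC with two subnets (e.g., one for normal access and one for
// the accelerator's private FPGA-to-FPGA traffic).
std::vector<std::string> create_fpga_vpc(const std::string& cidr,
                                         const std::string& subnet_a,
                                         const std::string& subnet_b) {
    return {
        "aws ec2 create-vpc --cidr-block " + cidr,
        "aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block " + subnet_a,
        "aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block " + subnet_b,
    };
}

// Tear the cluster down: terminate every instance, then delete the VPC.
std::vector<std::string> destroy_fpga_cluster(
        const std::vector<std::string>& instance_ids) {
    std::string cmd = "aws ec2 terminate-instances --instance-ids";
    for (const auto& id : instance_ids) cmd += " " + id;
    return {cmd, "aws ec2 delete-vpc --vpc-id $VPC_ID"};
}
```

In practice, each wrapper would also parse the JSON output of one primitive (for example, the VPC ID returned by `create-vpc`) to fill in the arguments of the next.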
It would have been better to deploy an FPGA image on an FPGA hardware resource instantly on demand, at the point where the application actually starts using the FPGA image, and to terminate the deployment as soon as the FPGA resource running the image is perceived to be idle and/or must be preempted by a higher-priority hardware task (e.g., see the work on an all-hardware parallel hypervisor for efficient on-demand deployment of multi-chip accelerators within future FPGA and ASIC clouds in [41]). But as of today, initialization of AWS EC2 F1 instances takes minutes and is not quick enough for an on-demand launch. This is why we settled on the approach described above.
REFERENCES

- [1] 2018. A high-performance and energy-efficient exhaustive key search approach via GPU on DES-like cryptosystems. The Journal of Supercomputing 74, 1 (2018), 160–182.
- [2] 2021. Deep Dive into Alibaba Cloud F3 FPGA as a Service Instances. Retrieved from https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057. Accessed: 2021-04-19.
- [3] 2021. FPGA-accelerated compute optimized instance family. Retrieved from https://www.alibabacloud.com/help/doc-detail/108504.htm. Accessed: 2021-04-19.
- [4] 1995. Software pipelining. ACM Computing Surveys 27, 3 (1995), 367–432.
- [5] 1983. Dependence Analysis for Subscripted Variables and Its Application to Program Transformations. Ph.D. Dissertation. Rice University.
- [6] 1987. Automatic translation of FORTRAN programs to vector form. ACM Transactions on Programming Languages 9, 4 (Oct. 1987), 491–542.
- [7] 1987. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems 9, 4 (1987), 491–542.
- [8] 2021. Amazon EC2 Placement Groups. Retrieved from https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html. Accessed: 2021-04-11.
- [9] 2021. AWS EC2 FPGA Development Kit. Retrieved from https://github.com/aws/aws-fpga. Accessed: 2022-06-28.
- [10] 2021. Amazon EC2 F1 Instances. Retrieved from https://aws.amazon.com/ec2/instance-types/f1/. Accessed: 2021-04-19.
- [11] 2021. FPGA Development - AWS Developer Forums. Retrieved from https://forums.aws.amazon.com/forum.jspa?forumID=243&start=0. Accessed: 2021-04-19.
- [12] 1978. ID Report: An Asynchronous Programming Language and Computing Machine. Technical Report 114, University of California at Irvine, Computer Science Department, May 1978.
- [13] 2013. Compact hardware implementations of ChaCha, BLAKE, Threefish, and Skein on FPGA. IEEE Transactions on Circuits and Systems I: Regular Papers 61, 2 (2013), 485–498.
- [14] 2017. A low-area unified hardware architecture for the AES and the cryptographic hash function Grøstl. Journal of Parallel and Distributed Computing 106 (2017), 106–120.
- [15] 2012. Compact implementation of Threefish and Skein on FPGA. In Proceedings of the 2012 5th International Conference on New Technologies, Mobility and Security. IEEE, Istanbul, Turkey, 1–5.
- [16] 2003. A brief history of just-in-time. ACM Computing Surveys 35, 2 (June 2003), 97–113.
- [17] 1978. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Communications of the ACM 21, 8 (1978), 613–641.
- [18] 2018. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Transactions on Reconfigurable Technology and Systems 11, 3 (2018), 1–23.
- [19] 2007. A 30 year retrospective on Dennard's MOSFET scaling paper. IEEE Solid-State Circuits Society Newsletter 12, 1 (Winter 2007), 11–13.
- [20] 1998. Loop parallelization algorithms: From parallelism extraction to code generation. Parallel Computing 24, 3–4 (1998), 421–444.
- [21] 2011. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. Association for Computing Machinery, New York, NY, 33–36.
- [22] 2016. A cloud-scale acceleration architecture. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, Taipei, Taiwan, 1–13.
- [23] 2018. Accelerating memcached on AWS cloud FPGAs. In Proceedings of the 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies. Association for Computing Machinery, New York, NY, 1–8.
- [24] 2019. In-depth analysis on microarchitectures of modern heterogeneous CPU-FPGA platforms. ACM Transactions on Reconfigurable Technology and Systems 12, 1, Article 4 (Feb. 2019), 20 pages.
- [25] 1994. The MPI message passing interface standard. In Programming Environments for Massively Parallel Distributed Systems. Birkhäuser Basel, Basel, 213–218.
- [26] 2016. Software infrastructure for enabling FPGA-based accelerations in data centers: Invited paper. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design. Association for Computing Machinery, New York, NY, 154–155.
- [27] 2016. Invited - Heterogeneous datacenters: Options and opportunities. In Proceedings of the 53rd Annual Design Automation Conference. Association for Computing Machinery, New York, NY, Article 16, 6 pages.
- [28] 2011. High-level synthesis for FPGAs: From prototyping to deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30, 4 (2011), 473–491.
- [29] 2021. The World's Fastest DES Cracker. Retrieved from https://crack.sh/. Accessed: 2021-04-20.
- [30] 1985. Useful parallelism in a multiprocessing environment. IBM Thomas J. Watson Research Division.
- [31] 1986. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. IEEE, 836–844.
- [32] 1998. OpenMP: An industry standard API for shared-memory programming. IEEE Computational Science and Engineering 5, 1 (1998), 46–55.
- [33] 2017. Dynamic hazard resolution for pipelining irregular loops in high-level synthesis. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, 189–194.
- [34] 2018. A new golden age in computer architecture: Empowering the machine-learning revolution. IEEE Micro 38, 2 (Mar. 2018), 21–29.
- [35] 1974. Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits 9, 5 (1974), 256–268.
- [36] 1968. Programming generality, parallelism and computer architecture. In Proceedings of the Information Processing, IFIP Congress 1968.
- [37] 1972. On the Design and Specification of a Common Base Language. Technical Report. Massachusetts Institute of Technology, Project MAC, 47–74 pages.
- [38] 1974. First version of a data flow procedure language. In Proceedings of the Programming Symposium. Springer Berlin Heidelberg, Berlin, 362–376.
- [39] 1980. Data flow supercomputers. IEEE Computer 13, 11 (1980), 48–56.
- [40] 2017. The role of CAD frameworks in heterogeneous FPGA-based cloud systems. In Proceedings of the 2017 IEEE International Conference on Computer Design. IEEE, Boston, MA, 423–426.
- [41] 2021. Cloud building block chip for creating FPGA and ASIC clouds. ACM Transactions on Reconfigurable Technology and Systems 15, 2, Article 14 (Dec. 2021), 35 pages.
- [42] 2015. Method and system for converting a single-threaded software program into an application-specific supercomputer. US Patent 8,966,457, filed on 15 November 2011. Retrieved from https://patents.google.com/patent/US8966457B2.
- [43] 1987. A compilation technique for software pipelining of loops with conditional jumps. In Proceedings of the 20th Annual Workshop on Microprogramming. Association for Computing Machinery, New York, NY, 69–79.
- [44] 1999. Optimizations and oracle parallelism with dynamic translation. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 284–295.
- [45] 1990. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. In Proceedings of the Selected Papers of the Second Workshop on Languages and Compilers for Parallel Computing. Pitman Publishing, Inc., 213–229.
- [46] 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture. Association for Computing Machinery, New York, NY, 365–376.
- [47] 2021. Multi Layer Mask. Retrieved from https://europractice-ic.com/mpw-prototyping/general/mlm/. Accessed: 2021-04-28.
- [48] 2021. Multi Project Wafer (MPW). Retrieved from https://europractice-ic.com/mpw-prototyping/general/mpw-minisic/. Accessed: 2021-04-28.
- [49] 2011. Toward dark silicon in servers. IEEE Micro 31, 4 (July 2011), 6–15.
- [50] 1998. Cracking DES: Secrets of Encryption Research, Wiretap Politics and Chip Design. O'Reilly & Associates, Inc.
- [51] 1996. ARB: A hardware mechanism for dynamic reordering of memory references. IEEE Transactions on Computers 45, 5 (1996), 552–571.
- [52] 2007. UDT: UDP-based data transfer for high-speed wide area networks. Computer Networks 51, 7 (2007), 1777–1799.
- [53] 2021. AutoBridge: Coupling coarse-grained floorplanning and pipelining for high-frequency HLS design on multi-die FPGAs. Association for Computing Machinery, New York, NY, 81–92.
- [54] 2002. Reliable Blast UDP: Predictable high performance bulk data transfer. In Proceedings of the IEEE International Conference on Cluster Computing. IEEE, Chicago, IL, 317–324.
- [55] 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing. Association for Computing Machinery, New York, NY, 311–320.
- [56] 2019. Garbled circuits in the cloud using FPGA enabled nodes. In Proceedings of the 2019 IEEE High Performance Extreme Computing Conference. IEEE, Waltham, MA, 1–6.
- [57] 2021. cloudFPGA: Field programmable gate arrays for the cloud. Retrieved from https://www.zurich.ibm.com/cci/cloudFPGA/. Accessed: 2021-04-19.
- [58] 2021. Data Plane Development Kit. Retrieved from http://dpdk.org. Accessed: 2021-04-04.
- [59] 2021. Intel FPGA SDK for OpenCL. Retrieved from https://www.intel.com/content/www/us/en/software/programmable/sdk-for-opencl/overview.html. Accessed: 2021-04-25.
- [60] 2021. Intel High Level Synthesis Compiler. Retrieved from https://www.intel.com.tr/content/www/tr/tr/software/programmable/quartus-prime/hls-compiler.html. Accessed: 2021-04-25.
- [61] 1998. Method and apparatus for reordering memory operations in a processor. US Patent 5,758,051, filed on 6 November 1996. Retrieved from https://patents.google.com/patent/US5758051.
- [62] 2020. Optimized co-scheduling of mixed-precision neural network accelerator for real-time multitasking applications. Journal of Systems Architecture 110 (2020), 101775.
- [63] 1994. Credit-based flow control for ATM networks: Credit update protocol, adaptive credit allocation and statistical multiplexing. In Proceedings of the Conference on Communications Architectures, Protocols and Applications. Association for Computing Machinery, New York, NY, 101–114.
- [64] 1989. Software pipelining. In Proceedings of the Systolic Array Optimizing Compiler. Springer, 83–124.
- [65] 2017. The QUIC transport protocol: Design and internet-scale deployment. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication. Association for Computing Machinery, New York, NY, 183–196.
- [66] 2015. Whole Program Paths (slides). Retrieved from https://pdfs.semanticscholar.org/6328/1ffa177c6d88841ddc6e01d1b0a74ea853e0.pdf. Accessed: 2021-08-11.
- [67] 2017. Architecture and synthesis for area-efficient pipelining of irregular loop nests. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 36, 11 (2017), 1817–1830.
- [68] 2017. Polyhedral-based dynamic loop pipelining for high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 9 (2017), 1802–1815.
- [69] 2016. ASIC clouds: Specializing the datacenter. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture. ACM/IEEE, Seoul, Korea (South), 178–190.
- [70] 2016. Evaluation of reliable UDP-based transport protocols for internet of things (IoT). In Proceedings of the 2016 IEEE Symposium on Computer Applications & Industrial Electronics. IEEE, Penang, Malaysia, 200–205.
- [71] 1997. Parallelizing nonnumerical code with selective scheduling and software pipelining. ACM Transactions on Programming Languages and Systems 19, 6 (1997), 853–898.
- [72] 2015. A survey and evaluation of FPGA high-level synthesis tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 10 (2015), 1591–1604.
- [73] 2021. NVIDIA CUDA platform. Retrieved from https://developer.nvidia.com/cuda-zone. Accessed: 2021-04-29.
- [74] 1999. FIPS PUB 46-3, Data Encryption Standard (DES). Retrieved from https://csrc.nist.gov/csrc/media/publications/fips/46/3/archive/1999-10-25/documents/fips46-3.pdf. Accessed: 2021-04-30.
- [75] 1994. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture. Association for Computing Machinery, New York, NY, 63–74.
- [76] 2012. Compact Keccak hardware architecture for data integrity and authentication on FPGAs. Information Security Journal: A Global Perspective 21, 5 (2012), 231–242.
- [77] 2014. Improving the computational efficiency of modular operations for embedded systems. Journal of Systems Architecture 60, 5 (2014), 440–451.
- [78] 1997. Moore's law: Past, present and future. IEEE Spectrum 34, 6 (1997), 52–59.
- [79] 1991. On optimal parallelization of arbitrary loops. Journal of Parallel and Distributed Computing 11, 2 (1991), 130–134.
- [80] 2015. Efficient GPU synchronization without scopes: Saying no to complex consistency models. In Proceedings of the 48th International Symposium on Microarchitecture. Association for Computing Machinery, New York, NY, 647–659.
- [81] 2019. Cracking the DES cipher with cost-optimized FPGA devices. In Proceedings of the International Conference on Dependability and Complex Systems. Springer, Cham, 478–487.
- [82] 2021. Accelerating data filtering for database using FPGA. Journal of Systems Architecture 114 (2021), 101908.
- [83] 1992. On the limits of program parallelism and its smoothability. In Proceedings of the 25th Annual International Symposium on Microarchitecture. IEEE Computer Society Press, Washington, DC, 10–19.
- [84] 2019. Unrolling ternary neural networks. ACM Transactions on Reconfigurable Technology and Systems 12, 4, Article 22 (Oct. 2019), 23 pages.
- [85] 2020. HEAWS: An accelerator for homomorphic encryption on the Amazon AWS FPGA. IEEE Transactions on Computers 69, 8 (2020), 1185–1196.
- [86] 1984. Reliable Data Protocol. Technical Report RFC-908, BBN Communications Corporation.
- [87] 2010. Dynamically reconfigurable dataflow architecture for high-performance digital signal processing. Journal of Systems Architecture 56, 11 (2010), 561–576.
- [88] 1993. Decomposed software pipelining: A new approach to exploit instruction level parallelism for loop programs. In Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism. North-Holland Publishing Co., NLD, 3–14.
- [89] 2017. Energy efficient scientific computing on FPGAs using OpenCL. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, 247–256.
- [90] 2005. Multiple project wafers for medium-volume IC production. In Proceedings of the 2005 IEEE International Symposium on Circuits and Systems. IEEE, Kobe, Japan, 4725–4728.
- [91] 2020. Vivado Design Suite User Guide: High-Level Synthesis (UG902) (v2020.1 ed.). Xilinx. Retrieved from https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug902-vivado-high-level-synthesis.pdf.
- [92] 2021. Vitis High-Level Synthesis. Retrieved from http://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html. Accessed: 2021-04-25.
- [93] 2021. Vivado Design Suite User Guide: Implementation (UG904) (v2020.2 ed.). Xilinx. Retrieved from https://www.xilinx.com/content/dam/xilinx/support/documentation/sw_manuals/xilinx2020_2/ug904-vivado-implementation.pdf.
- [94] 2015. CMOST: A system-level FPGA compilation framework. In Proceedings of the 2015 52nd ACM/EDAC/IEEE Design Automation Conference. IEEE, San Francisco, CA, 1–6.
- [95] 2020. Optimizing the memory hierarchy by compositing automatic transformations on computations and data. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, Athens, Greece, 427–441.
- [96] 2007. Synchronization state buffer: Supporting efficient fine-grain synchronization on many-core architectures. SIGARCH Computer Architecture News 35, 2 (June 2007), 35–45.