Abstract
AIgean, pronounced like the sea, is an open framework to build and deploy machine learning (ML) algorithms on a heterogeneous cluster of devices (CPUs and FPGAs). We leverage two open source projects: Galapagos, for multi-FPGA deployment, and hls4ml, for generating hardware IP cores from trained neural network models.
1 INTRODUCTION
Using FPGAs for computing at scale has become increasingly desirable because of the need for higher performance and reduced power. The flagship example of this is the Microsoft Catapult project that has led to an FPGA being deployed in every Microsoft server [1]. FPGAs at Microsoft are used for search engine acceleration, in a machine learning (ML) framework for applications within the data center as well as for many network and packet-processing tasks.
A distinguishing feature of the Catapult architecture is that the FPGAs can directly communicate with other FPGAs and CPUs as peers on the network versus the more common accelerator model for FPGAs where the FPGAs are attached to a CPU and only accessible through the CPU. The peer model is more efficient for applications that are large enough to span multiple FPGAs requiring low-latency communication between the FPGAs. Although Microsoft has shown significant success in scaling up and using multiple FPGAs in a single application, such as Project Brainwave [2, 3] used for real-time AI, there is no public description of how the applications are built and deployed to the FPGAs, and the platform and tools are not available for others to build their own applications. There is also no known equivalent open source platform available where someone can build their own version of Brainwave.
Brainwave has shown how useful multi-FPGA implementations can be, as they keep all the model weights in on-chip memory rather than accessing off-chip DRAM. A framework for building custom circuits like the one in Brainwave will allow users to create their own networks, which at the moment is quite difficult due to the lack of abstractions within FPGA systems. Beyond access to on-chip memory, an effectively unlimited fabric made available through a multi-FPGA framework would let us unroll all our computations completely, or to any desired degree. This enables the construction of very low latency, high throughput networks that can run at batch 1. In this work, it is our hope to provide the abstraction of an infinitely large FPGA fabric by hiding the difficulties of working with network-connected FPGAs. This opens up a broad range of possible applications where low-latency, large-scale AI inference is needed to process information in real time. Examples include systems control, web search, real-time physics applications, and medical image processing.
For this work, we define a cluster of network-connected FPGAs, (i.e., all FPGAs have direct connections to the network) as a multi-FPGA cluster. By this definition, Brainwave is a multi-FPGA application.
The focus of this article is to describe how we created AIgean, which is an open source platform that can be used to build multi-FPGA ML applications on multi-FPGA clusters. AIgean provides the user with multiple layers of abstraction. The user can use AIgean as a black box that takes neural net descriptions as inputs and get an output of programmed FPGAs. Our black box is responsible for the creation of IP cores, communication protocols, partitioning the neural net across multiple devices, and finally generating the final bitstreams of all FPGAs. Our focus with AIgean is ease of use as well as modularity. However, the parts of the black box are implemented as a stack of abstraction layers and can be further customized by users who are experts in the various layers. This stack is built in modular pieces, which also allows for alternative implementations at each layer. In particular, a user can modify the architecture of a particular convolution layer, and implement the layer in hardware on an FPGA or in software on a processor. The communications layer can be modified to use different protocols, such as UDP, TCP/IP, layer 2 Ethernet, PCIe, parallel buses between devices, or any custom protocol. A change at any of the layers of AIgean does not affect any of the other layers, especially the application layer at the top of the stack. This provides portability between platforms, particularly across different types of FPGAs.
We started with two open source projects: hls4ml, which converts trained neural networks from common ML frameworks into HLS-based IP cores, and Galapagos, which deploys and connects computation kernels across clusters of network-connected FPGAs and CPUs.
Although conceptually AIgean is a combination of two existing platforms, a significant effort was required to integrate the two platforms. Initially, the idea seemed straightforward, but when considering the details, much more is required.
An important contribution of this work is to describe that effort and more generally show the challenges of building multi-FPGA application frameworks that can be customizable and portable across multiple kinds of FPGA hardware. The main outcome is an ML platform that enables ML practitioners to use familiar tools and map them to a multi-FPGA cluster without needing to do any hardware design. We contrast AIgean with the current vendor ML flows [7, 8] that only target a few FPGAs hosted in a single server and lack the ability to scale easily.
Our contributions in this work are as follows:
(1) A fully push-button flow to take an ML network input from popular ML tools and deploy the network to a multi-FPGA/CPU back-end. Abstracted away from the user is the creation of the hardware IP cores for the given ML network, the partitioning of these IP cores, and the connecting and routing between them. Some of the core functionality was already handled by the existing hls4ml and Galapagos projects, which we extend and integrate.
(2) Modifications to hls4ml to generate the large, streaming hardware IP cores needed for each layer of the network (described in Section 4.2).
(3) Galapagos modifications to add the partitioning layer of the stack. This layer decides how many FPGAs are required and where to place the IP cores generated by our modified hls4ml flow.
(4) A framework that allows for incremental development and deployment of an ML application because we can seamlessly integrate hardware and software IP cores. For example, the first step to deploying an ML network is to do it entirely in software targeting a multi-CPU back-end. By simply changing a configuration file, layers of the network can be incrementally switched from running in software to running on FPGAs. Eventually, all layers can be targeted for FPGAs, or the user may choose to run with a heterogeneous implementation where some layers are in software and some are in hardware.
(5) A large use case of ResNet-50 deployed with two configurations, one with 10 FPGAs and the other with 12 FPGAs. Changing between these implementations is done by changing only a few lines of a configuration file.
(6) A fully integrated hardware and application layer stack that starts with FPGA shells, the layer in the FPGA that abstracts the application logic from the specifics of each FPGA board; the hardware middleware layer that deals with the connectivity between IP cores; a communications layer that implements the desired networking protocol between IP cores instantiated on different CPUs or FPGAs; and an application layer that takes ML networks as input and generates the required IP cores. These carefully defined abstraction layers provide an excellent research platform for experts at each layer to tinker and make each layer better. AIgean is available as open source to enable further research at all the layers of its stack and can be downloaded at https://github.com/UofT-HPRC/AIgean.
In Section 2, we describe related work, followed by Section 3, where we provide an overview of hls4ml and the AIgean stack. Section 4 describes the AIgean implementation and tool flow, and Section 5 presents our results. We discuss future work in Section 6 and conclude in Section 7.
2 RELATED WORK
We describe AIgean as a platform that can be used to build heterogeneous ML implementations with a particular focus on using FPGAs and CPUs. As a platform, AIgean spans the full computing stack from the hardware to the tools used to create the inputs to AIgean. We have built AIgean with the goal of making it flexible and modifiable at all levels of the stack to enable research and continued improvement. With this view, we present the related work according to our model of the full ML computing stack. We first describe the model and then present the related work as it fits within our model.
2.1 The ML Computing Stack
The ML computing stack is shown in Figure 1. At the top of the stack, we have a wide range of Applications & Algorithms, many of these applications having strict performance constraints. At the Cluster Deployment & Communication layer of the stack, we have petabytes of data being transferred, and at the Hardware layer, we have many mathematical operations (typically matrix/vector multiplications) implemented on a computing substrate ranging from programmable processors to custom hardware. Each layer of this stack provides its challenges. For example, at the Applications layer, the user has to decide which error rates are acceptable for their given application. At the Communication layer, the user has to decide how they will connect their devices (consisting of computing devices as well as sensors gathering data). Finally, at the low-level Hardware layer, the user may want to make optimizations on bit-level operations for their given application or define different levels of parallelism. There are opportunities for research at all levels of this stack.
Fig. 1. Abstraction stack for common ML frameworks.
2.2 Software ML Frameworks
We define software frameworks as those that mainly target CPUs and GPUs that are programmed via software. Leading software ML frameworks include TensorFlow [9], Torch [10], and Caffe [11]. They provide the users with libraries in various programming languages (e.g., Python, C++) to describe their ML applications. These frameworks then compile the applications into a series of instructions to be executed. Furthermore, they offer an interface to create custom layers that can be compiled into instructions to run on different back-end devices. Finally, they also support connectivity across multiple devices. For example, TensorFlow provides an API [12] to run on distributed clusters, where the communication between different devices (CPUs, GPUs, and TPUs [13]) is either through the CPU network link, through NVLink [14] (i.e., a proprietary link between certain NVIDIA GPUs), or via a direct network link. NVIDIA also provides the NVIDIA Collective Communication Library [15], which implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. This enables scaling GPU computations across large numbers of GPUs available on a network and is supported by several popular deep learning frameworks. Note that in the current implementation of the NVIDIA Collective Communication Library, the GPUs do not have direct connections to the network, unlike what we are able to do with FPGAs. The GPUs connect to the server’s network interface through PCIe. We expect that with the acquisition of Mellanox by NVIDIA [16], GPUs will soon also be able to access the network directly and bypass the need for a PCIe transfer.
These frameworks have a high level of customization at the application level. They also allow the users to input custom instructions, but the underlying hardware circuitry is limited to CPU, GPU, and TPU computation and cannot be modified.
2.3 FPGA Overlay Frameworks
Frameworks such as the Xilinx ML Suite and the Intel Deep Learning Accelerator (DLA) provide overlays implemented on FPGAs [17, 18]. An overlay is essentially a programmable engine implemented on the FPGA, which has a limited level of customization. These suites are integrated into existing OpenCL integrated development environments (IDEs): Xilinx SDAccel [19] and Intel OpenCL [20].
The OpenCL IDEs use HLS to improve the accessibility of the design of accelerators for FPGAs over traditional approaches based on VHDL or Verilog, which are time consuming and unfamiliar to most ML experts. In addition, these suites both provide libraries for Direct Memory Access (DMA), buffers, and communication channels, and abstract the underlying hardware, such as device drivers, PCIe link, interconnect, and accelerator placement [21]. These frameworks are similar to the Software ML Frameworks except they add the capability to customize the processor by tuning the overlay architecture on the FPGA.
Nurvitadhi et al. [22] describe a platform that supports multiple PCIe-connected FPGAs in a single server. They build a software stack on top of the Intel OPAE [23] and tightly couple operations on the CPU with operations on the FPGA. Their goal is to implement low-latency neural machine translation and do this by keeping the model in on-chip memories to avoid slower off-chip memory accesses. The ability to leverage multiple FPGAs makes this feasible. This work shows how to leverage the communication layer implemented with PCIe to target multiple FPGA overlays, but their scalability is limited by the number of boards available in one node.
These frameworks allow application developers to seamlessly deploy ML applications on FPGAs thanks to mature software and hardware development environments. However, the high level of abstraction through overlays, while minimizing FPGA design time, reduces the user's control of the generated hardware. With respect to the ML stack we define in Figure 1, these overlay frameworks allow some flexibility in the algorithms and limited flexibility in the hardware. Depending on the hooks available, a user can implement different supported layers to make their own customizations. The hardware flexibility is quite limited as an IP core is already generated. Some frameworks allow the user to modify the IP core through parameters, but this is generally limited in flexibility.
2.4 FPGA ML Core Generators
In this category of work, the focus is at the hardware level of our ML stack where the goal is to make it easier to generate cores for ML computations. These cores must then be integrated into a system that provides the full ML computing stack. Here, we present open source tools1 that generate ML accelerators as third-party IPs to be integrated into FPGA projects.
CHaiDNN [24] is an ML library for the acceleration of deep neural networks on Xilinx UltraScale MPSoCs. The library provides a subset of ML operators to be synthesized with Vivado HLS and uses 6/8-bit integer arithmetic. Pynq DL [25] provides only a configurable IP for the convolution on Xilinx Zynq SoCs. FINN [26] is a framework for the implementation of binary neural networks that use a dataflow architecture. PipeCNN [27] is an OpenCL-based FPGA accelerator for large-scale CNNs and uses pipelined functional kernels to achieve improved throughput in inference computation. The design is scalable both in performance and hardware resources, and thus can be deployed on a variety of FPGA platforms. HLSLibs [28] is a set of libraries implemented in standard C++ for bit-accurate HLS design. Many of the library operators (e.g., MatMult, SoftMax, Sigmoid) can be easily integrated into the design of ML accelerators. Recently, CNN implementations similar to the design in this article have been produced for low bit precision CNNs [29] and for sparse CNNs [30]. These solutions provide more flexibility as the developer, in some cases, can modify the generated cores, as well as integrate additional circuitry around the provided IP cores. However, this design flow is only accessible to those with hardware design knowledge.
With respect to the ML stack, ML core generators provide full flexibility of the hardware and supported algorithms. However, they provide very little when it comes to support for integrating systems at a much larger scale, as it is the user’s responsibility to integrate the generated cores into their larger design.
2.5 ML Computing on Multi-FPGA Clusters
In a multi-FPGA cluster, all FPGAs are network connected, and Brainwave [2, 3] built on top of Microsoft’s Catapult network-connected FPGA framework [1] is the most successful and well known. Each FPGA contains a customizable overlay. The focus of Brainwave is to minimize latency. Thus, the entire processing only uses on-chip memory and resources, and the neural network is partitioned across multiple FPGAs accordingly. The links between the network-connected FPGAs use Catapult’s 40-Gb/s custom Lightweight-Transport-Layer, a lightweight reliability layer on top of a communication protocol similar to that of UDP. When characterizing Brainwave using the stack defined in Figure 1, it can be observed that Brainwave also provides a flexible Application layer as multiple types of neural networks are supported. Brainwave is limited to the Lightweight-Transport-Layer for cluster communication between FPGAs, but this is still an improvement over frameworks that force all accelerator communication through a CPU. Finally, Brainwave provides some flexibility at synthesis time to customize precision, vector size, number of data lanes, and the size of the matrix-vector tile engine. These works allow users to scale a large ML framework across multiple nodes, providing the cluster deployment layer in the ML stack. These works also support a number of layers allowing for the user to customize their algorithm. However due to the scale, there is little hardware flexibility, as the parameterization happens at the node level as opposed to the level of the IP core.
2.6 Where AIgean Fits
Although AIgean can be used with a single FPGA, it best fits the category of Section 2.5, or ML Computing on Multi-FPGA Clusters, and has a similar goal as Brainwave. Both platforms use network-connected FPGAs in a peer-to-peer configuration. Brainwave uses a programmable overlay that has some parameterization that can be invoked at the time the overlay is synthesized and can implement many different ML networks depending on the program that is loaded. AIgean synthesizes custom hardware cores and implements each ML network directly in hardware, so changing an ML network will take much longer than recompiling the program for an overlay. With AIgean, there is the ability for researchers to experiment at the hardware implementation layer with hls4ml, and at the communication and cluster deployment layers with Galapagos.
3 BACKGROUND
The goal of AIgean is to provide a scalable platform for implementing ML applications using multiple FPGAs. We use hls4ml to generate hardware IP cores for the layers of a neural network and Galapagos to connect and deploy those cores across a cluster of network-connected FPGAs and CPUs. This section provides background on both projects and on the resulting AIgean stack.
3.1 Hls4ml
We need a way to implement hardware ML-inference cores that can take specifications from common ML frameworks. We choose hls4ml, a tool that converts trained models from frameworks such as Keras and PyTorch into HLS code for FPGA implementation.
At the start of the AIgean development, hls4ml targeted only relatively small networks on a single FPGA, so a number of extensions were needed to scale it to networks the size of ResNet-50.
An ML designer prepares a neural network for a specific task, such as image classification, in Keras or PyTorch. After an iterative training phase that ends when the target accuracy/error goals are met, the ML designer releases a final model to be deployed for inference. The model is usually described as two files in standard formats: a JSON file for the model architecture, and an HDF5 file for the model weights and biases. These are the inputs for hls4ml.
The hardware designer faces the challenge of creating an optimal FPGA implementation from the given ML model. The hls4ml flow is designed to automate much of this step.
The conversion from deep learning model to HLS-based software is done by constructing a custom intermediate network representation that is amenable to low-latency design. From this intermediate representation, HLS code is generated with design guidelines specified in a configuration file. Optimized HLS implementations of neural network layers are generated, with the optimization dependent on specified configuration parameters. The code is thoroughly modular, and most optimizations can be tuned after the HLS code generation.
The trade-off among latency, initiation interval, and resource usage determines the parallelization of the accelerator logic (and vice versa). In hls4ml, this trade-off is controlled primarily by the reuse factor, a configuration parameter that sets how many times each hardware multiplier is reused within a layer computation.
For the development of AIgean, a number of improvements were made within hls4ml, including the following:
Streaming dataflow between the layers (with Galapagos)
Optimized large layers for the Dense/Linear Layer, CNN Layer, Pooling Layer, Split Layer, and Merge Layer
Modified Reuse Factor for CNN throughput
Weight reconfiguration through the use of external block RAM ports
Finally, the generated ML accelerators have interfaces that are system agnostic. In Section 4, we illustrate our extension to the Galapagos flow that enables a designer to rapidly prototype ML accelerators and deploy them in a Galapagos system with minimal effort.
3.2 AIgean Stack
Fig. 2. The AIgean stack. It includes an Application layer on top of the previously developed Galapagos stack [6].
AIgean is a development stack for deploying ML applications across multi-FPGA and CPU clusters. This logically can be seen as a superset of the Galapagos development stack with a specific application layer. Galapagos is a hardware stack that provides customization at different levels of abstraction [6]. The main goal of Galapagos is to abstract the low-level hardware plumbing required to deploy an application across multiple FPGAs while also providing the ability to port applications across multiple FPGA platforms (i.e., platforms built using different FPGA cards with different networking infrastructures). We know of no other platform that can take as input just the computation kernels and a logical description of the connections between the kernels and then generate all of the FPGA bitstreams with all of the network connectivity included. Without Galapagos, an application developer with a multi-FPGA cluster would need to be an expert in hardware design. In addition to building the computation kernels, the developer would need to incorporate into their design the interfaces to the on-board memory, the network interfaces, the network protocol hardware (most likely hardware UDP or TCP/IP cores), and configure Ethernet MAC addresses, IP addresses, the routing information for moving data between kernels, as well as build all of the packet formatting and protocol translation between the computation kernels. The FPGA vendor platforms for OpenCL [19, 20] are usable by non-hardware application developers because they abstract away these details. Galapagos does the equivalent abstraction, but for a multi-FPGA cluster environment. By building on Galapagos for AIgean, we can leverage the multi-FPGA abstraction that is provided by Galapagos, and can focus on the integration of hls4ml into the Galapagos environment.
The structure of Galapagos is analogous to a traditional software or networking stack, with each layer of the stack providing an API for the layer above. The lower the layer in the stack, the closer it is to the physical hardware. Figure 2 shows the AIgean stack.
Physical hardware and connectivity. This layer represents the physical hardware that runs applications, and for this work we focus on the FPGAs. Aside from implementing the computations in FPGA logic, we can also implement different forms of connectivity. In Galapagos, we can use PCIe, 10G SFP+ Ethernet, 100G QSFP28 Ethernet, and L1 circuit switching. For Ethernet, we can select TCP/IP, UDP, and raw L2 Ethernet. Once configured in this lower level, typical software and ML practitioners can work at a higher level of abstraction.
Hypervisor. The hypervisor3 abstracts away the I/O interfaces of a single FPGA so that the hardware applications only need to connect to a standardized interface, and they can then be implemented on any FPGA that has the same hypervisor. This is the key requirement that enables applications to be portable across multiple hardware platforms enabled with Galapagos. In the same way, the hypervisor in the software world provides an abstraction of the hardware and some level of services, typically I/O and memory.
Middleware. This layer connects the different devices within the Galapagos cluster and sets the off-chip network communication protocols between them. Within Galapagos, computation kernels can address each other and are agnostic of their placement.
Communication layer. The communication layer provides the APIs with the ability to send packets using the connections laid out by the middleware. Galapagos transmits packets using the AXI-Stream protocol [43]. All of the kernels within the cluster can reach any other kernel via AXI-Stream. Since the middleware layer provides the network address translation functionalities to convert AXI-Stream into off-chip network packets using the desired network communication, the network details and locations of kernels are abstracted away from the user. In software, this is the role of network-socket libraries or other network and communication protocols such as the Message Passing Interface (MPI) [44].
Application layer. For AIgean, the application layer is the ML layer provided by hls4ml.
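To make the communication abstraction concrete, the sketch below shows the general shape of a streaming kernel that addresses the next kernel through a destination side channel, written in Vivado HLS style. The flit type, field widths, and kernel signature are illustrative only and are not the actual Galapagos kernel interface; the point is that the kernel names a logical destination and never deals with MAC addresses, IP addresses, or sockets.

```cpp
#include <ap_axi_sdata.h>
#include <ap_int.h>
#include <hls_stream.h>

// 64-bit AXI-Stream flit with an 8-bit TDEST routing field (widths illustrative).
typedef ap_axiu<64, 1, 1, 8> flit_t;

// A kernel only sees streams of flits. It addresses the next kernel by writing
// a logical kernel ID into `dest`; the middleware layer decides whether that
// kernel lives on the same FPGA, another FPGA, or a CPU.
void relu_kernel(hls::stream<flit_t> &in, hls::stream<flit_t> &out,
                 ap_uint<8> next_kernel_id) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return
    flit_t f = in.read();
    ap_int<64> v = f.data;        // treat the payload as one signed value
    if (v < 0) f.data = 0;        // the "computation": a trivial ReLU
    f.dest = next_kernel_id;      // logical destination, independent of placement
    out.write(f);
}
```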
4 IMPLEMENTATION AND TOOL FLOW
The implementation of AIgean requires a significant effort to create a seamless integration of hls4ml and Galapagos. In this section, we describe the resulting tool flow and the modifications made to each project.
4.1 Tool Flow
The stages of the AIgean tool flow are visually presented in Figure 3. The AIgean automated flow provides a black box that takes an ML model to a CPU/FPGA cluster. In the following sections, we highlight the inner components of the black box because they can be modified by domain experts working within each part of the tool flow to explore a large design space relevant to their interests.
Fig. 3. The AIgean flow. The components in the black box are abstracted away for the average user.
Each of these stages corresponds to layers of the abstraction stack for common ML frameworks we described in Figure 1. The stages are described as follows.
4.1.1 Implementation-Agnostic Model Tooling.
This stage of the flow corresponds to the top layer of the ML stack, Applications & Algorithms. This layer of the stack is for the data scientist and ML experts, where they can tune their network for a given accuracy independent of the implementation and performance. For a given application, the users will decide on the algorithms they wish to use for their ML implementation. Using their application-specific test data, they can determine a suitable accuracy for a given neural network, independent of the implementation being done in hardware or software. Once a model is trained with the appropriate precision and performance requirements, AIgean will take this model and perform a full conversion to a distributed deep learning inference engine.
4.1.2 HLS Layer Implementation.
At this stage, the input from the previous stage is transformed by hls4ml into HLS implementations of the individual layers of the network, with one IP core generated per layer.
In Section 5, we explore two different IP core implementations of ResNet-50 as an example. This part of the flow first generates a directory structure with many sub-directories, and each sub-directory is for an individual IP core (one IP core per layer), containing the HLS source code and build files. The top-level directory also has a build file, and the user can then do a parallel build across all sub-directories of all the IP cores. For our ResNet-50 case study in Section 5.5, the generation of the directory structure and HLS source files can take on the order of minutes, whereas the HLS can take on the order of a few hours. The HLS flow also does an out-of-context place and route so that we can get a more accurate resource utilization that is then used in the partitioner.
4.1.3 Layer Partitioning.
At this stage, the user begins with IP cores described in C++ that were generated from hls4ml. A greedy partitioner then assigns these cores to FPGAs in pipeline order, using the resource estimates from the out-of-context place and route and allocating a new FPGA once a utilization threshold (80% in our experiments) is reached.
This is our first implementation of the partitioner, and it leaves much room for future work on partitioning. Given that the partitioner is a separate abstraction layer, changing the partitioner can be done without requiring any changes in the other layers. The partitioner is also implemented within Galapagos, as it is independent of the ML use case and can be applied to other domains.
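The sketch below illustrates this style of greedy, utilization-driven placement. The data structures and the order-preserving strategy are illustrative rather than the actual Galapagos partitioner code; the per-core numbers stand in for the out-of-context resource estimates, and the 80% threshold matches the heuristic used in Section 5.5.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Per-core resource estimate, expressed as a fraction of one FPGA.
// The values are stand-ins for out-of-context synthesis reports.
struct Resources { double lut, ff, dsp, bram; };

static bool fits(const Resources &used, const Resources &core, double cap) {
    return used.lut + core.lut <= cap && used.ff + core.ff <= cap &&
           used.dsp + core.dsp <= cap && used.bram + core.bram <= cap;
}

// Greedy, order-preserving partitioner: walk the cores in pipeline order and
// open a new FPGA once any resource would exceed the cap (e.g., 0.80).
std::vector<int> partition(const std::vector<Resources> &cores, double cap) {
    std::vector<int> placement(cores.size());
    Resources used = {0, 0, 0, 0};
    int fpga = 0;
    for (std::size_t i = 0; i < cores.size(); ++i) {
        if (!fits(used, cores[i], cap)) {   // current FPGA is full: allocate the next one
            ++fpga;
            used = {0, 0, 0, 0};
        }
        used.lut += cores[i].lut;   used.ff += cores[i].ff;
        used.dsp += cores[i].dsp;   used.bram += cores[i].bram;
        placement[i] = fpga;
    }
    return placement;
}

int main() {
    // Three hypothetical layer cores; the third one forces a second FPGA.
    std::vector<Resources> cores = {{0.30, 0.25, 0.40, 0.10},
                                    {0.35, 0.30, 0.35, 0.10},
                                    {0.40, 0.35, 0.30, 0.15}};
    std::vector<int> p = partition(cores, 0.80);
    for (std::size_t i = 0; i < p.size(); ++i)
        std::printf("core %zu -> FPGA %d\n", i, p[i]);
    return 0;
}
```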
Once we get a partitioning from our Galapagos partitioner, our AIgean-specific bridging cores then provide bridging based on the kernels that occur at the edges of each FPGA. This is done separately by our ML to Galapagos layer (ML2G), as the bridges required at each edge are specific to the partitioning: a different bridge is needed depending on the width of the ML kernel on the edge.
4.1.4 Software Cluster Implementation.
This stage is optional but highly recommended for heterogeneous development. The underlying Galapagos framework can wrap HLS synthesizable C++ code with software libraries to enable network socket communication. The underlying Galapagos software library [45] translates Galapagos stream packets into network packets in a user-specified off-chip network protocol (e.g., UDP, TCP). We describe the underlying Galapagos framework in Appendix A.3. Galapagos can be seen as using the standard AXI-streaming protocol, typically used for streaming kernels within a single Xilinx FPGA. There is also basic routing with a destination field within AXI-stream. Galapagos can take AXI-stream and encapsulate packets with higher-level protocols to get the convenience of a single device AXI-stream but over multiple nodes. The user at this stage can create a homogeneous cluster partitioned across multiple software nodes (with each software node taking the place of a hardware node), recreating the network topology the user wishes to have for their heterogeneous deployment. All the network connections, binary generation, and deployment are automated with the underlying Galapagos framework.
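The sketch below illustrates the idea behind this hardware/software portability; it is not the libGalapagos API. The kernel body is written against a generic stream type, so the same code can be instantiated with hls::stream for synthesis or with a small software stream class (modeling what libGalapagos provides) for execution on a CPU node.

```cpp
#include <cstdio>
#include <queue>

// Minimal software stand-in for hls::stream, just enough to run a kernel body
// on a CPU. libGalapagos provides the real software streams plus the network
// bridging behind them; this class is purely illustrative.
template <typename T>
class sw_stream {
    std::queue<T> q;
public:
    void write(const T &v) { q.push(v); }
    T read() { T v = q.front(); q.pop(); return v; }
    bool empty() const { return q.empty(); }
};

// The kernel body is templated on the stream type, so identical code can be
// synthesized by an HLS tool (STREAM = hls::stream<int>) or compiled with g++
// for a software node (STREAM = sw_stream<int>).
template <typename STREAM>
void scale_kernel(STREAM &in, STREAM &out, int factor, int n) {
    for (int i = 0; i < n; ++i)
        out.write(in.read() * factor);
}

int main() {
    sw_stream<int> in, out;
    for (int i = 0; i < 4; ++i) in.write(i);
    scale_kernel(in, out, 3, 4);   // run the "hardware" kernel body in software
    while (!out.empty()) std::printf("%d ", out.read());
    std::printf("\n");
    return 0;
}
```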
4.1.5 Heterogeneous Cluster Implementation.
Once the neural network is partitioned across multiple software nodes and shown to be working correctly, the user can then migrate parts of their software deployment into hardware nodes. This is done by simply changing a parameter in one of the Galapagos files to indicate that an IP core should be implemented in hardware rather than run in software. Since Galapagos ensures that both software and hardware nodes use the same protocol, the migration is seamless. The migration of cores from software to hardware can be done in an iterative process as the generation of hardware bitstreams can be a time-consuming process. The outputs of this stage are the final bitstreams. For each FPGA, the IP cores are put together and synthesized to a bitstream. In our ResNet-50 case study, this took on the order of a couple of hours.
4.2 Hls4ml Modifications
The full details of the modifications implemented in hls4ml are provided in Appendix A.1; here we summarize the most significant changes.
Further optimizations are applied for ResNet-50, including the fusing of batchnorm layers with the convolutional layers and the compression of 8-bit weights into single 16-bit weights so that DSP multiplier units can be used and the total number of needed multiplications is halved. Finally, an additional configuration parameter is added to the autogeneration that allows for approximate tuning of the reuse factor to obtain the desired CNN throughput. With this new option, the tuning factor for the network is defined by the desired throughput in operational clocks and the reuse factor is adjusted so that every layer achieves the desired throughput. As a result of the throughput tuning, the reuse factor will be adjusted to ensure the inter-layer latency is roughly the same. A balanced throughput avoids significant bottlenecks between the layers. The full details of the throughput tuning are described in Appendix A.1.
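The weight-compression idea can be illustrated with plain integer arithmetic: two 8-bit weights that multiply the same activation are packed into one wider operand, a single multiplication is performed, and both products are recovered from disjoint bit fields. The 16-bit shift and the sign-corrected extraction below are a simplified illustration of this standard DSP-packing trick, not necessarily the exact encoding used in the modified hls4ml layers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Compute w1*a and w2*a with one wide multiplication. The packed operand is
// (w1 << 16) + w2, so the product is (w1*a << 16) + w2*a; the low 16 bits hold
// w2*a and, once that (possibly negative) value is subtracted, the rest is w1*a.
static void packed_mul(int8_t w1, int8_t w2, int8_t a, int32_t &p1, int32_t &p2) {
    int64_t packed = (static_cast<int64_t>(w1) << 16) + w2;
    int64_t prod   = packed * a;                            // one multiplier, two products
    int16_t low    = static_cast<int16_t>(prod & 0xFFFF);   // w2*a, sign-extended
    p2 = low;
    p1 = static_cast<int32_t>((prod - low) / 65536);        // exact division: w1*a
}

int main() {
    for (int w1 = -128; w1 < 128; w1 += 37)
        for (int w2 = -128; w2 < 128; w2 += 41)
            for (int a = -128; a < 128; a += 43) {
                int32_t p1, p2;
                packed_mul((int8_t)w1, (int8_t)w2, (int8_t)a, p1, p2);
                assert(p1 == w1 * a && p2 == w2 * a);       // matches two separate multiplies
            }
    std::printf("packed multiplication matches the separate products\n");
    return 0;
}
```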
The hls4ml kernel interfaces were converted to streaming interfaces so that each kernel can be surrounded by the custom bridges that handle off-chip communication.
Fig. 4. An hls4ml kernel can be surrounded by custom bridges to enable off-chip communication.
Furthermore, stream processing allows us to send flits of data corresponding to different images within the same packet, allowing for a more efficient use of bandwidth. These streams are also configurable by allowing the user to choose how many AXI-Stream flits4 to pack within one network packet. This can be significant for FPGA-CPU links, where it is crucial to amortize the cost of network communication on the CPU due to the overhead added by the Linux network stack.
4.3 Galapagos Modifications
To explore the design space of large ML networks (like ResNet-50) across multiple FPGAs, we developed an automated partitioner to work with the rest of the Galapagos framework. When we turned to ResNet-50 to implement a very large network, it quickly became clear that we needed an automated means for partitioning a large application to make the best use of the resources. Our first partitioner is described in Section 4.1.3 and is not specific to just hls4ml; it can be used by any application built on Galapagos.
The original Galapagos framework supported 10G TCP and L2 Ethernet for off-chip communication. We designed a bridge to provide the option for 100G UDP cores to increase the performance of our network links and to reduce the probability of the network communication being the bottleneck. This enhances the capability of any application using Galapagos, not just AIgean.
To support 100G, we also improved the portability within Galapagos to support multiple bit-widths of data. The prior bridging within Galapagos assumed that all kernels communicate over AXI-Stream with a destination side channel. An additional bridge is therefore required to adapt kernels with other data widths to the off-chip links; the IP cores that provide this bridging in Galapagos are shown in Figure 5.
Fig. 5. The IP cores providing the bridging in Galapagos.
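As an illustration of the kind of width adaptation this requires, the sketch below packs eight 64-bit kernel flits into one 512-bit beat for a 100G core. The framing is deliberately simplified (packet lengths are assumed to be a multiple of eight flits, and the whole beat inherits the destination of its first flit); it is not the actual Galapagos bridge implementation.

```cpp
#include <ap_axi_sdata.h>
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_axiu<64, 1, 1, 8>  narrow_t;   // kernel-side flit
typedef ap_axiu<512, 1, 1, 8> wide_t;     // 100G-core-side flit

// Pack eight 64-bit kernel flits into one 512-bit word for a 100G UDP core.
// Simplified: the packet length is assumed to be a multiple of eight flits.
void widen_bridge(hls::stream<narrow_t> &in, hls::stream<wide_t> &out) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return
    wide_t w;
    w.user = 0;
    w.id   = 0;
    for (int i = 0; i < 8; ++i) {
#pragma HLS PIPELINE II=1
        narrow_t n = in.read();
        w.data.range(64 * i + 63, 64 * i) = n.data;
        if (i == 0) w.dest = n.dest;      // route the whole beat like its first flit
        if (i == 7) {
            w.last = n.last;              // end of packet follows the narrow stream
            w.keep = -1;                  // all 64 bytes of the wide beat are valid
            w.strb = -1;
            out.write(w);
        }
    }
}
```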
A major goal of AIgean and Galapagos is the ease of development. Galapagos offers functional portability of cores between hardware and software by implementing software libraries to model the hardware routers and bridges shown in Figure 5. The software library (i.e., libGalapagos) is described in the work of Tarafdar and Chow [45]. Furthermore, Galapagos allows fast simulations of the cluster thanks to the combination of the HLS IP cores (in C++) and libGalapagos for the connections between cores. When we designed the additional cores and bridges for AIgean (e.g., 100G UDP core), we also implemented the libGalapagos equivalent of these cores to maintain functional portability. On top of prototyping the entire cluster in software, we have also added the ability to simulate the entire cluster in RTL. We designed an RTL model of a network switch that is configurable and can simulate the latency between network links. This capability has been invaluable during the development of AIgean as a platform but would not be required during the normal use of AIgean. The combined efforts of both these frameworks result in a fully configurable design space exploration tool of multi-node heterogeneous ML clusters.
5 RESULTS
This section presents the outcomes of our efforts to build AIgean. It is important to emphasize that the initial goal of this work is to build a platform to enable the development of multi-FPGA ML applications. The performance results that we report here demonstrate that AIgean is working, and even with our first example applications, the results are reasonable. For this work, we claim success if we are able to easily build ML applications and map them to multiple network-connected FPGAs. There is much room to tune for application performance given a working AIgean, and we will now be using AIgean to explore opportunities for tuning and to build other kinds of networks.
We first describe the hardware testbed used for our experiments and discuss the ease of use of AIgean in its current state with a case study we did for our own experiments. Then we present more quantitative results by addressing the physical limits of the communication links, and finally we present the current performance results of our first applications starting with a small network to illustrate the latency benefits of using network-connected FPGAs and then for ResNet-50 as a test to see whether we can implement a very large network.
5.1 Hardware Testbed and Tools
Our hardware testbed comprises Supermicro servers with Intel Xeon E5-2650V4 CPUs and 64 GB of memory. The FPGA boards we have available are Alpha Data 8K5s with a Xilinx KU115-2 FPGA, Fidus Sidewinders with Xilinx ZU19EG FPGAs, and Xilinx Alveo U200 and U250 cards with XCU200 and XCU250 FPGAs, respectively. We have 16 Sidewinders mounted in a 16-slot PCIe chassis, and the other boards are mounted in PCIe slots of our servers. For the network interconnect, we have two Dell S4048-ON 10G and two Z9100-ON 100G switches. The servers are connected to a 10G switch and the FPGA boards are mostly connected to 100G switches. For the AIgean tests reported here, we used Vivado 2019.1 and Sidewinder FPGA boards connected to 100G switches.
The SDAccel platform we used was on an Amazon f1.2xlarge instance using SDAccel v2018.2. For those tests, the autoencoder described in Section 5.4 was compiled as a single SDAccel kernel.
5.2 Ease of Use Case Study
While working on this article, we have gone through several iterations of different layers of the stack. One iteration involved optimizing our HLS library to use DSPs more efficiently, with the same functionality. We describe the results in Section 5.5, but we would like to discuss the steps required to change our cluster implementation between the two IP core implementations. This change was done in the hls4ml layer implementations only; because the interfaces of the IP cores did not change, no other layer of the stack had to be modified.
5.3 Communication Protocol
In this section, we present the latency and throughput measurements for different link configurations. This is to provide some understanding of the penalties for communication over network links. For communication between nodes, we can use a 100G UDP core [46] or a 10G UDP core [47] on the FPGA. These are interchangeable within our framework by the user. The 100G core uses a 512-bit interface as compared to the 64-bit interface for the 10G core. The CPU NIC we use is a 10G SFP NIC [48], even when communicating to a 100G FPGA. The specific FPGA board we are using is the Fidus Sidewinder with an MPSoC FPGA [49]. For latency measurements, we send a single flit of data (8 bytes) using the four different types of links listed in Table 1. The results in Table 1 involving software are shown with the 100G UDP core on the FPGA, but similar results are observed when using the 10G UDP core. From hardware to software, we observe the FPGA outputting at 100 Gb/s, but we experience packet drop in the software when doing the throughput measurement. Observe that the links involving software are limited by the CPU network stack and library implementation, whereas the FPGA-to-FPGA links can transfer at the full network bandwidth. Note that the results in Table 1 are for the raw throughput, including the protocol headers, which is why it is possible to achieve the full link bandwidth when using the FPGAs.
5.4 Autoencoder
Here we describe our first small multi-FPGA network implemented with AIgean. We consider an example network with applications for high-energy physics. Specifically, our network is an autoencoder designed to detect anomalous events, potentially from new physics. An autoencoder is an unsupervised learning technique that leverages a neural network where a bottleneck in the shape of the network forces a compressed representation of the original input. Details about the model and use cases can be found in Appendix A.2.
This network is a very interesting size for our studies, as it can be implemented on a single FPGA, but this requires a high degree of resource reuse that necessarily increases the inference latency. When splitting the network across multiple FPGAs, we can adjust the throughput and latency of the network by changing the reuse factor and compiling the network across multiple FPGAs. The network split across multiple FPGAs will have a higher throughput but incurs some latency from the transfer of the intermediate results.
The resources for the autoencoder network are shown in Table 2 along with the resources available on the FPGAs we used. To test this autoencoder, we considered two separate implementations of the network: an implementation on an AWS F1 instance (VU9P FPGA) using SDAccel, and a second implementation using AIgean on three Sidewinder (ZU19EG FPGA) boards. What is notable is that the single FPGA implementation would not be able to fit on a single Sidewinder board, and it would have to be spread over multiple FPGAs for the chosen reuse factor. The single FPGA implementation also requires more than one super logic region, and as a consequence has difficulty meeting timing when compiled on the F1 instance with SDAccel.
Table 3 highlights the results from implementing the autoencoder on various devices as well as on a single FPGA using SDAccel and three FPGAs using AIgean.
Our 1-FPGA autoencoder is clocked at 125 MHz at a low reuse factor when using SDAccel. Limitations in our version of SDAccel, as well as the resources required for the FPGA, prevented us from using a higher clock speed. For the 3-FPGA version, we used AIgean and were able to achieve 200 MHz for two of the FPGAs and 190 MHz for the third one. We did not try to improve it, so we will use 190 MHz since that is the limitation. To make a fair comparison to the 1-FPGA implementation, we scale the AIgean latency by the ratio of clock speeds and get \(0.08 \times 190/125 = 0.12\)ms, which is still \(0.24/0.12 = 2\) times better than the latency using SDAccel. This shows that there is still a significant architectural advantage to using multiple FPGAs and is not unexpected because more resources are available. The performance increase with three FPGAs can be attributed to (a) the use of networking to directly communicate with the FPGA, yielding low latency, and (b) less demanding resources per FPGA since only one-third of the model is implemented on each FPGA.
The implementations of this model on both a single FPGA and the full three FPGAs have an initiation interval of 552 clocks and require roughly the same resources (the reuse factor is the same). In other words, the three FPGAs are capable of processing a new image every 2.76 \(\mu s\) (362 KHz). Such a throughput approaches the demands needed for real-time processing of anomalies at the LHC. Although the single FPGA implementation with SDAccel has a potential throughput that is half that of the 3-FPGA implementation, achieving this throughput would require efficiently buffering the inputs and outputs by sending larger batches of calls on and off the FPGA through the DDR and PCIe transfers. As a consequence, the individual (batch-1) latency would be significantly degraded for the final throughput to approach half that of the 3-FPGA implementation.
5.5 ResNet-50
To test AIgean on a much larger network, we have developed a multi-FPGA implementation of ResNet-50 [50]. The flexibility provided by AIgean allows us to target a high throughput implementation whereby we unroll the multiplication in each CNN layer at a rate corresponding to the number of pixels that are being used in each respective CNN layer. This allows for the design of ResNet-50 that can be balanced across the different CNN layers to have a uniform throughput.
Most of ResNet-50’s architecture can be broken down into many sub-blocks consisting of a Split, two to three convolutions followed by a Relu, and an addition operator as shown in Figure 6. The dashed boxes represent the IP block granularity that we have used within our implementations.
Fig. 6. Sub-blocks found throughout ResNet-50 and our IP cores.
We have two implementations of ResNet-50: the first requires 12 Sidewinder boards using int-8 precision (ranging from 80% to about 90% of the resources used on each FPGA); the second is more DSP efficient and requires 10 Sidewinder boards, also using int-8 precision. We have one FPGA available to use as a 100G data generator that can feed inputs at line rate to the FPGAs. For the 12-FPGA configuration, we tested in a piece-wise fashion.5 We tested the traffic generator with the first 10 of the 12 FPGAs, followed by the traffic generator with the remaining 2 FPGAs. We have verified that the full 10-FPGA configuration and the piece-wise 12-FPGA configuration can run at 660 images per second.
Table 4. Performance of Different Layers and Implementations at Batch 1
Table 4 summarizes the throughput and latency results of our full 12-FPGA implementation of ResNet-50. When the source data is coming from the CPU, we observe that the maximum throughput is only 400 images per second with a latency of 7 ms due to the bandwidth limitation between the CPU and the FPGA (5-ms latency between the CPU and the FPGA). To demonstrate the full performance achievable with the FPGAs, we use the FPGA data generator and observe a throughput of 660 images per second with a latency of about 1.9 ms. The latency is determined through a simulation of the full ResNet-50 network where each layer is separately run in parallel. The network delay between each FPGA is estimated from Table 1 using the QSFP. For 10 hops, the total network delay would be 0.0017 ms, which is insignificant compared to the computation latency. The next row gives the values for Microsoft's Brainwave [51]. For the latency of Brainwave, we quote the end-to-end latency determined from sending an image to a Brainwave server and then receiving the result for a CPU within the same computing cluster. The final row shows the performance for an Nvidia V100 GPU using the mixed precision implementation of ResNet-50 applied for batch 1. The latency and throughput quoted are obtained through the use of the Triton inference server with a client on the same machine. As a consequence, the latency numbers include the PCIe transfer time in addition to the network inference. Equivalent numbers quoted by Nvidia yield a batch-2 latency of 1 ms with a throughput of 2,000 images per second for the same model [52]; batch-1 latency is not quoted.
| FPGA Number | Flip-Flops (%) | LUTs (%) | DSPs (%) | BRAM (%) |
|---|---|---|---|---|
| 0 | 31.8 | 40.1 | 71.5 | 1.73 |
| 1 | 26.7 | 35.1 | 74.8 | 11.9 |
| 2 | 11.12 | 12.06 | 74.8 | 0.68 |
| 3 | 49.3 | 66.6 | 65.0 | 6.90 |
| 4 | 38.3 | 50.9 | 71.5 | 2.00 |
| 5 | 20.5 | 23.2 | 78.5 | 14.0 |
| 6 | 54.0 | 72.6 | 65.0 | 7.08 |
| 7 | 57.3 | 75.9 | 65.0 | 10.1 |
| 8 | 60.1 | 78.2 | 68.3 | 13.8 |
| 9 | 58.9 | 76.5 | 52.0 | 7.26 |
| 10 | 44.5 | 57.4 | 58.5 | 5.12 |
| 11 | 30.9 | 39.9 | 38.7 | 8.72 |
| Total Absolute Resources Used Across All FPGAs | 5.05 M | 3.28 M | 15.4 K | 31.5 MB |
| Total Resources Available Per FPGA | 1.04 M | 522 K | 1.9 K | 34.6 MB |
Table 5. Resource Utilization Percentage of Each FPGA and Total Resources Available Per FPGA
Table 5 summarizes the resources used for our 12-FPGA implementation. Note that this was partitioned with our greedy partitioning scheme that uses a heuristic of 80% utilization before allocating the next FPGA. The 10-FPGA configuration is very similar in terms of resources but with half the DSPs. A number of the layers early in the network are smaller, and the FPGAs implementing them are DSP limited, whereas the FPGAs implementing the larger layers later in the network are logic limited. The most highly utilized resource of each FPGA represents its limiting factor (with the exception of the last FPGA, which is not fully used). For perspective, the total resources available on an individual FPGA are shown at the bottom of Table 5. This FPGA is approximately equivalent to a single SLR of the VU9P FPGA in the Amazon F1 instance (each VU9P having three SLRs) [53]. For further perspective, we can also compare this to the Xilinx Alveo U250 [54]. Our current utilization is DSP limited, and we could fit our entire ResNet-50 implementation on two Alveo U250 boards, where the U250 board has 12.2K DSPs.
Last, we would like to contrast this implementation with previous implementations of ResNet-50. The design flow of AIgean differs from previous 8-bit implementations of ResNet-50 in that no overlay is used, and each layer is implemented separately. In this scenario, it is possible to continuously stream images through the implementation without having to wait for an image to be complete. With the overlay architecture, the images are streamed through each layer to a buffer and then subsequent layers are loaded and the next layer is streamed. As a consequence, a scheme is needed for buffering of each input. Additionally, some time is needed to switch between layers. With the AIgean design flow, the whole network exists on the FPGA fabric, and so images can be continuously pumped through. This leads to a more efficient use of multiplier resources, at a cost of additional resources to route individual layers together. Since images are continuously pumped through, we achieve batch-1 streaming. Additionally, since we are continuously pumping images through, the amount of buffering between the layers is limited to just the partial image that is needed for matrix multiplications of the CNN applied to nearby pixels.
To understand the efficient use of resources, we compute the total number of multiplication operations needed for a perfectly efficient FPGA clocked at 200 MHz. With our implementation of ResNet-50, we find a total of 4 billion multiplications, which if we divide by \(3 \times 10^{5}\) clocks to achieve a 1.5-ms latency at 200 MHz yields a total of 13,500 multiplications per cycle. Our current implementation uses 15,419 DSPs, which is slightly more due to the fact that many of the individual layers are tuned to a latency that is actually below 1.5 ms. The number of DSPs can be reduced through two means: first, through the sharing of DSPs, which is only partially implemented here, and second, through the use of a faster clock frequency. The sharing of DSPs would lead to roughly a factor of 2 reduction in DSPs. A faster clock frequency would yield a lower latency for the same number of DSPs. Since each multiplier unit is mapped directly to a specific multiplication within the network, the only way to use the DSP resources inefficiently is when no allowed reuse parameter for a layer is close to the desired throughput, in which case that layer ends up with a significantly lower latency than its neighboring layers.
Adjustment of the reuse parameter effectively modifies the initiation interval of each layer. A reuse factor of 5,000 corresponds to a layer that has an initiation interval of 5,000 clocks. To efficiently adjust the reuse parameters with hls4ml, we rely on the throughput tuning described in Appendix A.1, which computes a reuse factor for each layer from a target per-layer throughput.
When adjusting the reuse factor, we observe a direct correlation with the number of DSPs. Halving the reuse factor will halve the initiation interval of the matrix multiply within a layer, and it will also double the number of DSPs. Flip-flops and LUTs will not change as significantly, since they largely exist to store partial images. BlockRAMs are used primarily to store the weights of the neural network on the FPGA. Their second use is to act as a buffer between layers. As a consequence, the BlockRAM resources will not change significantly with the reuse factor. In this current implementation, since DSP sharing of the multiplications is only partially used, the resulting resources are more consistent with a ResNet-50 implementation having a latency of roughly half the observed latency (0.75 ms).
Faster implementations of ResNet-50 are possible by adjusting the reuse factor. However, for CNNs, a lower bound is present in the current, pixel-by-pixel implementation of the algorithm. The lower bound results from the fact that for each pixel that streams through the algorithm, there is a one-clock latency. Furthermore, there is an additional latency of three clocks to prepare the inputs to run the matrix multiplication. For layers within the network where there are many pixels, such as the first layer, the ultimate latency is limited by these operations. Applying this limit to the first layer of ResNet-50, we find that the time for that single layer to process one image is bounded below by roughly 0.4 ms. Lower single-inference latencies can still be achieved by splitting the image into sub-images and simultaneously streaming these sub-image streams into separate, cloned implementations of the chosen layer. Although the use of multiple streams effectively reduces the single-inference latency by a factor of the number of streams, it has the added cost of increasing the resources by the number of streams.
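As a sketch of the sub-image splitting idea, the stream splitter below sends the top half of the image, plus a two-row halo, to one clone of a layer and the bottom half to another. It assumes a 3x3 convolution with no padding (so two overlapping rows are enough for the halves to be computed independently); the data type and interface are illustrative.

```cpp
#include <ap_fixed.h>
#include <hls_stream.h>

typedef ap_fixed<16, 6> pix_t;   // illustrative fixed-point pixel type

// Split a row-major pixel stream into two sub-image streams (top and bottom
// halves) with a two-row overlap, so that two cloned 3x3 "valid" convolution
// layers can each produce their half of the output rows independently.
void split_rows(hls::stream<pix_t> &in, hls::stream<pix_t> &top,
                hls::stream<pix_t> &bottom, int width, int height) {
    int half = height / 2;
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c) {
#pragma HLS PIPELINE II=1
            pix_t p = in.read();
            if (r < half + 2) top.write(p);     // rows 0 .. half+1 (includes the halo)
            if (r >= half)    bottom.write(p);  // rows half .. height-1
        }
}
```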
6 FUTURE WORK
The work demonstrated within this article is the first prototype of what is possible with an open multi-FPGA ML platform. This leaves much room for improvement in all areas of the ML stack in Figure 1. Within the hardware layer, there is much room for optimization of the IP cores themselves. Furthermore, splitting images into multiple streams can effectively remove any throughput limitation and increase overall throughput, at the cost of larger resource usage.
Once the IP cores are further optimized, it is our hope that communication once again becomes the bottleneck. When this is the case, we should explore more intelligent partitioning schemes that limit communication across the FPGA boundary. At the moment, the partitioner is a greedy solution looking solely at resource utilization, without taking into consideration the communication patterns between IP cores within the cluster.
Finally, as hls4ml continues to evolve, new layer types and precisions can be incorporated into AIgean, and because of the layered structure of the stack, such additions require minimal changes to the other layers.
7 CONCLUSION
AIgean is a platform for mapping ML applications onto a cluster of network-connected FPGAs. This is much more scalable and has higher performance for computing than using the FPGA vendor tools, which are principally targeted at a single server with a handful of PCIe-connected FPGAs. Results from our initial implementations of actual networks show the benefits of using FPGAs for low-latency applications. We have also built two implementations of ResNet-50 to show that AIgean can implement very large networks.
The structure of AIgean is a number of abstraction layers spanning the entire computing stack, from the ML development layer at the top down to the physical hardware layer, where the computing and communication are implemented in the FPGA. This gives multiple opportunities to optimize the computing stack and for research depending on the area of interest and the design expertise available, that is, from ML algorithms down to low-level hardware design.
The layered approach makes it easier to implement AIgean because it is possible to leverage the Galapagos multi-FPGA platform and only add an additional application bridge to the Galapagos library. It also makes it possible to quickly add automation in the translation of the hls4ml-generated IP cores into a Galapagos cluster description.
By leveraging Galapagos, AIgean is also portable to other FPGA platforms as long as the low-level hypervisor layer in the FPGA is created. AIgean also leverages the ability of Galapagos to deploy computing kernels to either CPUs or FPGAs such that an application can be first debugged and characterized entirely in software before committing all or parts of it to FPGA hardware.
The experience of developing AIgean has demonstrated the challenges of building a multi-FPGA application development platform that is portable across many FPGA boards, but it proves that it is feasible in a reasonable amount of time.
AIgean is available as an open source project and can be downloaded at https://github.com/UofT-HPRC/AIgean.
Acknowledgments
We sincerely thank the reviewers for their helpful comments that significantly improved the quality of this article.
A APPENDIX
This is an appendix covering details on both hls4ml and Galapagos, as well as the NN models used in this article.
A.1 HLS4ML
Extending this paradigm to larger networks, such as ResNet-50, was not considered part of the scope of hls4ml when it was first developed. To support such networks within AIgean, optimized large-layer implementations were developed for the following layers:
Dense/Linear Layer
CNN Layer
Pooling Layer
Split Layer
Merge Layer
These layers make up the core of most deep neural networks in use. Furthermore, the large-layer design flow established through the development of AIgean will enable the fast implementation of other layers following the AIgean implementations. In this appendix, we outline the various adaptations needed in hls4ml to implement these layers at scale.
A.1.1 Design Flow.
To enable large CNN layers within hls4ml, the design flow generates each layer as a separate IP core with streaming interfaces between layers, rather than synthesizing the whole network as a single monolithic core.
A.1.2 Neural Network Weights.
Within hls4ml, the neural network weights are stored in on-chip memory. For AIgean, the weight storage was extended to use block RAMs with external ports so that the weights can be reloaded at run time without resynthesizing the design.
A.1.3 Dense/Linear Layer.
A single dense (tf notation) or linear (PyTorch notation) layer is the core of most NN implementations where matrix multiplication is performed. In hls4ml, the dense layer is implemented as a matrix-vector multiplication whose parallelism is controlled by the reuse factor, with the multipliers organized as a systolic array (see Section A.1.7).
A.1.4 CNN Layer.
To avoid large outputs and to improve the overall throughput, a new CNN layer was developed, which operates on an image pixel by pixel. In this scenario, images are streamed through an array of streams between layers, with each depth element in the stream corresponding to a single pixel of an image. Pixels are then streamed one at a time into a layer, and the resulting output pixel is streamed to the next set of layers. The partial image is stored within a layer using a line buffer implementation. The line buffer is implemented as an array of shift registers, and we rely on the specialized HLS shift register objects to ensure the final implementation uses explicit shift register logic elements (SRLs). The line buffer is further optimized to store the minimal number of pixels required by the kernel size (convolution kernel height \(\times\) row). An intermediate buffer is also used to store the kernel window before the matrix multiply needed for the convolution kernel. The matrix multiplication within the CNN kernel uses the default dense layer within hls4ml.
The reuse factor for the CNN kernel thus defines the reuse per output pixel (i.e., the reuse is tied to the matrix multiplication for the kernel). For the base implementation, the overall latency of the convolution kernel is five clocks per output pixel plus the additional reuse factor for the dense layer. This five-clock overhead can be reduced for small networks by utilizing a fully partitioned line buffer at the cost of more resources. Zero padding for the individual layers is built into the layer implementations.
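The sketch below shows the overall structure just described: a pixel-streaming 3x3 convolution with a line buffer holding the two previous image rows and a register window feeding the multiply. It is heavily simplified (single input channel, one filter, no padding, no reuse-factor handling, fixed image width) and illustrates the dataflow rather than the hls4ml implementation itself.

```cpp
#include <ap_fixed.h>
#include <hls_stream.h>

typedef ap_fixed<16, 6> pix_t;   // illustrative fixed-point pixel type
const int W = 32;                // image width, fixed at compile time

// 3x3 single-channel "valid" convolution over a pixel stream. The two previous
// rows are kept in line buffers; a fully partitioned 3x3 window feeds the multiply.
void conv3x3_stream(hls::stream<pix_t> &in, hls::stream<pix_t> &out,
                    const pix_t k[3][3], int height) {
    pix_t line[2][W];            // line buffers: the two previous rows
#pragma HLS ARRAY_PARTITION variable=line dim=1 complete
    pix_t win[3][3];             // sliding kernel window (registers)
#pragma HLS ARRAY_PARTITION variable=win dim=0 complete

    for (int r = 0; r < height; ++r)
        for (int c = 0; c < W; ++c) {
#pragma HLS PIPELINE II=1
            pix_t p = in.read();
            // shift the window left and insert the new column from the line buffers
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 2; ++j) win[i][j] = win[i][j + 1];
            win[0][2] = line[0][c];
            win[1][2] = line[1][c];
            win[2][2] = p;
            // push the new pixel through the line buffers for later rows
            line[0][c] = line[1][c];
            line[1][c] = p;
            // one output pixel per input pixel once the 3x3 window is valid
            if (r >= 2 && c >= 2) {
                pix_t acc = 0;
                for (int i = 0; i < 3; ++i)
                    for (int j = 0; j < 3; ++j) acc += win[i][j] * k[i][j];
                out.write(acc);
            }
        }
}
```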
A.1.5 Pooling Layer.
In addition to a CNN layer implementation, a pooling layer is added following the same data flow as in the CNNs (array of streams). The pooling layer implementation is similar to the CNN layer, except that it performs pooling instead of the convolution kernel matrix multiplication present in the CNN.
A.1.6 Split/Merge Layers.
To allow for the possibility of ResNet-50, split and merge layers are added to hls4ml. In a ResNet-style network, a split duplicates a stream to feed the skip connection, and a merge combines two streams (e.g., by element-wise addition).
A.1.7 Reuse Factor.
The reuse factor for the CNN maps directly to the reuse factor of the dense layer implementation. The dense layer has two internal implementations depending on the size of the reuse factor. For instances where the reuse factor is smaller than the number of input features, the systolic array is split across the input regions so that neighboring multiplications are accumulated into the same or adjacent output feature. For instances where the reuse factor is larger than the number of inputs, the systolic array is split across the output features; the input features are multiplexed and multiplied before being accumulated across the output features. These optimizations were chosen to ensure optimal resource usage in the matrix multiply and to allow for large matrix multiplications with millions of weights. As a consequence of these choices, reuse factors that are multiples of the number of inputs and outputs are particularly resource efficient.
To allow for a balanced throughput between the layers, we have modified the reuse factor for AIgean kernels to account for the per-layer throughput. Instead of defining a single DSP reuse factor for the whole network, the reuse factor per layer is computed dynamically from a desired per-layer throughput for the total network. To compute the optimized reuse factor, we rely on an analytic formula for the layer throughput as a function of the reuse factor. This formula, which gives the total number of clock cycles per image for a layer, \(R\), is defined as
\begin{equation} R = N_{\it pixel} \left(6 + \frac{N_{\it in}\,N_{\it out}}{r}\right) \tag{1} \end{equation}
where \(r\) is the reuse factor per layer, \(N_{\it pixel}\) is the number of pixels in a CNN layer, \(N_{\it in}\) is the number of input features, and \(N_{\it out}\) is the number of output features. The additional 6 approximates the number of clocks per pixel that a single layer requires to perform a shift, fill, and one reuse of the matrix multiply; this sets a lower bound on the latency of a single CNN layer of \(6 N_{\it pixel}\) clocks. Ultimately, this limitation is soft, since multiple CNN layers can operate on separate parts of an image.
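As a small worked illustration of how a per-layer reuse factor can be derived from a clock budget by inverting Equation (1) (our own sketch; the actual AIgean tooling may compute this differently, and the function names and example numbers are ours):

```cpp
#include <cmath>
#include <cstdio>

// Equation (1): clocks per image for one CNN layer.
double layer_clocks(double n_pixel, double n_in, double n_out, double r) {
    return n_pixel * (6.0 + n_in * n_out / r);
}

// Smallest r meeting a clock budget: per Eq. (1), R decreases as r grows, so
// R <= R_target requires r >= N_in*N_out / (R_target/N_pixel - 6).
double min_r_for_budget(double n_pixel, double n_in, double n_out, double r_target) {
    double clocks_per_pixel = r_target / n_pixel;
    if (clocks_per_pixel <= 6.0) return -1.0;  // infeasible: below the 6*N_pixel floor
    return std::ceil(n_in * n_out / (clocks_per_pixel - 6.0));
}

int main() {
    // Example: a 56x56-pixel layer with 64 input and 64 output features and a
    // budget of 300,000 clocks per image (1.5 ms at 200 MHz).
    double n_pixel = 56.0 * 56.0, n_in = 64.0, n_out = 64.0, budget = 300000.0;
    double r = min_r_for_budget(n_pixel, n_in, n_out, budget);
    std::printf("r = %.0f -> R = %.0f clocks\n", r, layer_clocks(n_pixel, n_in, n_out, r));
    return 0;
}
```

For these example numbers the result is r = 46, giving roughly 298,000 clocks per image, just inside the budget.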
A.2 Models
In this appendix, we present a detailed description of each NN model used in this paper. The choice of these architectures is partly motivated by use in high energy physics, where low-latency deep neural network inference is an essential tool for operation. As a consequence, we comment on the model architecture and its application to problems within physics.
A.2.1 Autoencoder.
Autoencoders are unsupervised networks often used to identify anomalous features. By creating an information bottleneck within the network, autoencoders can compress and classify detector-level information. The autoencoder considered in this example is capable of identifying anomalous collisions at the LHC (Large Hadron Collider). In particular, this network can identify top quark pair production, Higgs boson pair production, and other more exotic final states.
At the Large Hadron Collider, this network has a direct application through the integration of data scouting/trigger-level analyses [58]. Data scouting is a process whereby partially reconstructed collisions are read out and processed to investigate collisions that are normally discarded in the LHC data flow chain. This technique is particularly powerful in the search for dark matter [59]. In this case, events can be analyzed at a rate as high as 40 MHz, and the autoencoder can be used to create an “anomaly stream” at a reduced rate. With a maximum data rate on the order of 50 terabits per second within the LHC trigger “scouting” stream, throughput and low latency are critical, since any delay in even a single inference would require significantly larger buffers. Distributing this system onto multiple FPGAs brings a significant advantage, since it allows for very low latency while preserving the ability to pipeline events with a small initiation interval.
The autoencoder network is trained using events that involve known and understood physics processes. Thus, any event that cannot be encoded and decoded accurately is a potential candidate for new physics searches. The inputs and outputs of the network are 276 expert event features. The number of hidden features in the first layer is 276. The second layer reduces this by 1/3 to 184, the third by 1/2 to 92, and the width remains at 92 features for the next 6 layers before the hidden features expand back symmetrically to 276 output features. The compression factor in the bottleneck is thus 3, and in total the network consists of 12 fully connected layers and over 300,000 weights. ReLU activations are used between the fully connected layers.
A.2.2 ResNet-50.
Lastly, we consider the well-known ResNet-50 benchmark. ResNet-50 is a deep neural network used for image processing [60]. The ResNet-50 architecture has been shown to be very versatile. In particular, quantized ResNet-50 has been retrained for the task of top quark identification within high energy physics, leading to results that are comparable to world-leading algorithms [61]. More recently, ResNet-50 has become a standard benchmark for neural network inference performance. With the development of quantized neural networks, the 8-bit implementation of ResNet-50 has overtaken the floating-point implementation as the standard inference benchmark; the 8-bit ResNet-50 has been shown to yield almost identical accuracy to the full-precision implementation with the added advantage of 8-bit operations.
For the implementation of ResNet-50 considered in this paper, we use the 8-bit version. The network consists of 50 convolutional layers, 2 pooling layers, a dense layer, 16 merge layers, 16 split layers, and 50 batch normalization layers. To minimize the total amount of computation, the batch normalization layers are fused with the convolutional layers. ResNet-50 takes an input image of \(224 \times 224\) pixels and iteratively reduces it to a final image of \(7 \times 7\) pixels. The total number of multiplications in ResNet-50 is 4.8 billion. For a target throughput of 650 Hz, this translates to 3.1 trillion multiplications per second, or roughly 15,000 concurrent multiplications at 200 MHz. In the AIgean implementation, we perform two 8-bit multiplications per DSP, so 7.5k DSPs would suffice at 100% efficiency. Our actual usage is 9.9k DSPs, corresponding to an aggregate efficiency of 77%. The efficiency is below 100% because the ResNet-50 design was synthesized for a throughput somewhat faster than the 1.5 ms-per-image target to ensure that target is met: the per-image processing time of each layer ranges between 1.0 ms and 1.4 ms, with most layers at about 1.3 ms.
A.3 Galapagos
Galapagos is a heterogeneous deployment stack that allows users to deploy streaming IP cores on a cluster of FPGAs and CPUs. First we will describe the high-level abstraction model of Galapagos and then delve into each layer of abstraction that we implemented.
A.3.1 High-Level Abstraction.
The end goal of Galapagos is to be able to overlay a data flow graph of streaming IP cores onto a cluster of devices, without the user having to worry about how to physically connect these IP cores to each other on one device or across devices. The IP cores should be able to address their destinations by target IP core, independent of how and where that IP core is implemented and placed. The IP cores themselves use AXI-stream with a destination side channel to communicate with each other; this is typically used within a single FPGA for routing amongst AXI-stream kernels. Our goal is to provide this seamlessly across many devices, giving the user an abstraction similar to a single device. This can be seen as AXI-stream over the data center. We accomplish this by automating the encapsulation and decapsulation of AXI-stream packets with higher-level network protocols. Figure 7 shows the high-level overview of Galapagos. On the left-hand side is a placement- and implementation-agnostic data flow graph of streaming IP cores, and the right-hand side shows how they are placed and implemented. Prior to this work, the user had to provide configuration parameters to define the mapping of kernels to devices, but even that is now abstracted away with a partitioner. While the vendor toolkits provide the ability to add networking to a design [62, 63], the application must be explicitly aware of the networking.
Fig. 7. An overview of Galapagos. The user provides the implementation- and network-agnostic data flow graph of streaming IP cores, and our tool flow implements the right-hand side, with the appropriate bridging to connect the devices together.
Fig. 8. An example of the lowest level of abstraction. With the abstraction provided, all of these streaming devices have a consistent interface and can communicate with one another.
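To make the kernel-facing abstraction concrete, the following HLS-style C++ sketch (our illustration; the flit width, destination width, and kernel name are assumptions, not the Galapagos-shipped code) shows a streaming kernel that forwards a packet and addresses its output purely by logical kernel ID via the TDEST side channel:

```cpp
#include "ap_axi_sdata.h"
#include "ap_int.h"
#include "hls_stream.h"

// One flit: 64 data bits, 16-bit TDEST holding the destination kernel ID
// (widths are our assumption for illustration).
typedef ap_axiu<64, 1, 1, 16> galapagos_flit;

void example_kernel(hls::stream<galapagos_flit> &in,
                    hls::stream<galapagos_flit> &out) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return

    const ap_uint<16> NEXT_KERNEL_ID = 2;  // assumed destination kernel number

    galapagos_flit flit;
    do {
#pragma HLS PIPELINE II=1
        flit = in.read();
        // ... per-flit computation on flit.data would go here ...
        flit.dest = NEXT_KERNEL_ID;  // logical destination, not a physical address
        out.write(flit);
    } while (!flit.last);
}
```

Because the destination is a logical kernel number, the same kernel code works whether the target kernel sits on the same FPGA, another FPGA, or a CPU.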
Galapagos is built from the bottom up through several layers of the stack. At the bottom, individual devices are abstracted to appear as streaming devices, as shown in Figure 8. We have implemented this on different FPGAs and CPUs, but it could be extended to other devices such as IoT sensors. Once each device is abstracted, we can connect the devices together seamlessly at the protocol level. Furthermore, we can look at the finer granularity of IP cores that run on these devices, and even migrate implementations of these cores as long as they have a consistent interface. Galapagos provides higher levels of abstraction to place streaming IP cores on these devices and to connect and route amongst these IP cores, both within one device and across multiple devices. This is done through the implementation of the following layers of the stack: Physical Hardware and Connectivity, Hypervisor, Middleware, and Communication.
A.3.2 Physical Hardware and Connectivity.
This layer of the stack refers to the physical devices in the cluster and how they are connected. Currently, we have created clusters of FPGAs (the Fidus Sidewinder, Alphadata 7v3, and Pynq ZC702), x86 CPUs, and connectivity using 1G Ethernet, 10G SFP, and 100G QSFP. The only requirement for these devices is to have some connection to the network that we can attach to a network switch, but even this requirement could be abstracted by the Hypervisor layer above. We have tested the same abstraction layers on these different devices to show the consistency of our higher levels of abstraction. Two examples of Galapagos setups can be seen in Figure 9.
Fig. 9. Two examples of “data centers” where we deployed Galapagos.
A.3.3 Hypervisor.
This layer of the stack abstracts the physical devices and standardizes their interfaces so that they appear as in Figure 8. Our standard model assumes a control path and a data path. The control path is used for configuring, programming, and monitoring the devices. In our FPGAs, we typically use PCIe for FPGAs that do not have a tightly coupled ARM (Alphadata 7v3), or AXI for FPGAs with a tightly coupled ARM (Fidus Sidewinder, Pynq ZC702). The data path is used by the application IP cores to communicate off-chip with other nodes within the cluster. For the hypervisor to comply with the rest of the layers of the stack, it has to provide an AXI-stream interface. This standardization makes it simple for a user to add their own board to the stack: all they need to do is provide an AXI-stream interface that can connect to a network switch. An example FPGA hypervisor is shown in Figure 10.
Fig. 10. An example Galapagos FPGA Hypervisor.
A.3.4 Middleware and Communication Layers.
The middleware is responsible for partitioning the kernels onto different FPGAs. This was previously done through a user-specified configuration that provides a hint to our middleware layer about where to place kernels; however, we now automate this partitioning, as described in Section 4.1.3. Once the placement of kernels is known, the middleware places bridges to allow kernels to communicate off-chip. The hypervisor guarantees an AXI-stream interface, but without a side channel available for the destination field. With a Galapagos router and bridge, we take AXI-stream packets destined for off-chip and append the destination as a header. The Galapagos router has a routing table that specifies the location of all kernels in the cluster by destination. Furthermore, the off-chip communication can be done over various network communication protocols, handled by the communication layer. Depending on the destination FPGA, our network bridge encapsulates the packet with the correct network header. The network bridge is specific to each off-chip communication protocol the user wishes to support. If users wish to implement Galapagos on top of their own network protocol, they need to supply a bridge that can translate their network packets into AXI-stream packets with a Galapagos header. The formation of the Galapagos router, routing table, and network bridges is fully automated. The IP cores generated by the middleware are shown in Figure 11.
Fig. 11. Automated Middleware IP cores in Galapagos.
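The following HLS-style C++ sketch illustrates the transmit half of this bridging idea (our own simplification; the real Galapagos header format, widths, and bridge structure differ): the destination carried in TDEST on-chip becomes an explicit header flit so it can travel over a protocol that has no AXI-stream side channels, and a routing-table lookup (omitted here) would map that kernel ID to a network address:

```cpp
#include "ap_axi_sdata.h"
#include "ap_int.h"
#include "hls_stream.h"

typedef ap_axiu<64, 1, 1, 16> galapagos_flit;

void network_bridge_tx(hls::stream<galapagos_flit> &from_router,
                       hls::stream<galapagos_flit> &to_network) {
#pragma HLS INTERFACE axis port=from_router
#pragma HLS INTERFACE axis port=to_network
#pragma HLS INTERFACE ap_ctrl_none port=return

    galapagos_flit flit = from_router.read();

    // Header flit: move the destination kernel ID into the data payload so the
    // receiving bridge can parse it and restore TDEST on the far side.
    galapagos_flit header = flit;
    header.data = (ap_uint<64>)flit.dest;  // assumed header layout
    header.last = 0;
    to_network.write(header);

    // Forward the payload flits unchanged until the end of the packet.
    while (true) {
#pragma HLS PIPELINE II=1
        to_network.write(flit);
        if (flit.last) break;
        flit = from_router.read();
    }
}
```

A matching receive-side bridge performs the inverse operation, stripping the header and restoring the destination into TDEST before handing the packet to the on-chip router.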
Footnotes
1 We report only tools publicly available on GitHub and with a high user rating (Star Metric).
2 In HLS, the initiation interval specifies the number of clock cycles between the introduction of new inputs in a pipeline.
3 In Microsoft terminology, this layer is called the shell [1].
4 A flit is the amount of data transferred in one clock cycle on an AXI-stream interface.
5 We could not get access to enough boards.
References
- [1] 2016. A cloud-scale acceleration architecture. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-49). IEEE, Los Alamitos, CA, Article 7, 13 pages.
- [2] 2018. Serving DNNs in real time at datacenter scale with Project Brainwave. IEEE Micro 38, 2 (March 2018), 8–20. https://www.microsoft.com/en-us/research/publication/serving-dnns-real-time-datacenter-scale-project-brainwave/
- [3] 2018. A configurable cloud-scale DNN processor for real-time AI. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA'18). IEEE, Los Alamitos, CA, 1–14. DOI: http://dx.doi.org/10.1109/ISCA.2018.00012
- [4] 2018. Fast inference of deep neural networks in FPGAs for particle physics. Journal of Instrumentation 13, 7 (July 2018), 305.
- [5] 2017. Enabling flexible network FPGA clusters in a heterogeneous cloud data center. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, New York, NY, 237–246.
- [6] 2018. Galapagos: A full stack approach to FPGA integration in the cloud. IEEE Micro 38, 6 (2018), 18–24.
- [7] Xilinx. n.d. Xilinx Vitis AI. Retrieved April 10, 2021 from https://www.xilinx.com/products/design-tools/vitis/vitis-ai.html
- [8] Intel. n.d. OpenVINO. Retrieved April 10, 2021 from https://www.intel.com/content/www/us/en/artificial-intelligence/programmable/solutions.html
- [9] 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
- [10] 2002. Torch: A Modular Machine Learning Software Library. Technical Report. Idiap.
- [11] 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia. ACM, New York, NY, 675–678.
- [12] TensorFlow. n.d. Distributed Training with TensorFlow. Retrieved April 8, 2021 from https://www.tensorflow.org/guide/distributed_training/
- [13] Google. n.d. Cloud Tensor Processing Units (TPUs). Retrieved April 8, 2021 from https://cloud.google.com/tpu/docs/tpus/
- [14] 2020. NVIDIA NVLink Fabric: Advanced Multi-GPU Processing. Retrieved November 3, 2021 from https://www.nvidia.com/en-us/data-center/nvlink
- [15] NVIDIA. n.d. NVIDIA NCCL. Retrieved April 8, 2021 from https://developer.nvidia.com/nccl/
- [16] NVIDIA. n.d. NVIDIA Completes Acquisition of Mellanox, Creating Major Force Driving Next-Gen Data Centers. Retrieved April 10, 2021 from https://nvidianews.nvidia.com/news/nvidia-completes-acquisition-of-mellanox-creating-major-force-driving-next-gen-data-centers
- [17] 2019. Machine Learning (ML) Suite. Retrieved November 3, 2021 from https://github.com/Xilinx/ml-suite
- [18] 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'18). IEEE, Los Alamitos, CA, 411–4117.
- [19] 2020. SDAccel Development Environment. Retrieved November 3, 2021 from https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html
- [20] 2020. Intel FPGA SDK for OpenCL Software Technology. Retrieved November 3, 2021 from https://www.intel.com/content/www/us/en/software/programmable/sdk-for-opencl/overview.html
- [21] 2017. Exploration of OpenCL for FPGAs using SDAccel and comparison to GPUs and multicore CPUs. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'17). IEEE, Los Alamitos, CA, 1–4.
- [22] 2019. Scalable low-latency persistent neural machine translation on CPU server with multiple FPGAs. In Proceedings of the 2019 International Conference on Field-Programmable Technology (ICFPT'19). 307–310.
- [23] GitHub. n.d. Open Programmable Acceleration Engine. Retrieved November 3, 2021 from https://opae.github.io/
- [24] 2018. CHaiDNN. Retrieved November 3, 2021 from https://github.com/Xilinx/CHaiDNN
- [25] 2019. PYNQ DL. Retrieved November 3, 2021 from https://github.com/Xilinx/PYNQ-DL
- [26] 2017. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, New York, NY, 65–74.
- [27] 2019. An OpenCL-based FPGA Accelerator for Convolutional Neural Networks. Retrieved November 3, 2021 from https://github.com/doonny/PipeCNN
- [28] 2019. Open-Source High-Level Synthesis IP Libraries. Retrieved November 3, 2021 from https://hlslibs.org
- [29] 2020. Memory-efficient dataflow inference for deep CNNs on FPGA. CoRR abs/2011.07317 (2020). arXiv:2011.07317. https://arxiv.org/abs/2011.07317
- [30] 2020. From TensorFlow graphs to LUTs and wires: Automated sparse and physically aware CNN hardware generation. In Proceedings of the 2020 International Conference on Field-Programmable Technology (ICFPT'20). 56–65. DOI: http://dx.doi.org/10.1109/ICFPT51103.2020.00017
- [31] 2020. Keras: The Python Deep Learning Library. Retrieved November 3, 2021 from https://keras.io
- [32] 2017. Automatic differentiation in PyTorch. In Proceedings of the Conference on Neural Information Processing Systems (NIPS'17).
- [33] 2019. ONNX: Open Neural Network Exchange. Retrieved November 3, 2021 from https://github.com/onnx/onnx
- [34] 2020. Automatic deep heterogeneous quantization of deep neural networks for ultra low-area, low-latency inference on the edge at particle colliders. arXiv:2006.10159 (2020).
- [35] 2015. A survey and evaluation of FPGA high-level synthesis tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 10 (2015), 1591–1604.
- [36] 2020. Vivado Design Suite. Retrieved November 3, 2021 from https://www.xilinx.com/products/design-tools/vivado.html
- [37] 2021. hls4ml: An open-source codesign workflow to empower scientific low-power machine learning devices. In Proceedings of the tinyML Research Symposium 2021. arXiv:2103.05579
- [38] 2021. Fast convolutional neural networks on FPGAs with hls4ml. arXiv:2101.05108 (January 2021).
- [39] 2020. Accelerated charged particle tracking with graph neural networks on FPGAs. In Proceedings of the 34th Conference on Neural Information Processing Systems. arXiv:2012.01563
- [40] 2020. Fast inference of boosted decision trees in FPGAs for particle physics. Journal of Instrumentation 15, 5 (2020), P05026. DOI: http://dx.doi.org/10.1088/1748-0221/15/05/P05026. arXiv:2002.02534
- [41] 2020. Distance-weighted graph neural networks on FPGAs for real-time particle reconstruction in high energy physics. Frontiers in Big Data 3 (2020), 598927. DOI: http://dx.doi.org/10.3389/fdata.2020.598927. arXiv:2008.03601
- [42] 2018. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Transactions on Reconfigurable Technology and Systems 11, 3 (2018), Article 16, 23 pages. arXiv:1809.04570
- [43] ARM. 2010. AMBA 4 AXI4-Stream Protocol. Retrieved April 19, 2020 from https://static.docs.arm.com/ihi0051/a/IHI0051A_amba4_axi4_stream_v1_0_protocol_spec.pdf
- [44] 1998. MPI—The Complete Reference: The MPI Core. Vol. 1. MIT Press, Cambridge, MA.
- [45] 2019. libGalapagos: A software environment for prototyping and creating heterogeneous FPGA and CPU applications. In Proceedings of the 6th International Workshop on FPGAs for Software Programmers. 1–7.
- [46] 2019. GULF-Stream. Retrieved January 13, 2020 from https://github.com/QianfengClarkShen/GULF-Stream
- [47] 2015. Scalable 10Gbps TCP/IP stack architecture for reconfigurable hardware. In Proceedings of the 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM'15). IEEE, Los Alamitos, CA, 36–43.
- [48] Intel. 2019. Intel 82599 10 GbE Controller Datasheet. Intel.
- [49] 2019. Fidus Sidewinder. Retrieved January 13, 2020 from https://fidus.com/products/sidewinder/
- [50] 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
- [51] 2019. FPGA-accelerated machine learning inference as a service for particle physics computing. arXiv preprint arXiv:1904.08986 (2019).
- [52] 2021. NVIDIA Data Center Deep Learning Product Performance. Retrieved June 17, 2021 from https://developer.nvidia.com/deep-learning-performance-training-inference
- [53] 2021. Amazon EC2 F1 Instances. Retrieved January 19, 2021 from https://aws.amazon.com/ec2/instance-types/f1/
- [54] 2020. Alveo U200 and U250 Data Center Accelerator Cards Data Sheet. Retrieved January 19, 2021 from https://www.xilinx.com/support/documentation/data_sheets/ds962-u200-u250.pdf
- [55] 2021. Distance-weighted graph neural networks on FPGAs for real-time particle reconstruction in high energy physics. Frontiers in Big Data 3 (2021), 598927. arXiv:2008.03601
- [56] 2020. Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml. Machine Learning: Science and Technology 2, 1 (Dec. 2020), 015001. DOI: http://dx.doi.org/10.1088/2632-2153/aba042
- [57] 2017. Deep Learning with INT8 on Xilinx Devices. Retrieved November 3, 2021 from https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf
- [58] 2018. Fast reconstruction and data scouting. In Proceedings of the 4th International Workshop Connecting the Dots 2018. arXiv:1808.00902
- [59] 2020. Search for a narrow resonance lighter than 200 GeV decaying to a pair of muons in proton-proton collisions at \(\sqrt{s} =\) 13 TeV. Physical Review Letters 124, 13 (2020), 131802. DOI: http://dx.doi.org/10.1103/PhysRevLett.124.131802. arXiv:1912.04776
- [60] 2015. Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). arXiv:1512.03385. http://arxiv.org/abs/1512.03385
- [61] 2019. FPGA-accelerated machine learning inference as a service for particle physics computing. Computing and Software for Big Science 3, 1 (2019), 13. DOI: http://dx.doi.org/10.1007/s41781-019-0027-2. arXiv:1904.08986
- [62] GitHub. n.d. Xilinx Vitis Network Example. Retrieved November 4, 2021 from https://github.com/Xilinx/xup_vitis_network_example
- [63] GitHub. n.d. Xilinx Vitis with 100G TCP/IP. Retrieved November 4, 2021 from https://github.com/fpgasystems/Vitis_with_100Gbps_TCP-IP