research-article
Open Access

A High-Throughput, Resource-Efficient Implementation of the RoCEv2 Remote DMA Protocol and its Application

Published: 22 December 2022


Abstract

The use of application-specific accelerators in data centers has been the state of the art for at least a decade, starting with the availability of General Purpose GPUs that achieve higher performance overall or per watt. In most cases, these accelerators are coupled to their hosts via PCIe interfaces, which leads to disadvantages in interoperability, scalability and power consumption. As a viable alternative to PCIe-attached FPGA accelerators, this paper proposes standalone FPGAs as Network-attached Accelerators (NAAs). To enable reliable communication for decoupled FPGAs, we present an RDMA over Converged Ethernet v2 (RoCEv2) communication stack for high-speed, low-latency data transfer integrated into a hardware framework.

For NAAs to be used instead of PCIe-coupled FPGAs, the framework must provide similar throughput and latency with low resource usage. We show that our RoCEv2 stack is capable of achieving 100 Gb/s throughput with latencies of less than 4 μs while using about 10% of the available resources on a mid-range FPGA. To evaluate the energy efficiency of our NAA architecture, we built a demonstrator with 8 NAAs for machine-learning-based image classification. Based on our measurements, network-attached FPGAs are a compelling alternative to the more energy-demanding PCIe-attached FPGA accelerators.


1 INTRODUCTION

In various research projects as well as in commercial data center installations, e.g., [13, 15], it has been shown that Field Programmable Gate Arrays (FPGAs) can be effectively used as a potential alternative to Graphics Processing Units (GPUs) as compute accelerators. Besides typically providing ideally suited accelerators for a given compute problem, FPGAs have the advantage of significantly lower power consumption than their GPU counterparts. In most cases, these FPGA-based accelerators are currently integrated into server-grade compute nodes using Peripheral Component Interconnect Express (PCIe) as the main communication interface. Besides these tightly coupled, co-processor-like architectures, a number of architectures exist in which a network-based communication interface is used in addition to the available PCIe interface, e.g., for cross-accelerator communication and data transfer [7].

This is typically due to the nature of the various accelerator types and architectures, which differ in their communication bandwidth requirements. Various types of offload accelerators do not require a high communication data rate if the bandwidth is needed neither for input or output data nor for control-flow communication. For example, Machine Learning (ML) inference accelerators, which we show in this paper as an application for our accelerator concept, are profoundly compute-bound. This processing characteristic allows largely autonomous processing nodes/offload engines that benefit from a loosely coupled communication interface.

Therefore, the next step towards raising the degree of freedom for integrating FPGA-based accelerators in data centers is Network-attached Accelerators (NAAs), which can be completely decoupled from host computers. This offers the opportunity to use the available network infrastructure as the communication backbone instead of PCIe, e.g., [5]. PCIe coupling also limits the number of FPGAs per host; this limit disappears with NAAs. The NAA architecture therefore consumes less energy overall because it requires fewer hosts. In this architectural decomposition, both the high-bandwidth data transfer and the usual accelerator control commands need to be carried out over the network interface. High-bandwidth data transfer should be implemented as one-sided Remote Direct Memory Access (RDMA) between compute nodes and the NAA, as this allows the highest transfer speeds.

In addition to the Infiniband [11] network, which supports RDMA and is often deployed in data centers, Ethernet has become a viable alternative for integrating NAAs into data center infrastructures. This development has become possible because Ethernet is more widespread and, with the IEEE 802.3bs standard [1], also supports up to 400 Gb/s. However, Ethernet only covers layers 1 and 2 of the Open Systems Interconnection (OSI) reference model, whereas Infiniband covers layers 1–4. To make the more advanced capabilities of the Infiniband Architecture Specification, such as RDMA, reliable connection services and data flow control, available for Ethernet networks as well, RDMA over Converged Ethernet (RoCE) was developed [9]. Later, an appendix [10] was published that implements RoCE over UDP/IP to make it routable; it was named RDMA over Converged Ethernet v2 (RoCEv2) or Routable RoCE (RRoCE). RoCEv2 features basic connection establishment, single-message transmission and, of course, the actual one-sided RDMA READ and WRITE operations on memory locations. To achieve a high degree of interoperability, it is desirable to use a standardized protocol like RoCEv2.
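RoCEv2 places the Infiniband transport headers directly inside a UDP datagram (IANA destination port 4791), starting with the 12-byte Base Transport Header (BTH). A minimal packing sketch in Python, with illustrative field values (opcode 0x0A is the RC RDMA WRITE Only opcode from the Infiniband specification):

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def pack_bth(opcode, dest_qp, psn, ack_req=True, pad_cnt=0, p_key=0xFFFF):
    """Pack the 12-byte Infiniband Base Transport Header (big-endian):
    opcode, flags (SE/M/PadCnt/TVer), P_Key, reserved, 24-bit dest QP,
    ack-request bit, 24-bit Packet Sequence Number."""
    flags = (pad_cnt & 0x3) << 4                     # SE=0, M=0, TVer=0
    bth = struct.pack(">BBH", opcode, flags, p_key)
    bth += b"\x00" + (dest_qp & 0xFFFFFF).to_bytes(3, "big")
    bth += bytes([0x80 if ack_req else 0]) + (psn & 0xFFFFFF).to_bytes(3, "big")
    return bth

# RC RDMA WRITE Only packet header for QP 0x12, PSN 0x345678
header = pack_bth(0x0A, dest_qp=0x12, psn=0x345678)
```

On the wire, this header follows the Ethernet/IP/UDP headers and is itself followed by the RDMA Extended Transport Header, the payload, and the ICRC.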

To meet the required flexibility for network-attached FPGAs, we developed a resource-efficient RoCEv2 stack integrated into a hardware framework. This framework will provide infrastructure on the FPGA to be used by any accelerator. Such an accelerator framework for standalone FPGAs as NAA must meet the following requirements:

(R1)

The accelerators on an FPGA must be interchangeable at runtime. In addition, different applications should be running simultaneously.

(R2)

All basic functions are provided in an abstract and resource-efficient form ensuring easy handling while keeping large parts of resources for the application.

(R3)

An interoperable, standardized network connection should be used for communication. This ensures frictionless communication with other compute nodes. It also allows combining several accelerators into any network topology (1-to-1, 1-to-N or N-to-1) via commercially available managed switches.

(R4)

The network interface should have a high data throughput and low latency to provide the accelerator with suitable amounts of data.

(R5)

Remote DMA should be used to exchange large data blocks. The FPGA should be able to both receive and initiate transfers. Reliable connections with detection of transmission errors and retransmission of erroneous or missing data are necessary to prevent data loss that could cause problems in the application.

(R6)

It must be possible to manage the FPGA via network.

(R7)

For easy administration the Internet Protocol (IP) addresses should be obtained automatically.

Our main contributions are threefold. First, we present a vendor-independent 100 Gb/s capable RoCEv2 stack. Second, we show the integration of the RoCEv2 stack into a hardware framework and its exemplary use by an ML accelerator. Third, we demonstrate the energy efficiency and scalability of the NAA approach in an ML use case. The last two contributions are extensions to our conference paper [24]. The underlying UDP/IP network stack is not a focal point of this work.

The structure of the paper is the following: In Section 2 we give an overview of the related work regarding FPGA-specific RDMA architectures and network-attached FPGA frameworks. Section 3 provides a detailed overview of our architecture of the overall framework. In Section 4 we present our RoCEv2 architecture, while Section 5 evaluates the RoCEv2 stack in terms of throughput and latency and compares it to related work. Section 6 demonstrates the utility and advantages of our NAA architecture with RoCEv2 compared to other accelerator technologies by the use case of an ML accelerator. In Section 7 we conclude our paper and give an overview of our planned activities in the field of network-based accelerator architectures.


Fig. 1. Comparison of a software network stack vs. an offloading engine and our hardware-based communication stack including RoCEv2, where software components are shown in yellow and hardware in blue.


2 RELATED WORK

As depicted in Figure 1, a network stack implementation consists of hardware and software components across the OSI layers. The most common case is a nearly complete software implementation in which only layers 1 and 2 are realized in hardware. To raise the throughput of host network interface adapters, offload engines (UDP/TCP Offloading Engines, UOE/TOE) are commonly used, where most of the network-related processing up to the application layer is performed in hardware. In our framework, we opted to implement all components as an integrated hardware design, as our goal is to physically decouple the FPGA from any host computer.

There are several 100 Gb/s FPGA implementations for both TCP/IP [21] and UDP/IP [19, 29]. However, these all share the disadvantage of not providing RDMA, which can dramatically increase data throughput, especially between servers and FPGAs, through zero-copy transfers without CPU involvement.

Table 1. Capability-aware Resource Utilization Comparison of Related Work

| Name | Protocol | Throughput | FPGA | LUTs | FFs | 36K BRAM |
|------|----------|------------|------|------|-----|----------|
| StRoM [8, 25] | RoCEv2 | 100 Gb/s | XCVU9P | 122k (10%) | 214k (9%) | 402 (19%) |
| ETRNIC [33] | Asymmetric RoCEv2 | 100 Gb/s | XCZU19 | 71k (14%) | 44k (4.2%) | 116 (12%) |
| ERNIC [34] | RoCEv2 | 100 Gb/s | XCZU19 | 53k (10%) | 47k (4.5%) | 142 (15%) |
| MANSOUR [16] | proprietary RDMA | 100 Gb/s | XCKU5P | ~13560 CLBs (50%)¹ | — | ~101 (21%) |
| PROP [35] | proprietary PCIe-based RDMA | 26 Gb/s | XC7VX690T | X² | X | X |
| Weerasinghe [31] | proprietary UDP/IP to DDR | 10 Gb/s | XC7VX690T | 62k (14%) | 59k (7%) | X |

¹ Only CLBs in % reported, approximate number calculated. ² X means not reported.

Table 1 shows a comparison of the different related works in terms of protocols employed, maximum throughput and design resource usage. Table 2 in Section 5.4 presents the resource utilization for our RoCEv2 implementation.

In PROP [35], the authors present a concept for PCIe-based RDMA communication between multiple servers via a central unit in a rack. Their results show a peak performance of around 26 Gb/s. PCIe communication for the connection between the server and the central unit has scalability and interoperability disadvantages compared to Ethernet-based communication. In addition, the bandwidths achieved are not competitive.

Mansour et al. [16] published a data acquisition system based on 100 Gb/s RDMA. They proposed a new RDMA protocol over UDP to decrease FPGA resource usage on an XCVU9P Virtex UltraScale+ [32]. The protocol shows a higher data throughput compared to RoCEv2. This is due to less header overhead in the proposed protocol as well as the omission of the challenging implementation of the Invariant Cyclic Redundancy Check (ICRC) in RoCEv2. A clear disadvantage of a proprietary protocol is the lack of interoperability.

Xilinx provides two different 100 Gb/s RoCEv2 implementations: Embedded RDMA enabled NIC (ERNIC) [34] and Embedded Target RDMA enabled NIC (ETRNIC) [33]. Since both are published by Xilinx, they are only available for Xilinx FPGAs, a classic vendor lock-in. Both support multiple queue pairs. ERNIC is a full-featured RDMA implementation for Reliable Connection (RC) transport services. In contrast, ETRNIC is a limited version that supports RDMA READ and WRITE operations but not incoming RDMA operations, which means an FPGA cannot be accessed via RDMA by another compute node. Table 1 shows the resource consumption of both cores with 128 Queue Pairs (QPs). The numbers include only ERNIC or ETRNIC, not the necessary Ethernet MAC or the memory interface. As depicted, ETRNIC requires fewer resources than ERNIC, with the exception of LUTs.

StRoM [25] is a programmable, smart network card that can perform stream-based operations on RDMA data in addition to pure RoCEv2 data transfers. By extending the RoCEv2 protocol, the network card can be used as a local or remote offload engine. Their 100GbE RoCEv2 implementation shows promising latency and throughput results. It supports both RDMA WRITE and READ as well as multiple connections, as 500 QPs are available for use. These extended capabilities come at the cost of an increased memory requirement. The IP cores used were developed using high-level synthesis (HLS) targeting Xilinx FPGAs and published in [8]. Due to the necessity of using the Xilinx toolchain, there is a vendor dependency, in contrast to our vendor-independent HDL-based implementation. Compared to ERNIC, it requires more than twice as many LUTs, 4 times as many FFs, and 2.8 times the number of Block Random Access Memories (BRAMs), despite its smaller feature set. The Queue Pair Numbers (QPNs) of the remote side required for communication via RoCEv2 must be exchanged through a separate channel. The standard requires a connection setup with Management Datagram (MAD) headers for this purpose. Only the Unreliable Datagram (UD) mode does not need this, but according to the standard this mode does not support RDMA transfers. It remains unclear how an RDMA transfer with servers via commercial RoCEv2 network cards should work without a proper connection establishment; this seems to be an obstacle to the required interoperability with other RoCEv2 endpoints. The comparison of the latencies for RDMA READ and WRITE shows a slight advantage for our implementation, although we only measured them at 40 Gb/s and included a switch in the measurement.

Weerasinghe et al. [31] compared the network performance of disaggregated FPGAs against bare-metal servers, virtual machines and Linux containers, concluding that the FPGA has better round-trip times and throughput. They also provided comparisons between UDP, TCP and RDMA, even though it seems they only tested UDP and a Xilinx TCP/IP core on the FPGAs. UDP communication between the FPGAs performs well, which is not surprising due to the lower header overhead. However, it offers only unreliable RDMA access between FPGAs, because the UDP/IP payload is stored directly in memory without any flow control as in RoCEv2. Additionally, their throughput is currently limited to 10GbE hardware, with a planned upgrade to 40GbE or 100GbE. The concept of their disaggregated FPGAs is very similar to our NAA architecture, and their results are a strong reason to investigate the NAA concept further. However, they have not implemented RDMA communication, which complicates data exchange between FPGAs and servers. Despite these limitations, their hardware framework takes a similar percentage of board logic resources as our proposed implementation.

In [20] this concept of network-attached FPGAs was further pursued. A resource management system, which communicates with the FPGAs via HTTP according to the RESTful principle, is presented. Data exchange with the FPGA is done via TCP/UDP/IP based on a 10 GbE interface. In contrast to our NAA architecture, the advantages of one-sided RDMA communication are not used.

In contrast, in [30] tightly coupled FPGAs with network connections are employed as accelerators. The authors present a comprehensive hardware and software framework with communication based on the Message Passing Interface (MPI). PCIe or the Advanced eXtensible Interface (AXI) bus is used for control communication, while the network interface is selected for data communication. RDMA is not used for large transfers, despite its advantages. The network interface reaches a bandwidth of 58.4 Gb/s with TCP/IP, which is a poor utilization of a 100 GbE interface. Due to the tight coupling for the control data, the advantages of loosely coupling the data path via the network are not fully realized.

In [7], the authors describe the use of FPGAs as accelerators at a very large scale. Here, the FPGAs are tightly coupled via PCIe and all network traffic to the network card of the host server is routed through the FPGA via 40 GbE. The FPGA is mainly intended to accelerate networking tasks such as encryption. In addition, the FPGAs can be leveraged as distributed accelerators for remote servers over the network. The proprietary Lightweight Transport Layer protocol is used for communication. It provides reliable connections and shows promising latency results. RDMA communication is not implemented and the use of a proprietary protocol is a major barrier for interoperability. In contrast to our concept, the network-coupled FPGA is only treated as a second-class device.

In summary, our NAA architecture has advantages over comparable concepts by implementing the RoCEv2 protocol with reliable and fast one-sided RDMA communication, also in terms of interoperability. Even considered on its own, the RoCEv2 stack performs well. Only ERNIC [34] is comparable, but it has the clear disadvantage of vendor dependency.


3 FRAMEWORK ARCHITECTURE

The usage of standalone FPGA accelerators is hindered by their unfamiliar and complicated handling, despite their advantages in throughput, latency and energy efficiency. To overcome these difficulties, we propose an easy-to-use and flexible hardware framework with a corresponding software infrastructure. A more detailed description of the framework, especially the software framework, is given in [27].

3.1 Hardware Framework

Our hardware framework, called Accelerator Framework and shown in Figure 2, serves as a hardware abstraction layer to ensure easy usage of basic functions (R2) like external memory access, network communication and management. Besides the static framework, each FPGA can implement a parameterizable number of reconfigurable accelerators (R1) embedded in an Accelerator Socket with a specified interface. Each socket can be used in parallel with shared external memory access and network communication (R1). Each socket interprets the incoming data on its own which supports a variety of applications. The main framework components are:


Fig. 2. Example overview of proposed Accelerator Framework with Command Control Interface (CCI) in red, network data bus in dotted blue and memory data bus in yellow.

Management.

To manage the framework and the accelerators an AXI Lite [6] management bus called Command Control Interface (CCI) is used. Each component provides a number of predefined registers that can be extended depending on the application. In addition, the framework provides further status information such as a unique device ID.

External Memory.

The different sockets are connected to the external memory using an AXI [6] bus. The framework supports shared memory for data exchange or any number of independent memories per Accelerator Socket, configurable at design time. In Figure 2, an independent interconnect is used for each memory; since there are more sockets than memories, several sockets share an interconnect. The use of physically independent memories allows a strong separation between data from different accelerators. Exclusive sequential write and read accesses to the memory each achieve 16 GB/s via a 512-bit data bus clocked at 266.6 MHz. This is close to the theoretical maximum of 17.064 GB/s of the 64-bit memory used at 2133 MT/s; both the automatic refresh of the memory module and the AXI interconnect contribute to the losses.
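The quoted bandwidth figures follow from a short calculation: the 64-bit DDR interface at 2133 MT/s and the 512-bit AXI bus at 266.6 MHz have nearly identical peak rates, of which 16 GB/s (about 94%) is achieved:

```python
# Theoretical peak of the 64-bit DDR memory at 2133 MT/s, in bytes/s
ddr_peak = 64 / 8 * 2133e6
# Peak of the 512-bit AXI bus clocked at 266.6 MHz, in bytes/s
axi_peak = 512 / 8 * 266.6e6

print(f"DDR peak: {ddr_peak / 1e9:.3f} GB/s")              # 17.064 GB/s
print(f"AXI peak: {axi_peak / 1e9:.3f} GB/s")
print(f"16 GB/s achieved = {16e9 / ddr_peak:.1%} of DDR peak")
```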

Network Architecture.

The network stack provides a high-speed, low-latency communication interface (R4) to other devices via a standardized network interface (R3).

Our network stack is split into two main components which are visualized in Figure 3.


Fig. 3. Overall Network Architecture. Red is a streaming interface, dashed blue is memory data bus and dotted yellow is CCI.

• UDP/IP. The UDP/IP stack provides User Datagram Protocol (UDP)/IP abstraction and is connected to a Medium Access Control (MAC) controller. It decodes incoming packets according to the different network layers. For transmitting, the payload is streamed in and wrapped in UDP, IP and Ethernet headers. The UDP/IP stack internally handles Internet Control Message Protocol (ICMP) ping packets, i.e., incoming ping requests are answered. It also features the Address Resolution Protocol (ARP) and the Dynamic Host Configuration Protocol (DHCP) for seamless integration into existing network infrastructures (R7). On the transmit side, all packet sources (i.e., application payload, CCI packets, ARP request/response, DHCP, and ICMP) are aggregated. All component connections use a wide streaming interface of 256 or 512 bits to ensure maximum performance with minimum resource consumption. On an Arria 10 GX FPGA [12], we achieve a clock rate of 230 MHz with a 256-bit bus width, thus meeting the bandwidth requirements of a 40 GbE interface.

On the application side, the stack is connected to the network data interconnect for applications to send and receive UDP payloads. UDP packets are forwarded to the application layer which can be the RoCEv2 stack. Directly coupled with the UDP/IP stack is a CCI Bridge that listens for specific packets containing CCI master commands that are forwarded to the CCI interconnect (R6).

• RoCEv2. As an RDMA protocol with reliable connections, RoCEv2 meets our requirement R5. It uses a streaming connection to the UDP/IP stack for receiving and sending RoCEv2 packets. The other side of the stack is connected to the external memory interconnect to write/read the RDMA payload data to/from the off-chip memory. The RoCEv2 architecture is described in detail in Section 4.

3.2 Software Framework

In addition to our hardware framework, we have developed a software framework for using NAAs in a heterogeneous data center environment. It consists of an Acceleration Server, a Resource Manager, a private container repository for accelerator IP cores and one or more Node Managers. Each NAA is managed by a Node Manager running on a central management server. The Node Manager reports the usable resources of each NAA to the Resource Manager. The Acceleration Server acts as an Accelerator-as-a-Service interface to access the NAAs through an Application Programming Interface (API). To provide a service, the Acceleration Server queries the Resource Manager for suitable free resources. The Resource Manager maintains the available resources and information about their capabilities. The Acceleration Server chooses the resources to use, reserves them at the Resource Manager and configures the corresponding Node Managers. The associated container with a suitable accelerator is launched from the repository; if necessary, the NAA is reconfigured. Then the Acceleration Server transfers the computation data to the NAA using RDMA. At the end of the computation, the results are transferred back to the Acceleration Server via RDMA, and the Acceleration Server releases the resource at the Resource Manager. The Resource Manager notifies the Node Manager, which stops the corresponding accelerator container. The Acceleration Server can be replaced by an application with integrated Acceleration Server functionality to minimize communication overhead.
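The allocation flow above can be sketched as pseudocode. All class, method and field names here are hypothetical illustrations of the described sequence, not the authors' actual API; the Node Manager and Resource Manager are reduced to minimal stand-ins:

```python
class NodeManager:
    """Stand-in for a Node Manager controlling one NAA (hypothetical API)."""
    def __init__(self):
        self.buf = None
    def configure(self, accel_type):
        # Launch the accelerator container; reconfigure the FPGA if needed.
        self.running = accel_type
    def rdma_write(self, data):
        self.buf = data                        # computation data via RDMA
    def rdma_read(self):
        return [x * 2 for x in self.buf]       # stand-in for the result

class ResourceManager:
    """Tracks free NAAs, reservations and releases (hypothetical API)."""
    def __init__(self, naas):
        self.free, self.reserved = list(naas), []
    def find_free(self, accel_type):
        return self.free[0]
    def reserve(self, naa):
        self.free.remove(naa)
        self.reserved.append(naa)
    def release(self, naa):
        self.reserved.remove(naa)
        self.free.append(naa)

def run_job(rm, accel_type, input_data):
    """Acceleration Server flow: query, reserve, configure, transfer, release."""
    naa = rm.find_free(accel_type)
    rm.reserve(naa)
    try:
        naa.configure(accel_type)
        naa.rdma_write(input_data)   # input to the NAA via RDMA
        return naa.rdma_read()       # results back via RDMA
    finally:
        rm.release(naa)              # Node Manager stops the container
```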


4 ROCEV2 ARCHITECTURE

4.1 Architectural Requirements

Our goal is to create a RoCEv2 implementation that meets high throughput requirements of at least 40 Gb/s. High-throughput processing requires both a fast clock and a large data bus width. To obtain a realistic design frequency, we have chosen a 256-bit bus width for 40 Gb/s, yielding a minimum frequency of 40 Gb/s ÷ 256 bit = 156.25 MHz. To have enough dead cycles for setup and control flow, we chose 230 MHz as the target frequency. The clock speeds of current FPGAs are realistically about 200 MHz to 350 MHz when using a large bus width; the selected design clock is therefore at the lower end of the realistic range, which allows us to reach timing closure quickly. However, for some applications even higher throughput is needed. For this reason, we designed our network stack with 100GbE hardware in mind and added a parameterizable option to use a 512-bit bus, which enables us to reach 100 Gb/s while still targeting 230 MHz.
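The bus-width and clock-rate figures can be reproduced with a short calculation; 512 bits at 230 MHz gives 117.76 Gb/s of raw bus bandwidth, comfortably above the 100 Gb/s line rate:

```python
def min_freq_mhz(line_rate_gbps, bus_width_bits):
    """Minimum clock (MHz) needed to sustain a line rate on a given bus width."""
    return line_rate_gbps * 1e9 / bus_width_bits / 1e6

def raw_bandwidth_gbps(freq_mhz, bus_width_bits):
    """Raw bus bandwidth (Gb/s) at a given clock and bus width."""
    return freq_mhz * 1e6 * bus_width_bits / 1e9

print(min_freq_mhz(40, 256))          # 156.25 MHz minimum for 40 Gb/s
print(min_freq_mhz(100, 512))         # 195.3125 MHz minimum for 100 Gb/s
print(raw_bandwidth_gbps(230, 512))   # 117.76 Gb/s at the 230 MHz target
```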

An important aspect to consider when designing hardware with large bus widths is the choice of interfaces. Especially in network communication, backpressure is unavoidable, but it is unfavorable if the whole bus needs to be stored in block memory for the backpressure to work. Therefore, we decided to use backpressure-capable streaming interfaces wherever possible, which, unlike FIFOs for example, may utilize more logic in favor of less memory consumption. Only for the clock-domain crossing of the external memory bus system is a FIFO used, which simultaneously provides the backpressure.

A major concern with the large bus width is the ICRC calculation of the RoCEv2 protocol. The ICRC is usually calculated bitwise or with an 8-bit lookup table for byte-wise processing, but we process 32 or 64 bytes in one cycle. This makes a lookup table infeasible: with an 8-bit lookup table, the calculation of one word would take 32 or 64 cycles, and a table indexed by 32 or 64 bytes would not fit in any existing memory.

4.2 Design Decisions

The Infiniband specification regarding RoCE is a complex combination of protocol definitions reaching from the physical, link, network and transport layers up to the software transport interface. As we are implementing RoCEv2, we do not handle the layers below the transport layer, and the Infiniband Global Routing Header (GRH) is replaced with the UDP/IP headers. Additionally, the software transport interface specified by Infiniband is designed for use in processor-driven environments and does not quite meet our needs in a hardware implementation of basic RDMA transfer. Therefore, we decided to adapt our implementation in some respects while preserving compatibility with existing RoCEv2 implementations, especially with Mellanox network adapters that implement RoCEv2:

  • No implementation of dynamic QPs; instead, one connection is handled at a time, which is sufficient for our use case but could be changed in the future if needed.

  • A simplified approach to memory registration on the FPGA which is better suited for our hardware framework environment.

  • No direct support for work requests except connection management and RDMA WRITE and READ operations because basic send and receive requests can be made directly over the UDP/IP stack by the accelerator.

  • The Infiniband RoCEv2 READ operation is not implemented; instead, RDMA READ and WRITE are both realized with RoCEv2 WRITE operations. An extension based on the RoCEv2 protocol serves as a pre-RDMA protocol for exchanging the necessary protocol parameters, as visualized in Figure 4 and described below. The drawback of this approach is that for a READ, the CPU of the remote host must be involved.

  • We do not support the optional RoCEv2 Congestion Management (RCM) based on IP Explicit Congestion Notification (ECN).

These adaptations allow us to have a standard-conformant and at the same time resource-efficient RDMA transfer.


Fig. 4. Sequence diagram of an RDMA WRITE followed by an RDMA READ transfer including connection establishment. The black packets are Infiniband specified packets and blue and yellow packets originate from our pre-RDMA protocol using Infiniband RC Send Only messages.

An example of what the communication looks like is shown in Figure 4. Before memory data can be transferred between nodes, a connection must be established. After connection establishment, either node can initiate a memory transfer, which is started with our pre-RDMA protocol. This self-designed protocol works on top of standardized Infiniband RC Send Only packets and exchanges the amount of data to be transferred and the memory region of the receiving node; both are needed for allocating and registering memory before the transfer starts. When writing data to a remote node, the size of the transfer is sent first, so the receiving node can allocate and register memory; it replies with the memory region the data should be sent to. As these packets are Infiniband RC Send Only messages, they are acknowledged with an Infiniband Acknowledgment (ACK). For an RDMA READ, the memory is allocated and registered, and the resulting memory region together with the transfer size is sent to the other node. After the pre-RDMA messages, the actual RDMA data is sent with a RoCEv2 WRITE. It is important to note that this approach allows us to use off-the-shelf network cards to transfer memory via RoCEv2 to an FPGA, or to a CPU with our user-space RDMA software.
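The pre-RDMA exchange for an RDMA WRITE can be summarized as a request/grant message sequence. The message and field names below are illustrative stand-ins, not the actual wire format of the authors' protocol:

```python
from dataclasses import dataclass

@dataclass
class WriteRequest:
    """Carried in an IB RC Send Only packet: how much the initiator will write."""
    length: int                    # bytes to be transferred

@dataclass
class WriteGrant:
    """Reply (also RC Send Only): where the data should be written."""
    remote_addr: int               # start of the registered memory region
    rkey: int                      # remote key for that region

def write_handshake(length, allocate):
    """Model the pre-RDMA WRITE handshake: the receiver allocates and
    registers memory of the requested size and grants the region back."""
    req = WriteRequest(length=length)
    addr, rkey = allocate(req.length)      # receiver-side allocation
    return WriteGrant(remote_addr=addr, rkey=rkey)
```

Both messages are acknowledged with Infiniband ACKs; only after the grant does the initiator issue the actual RoCEv2 WRITE packets into the granted region.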

4.3 Implementation

Any RoCEv2 packet consists of at least the Base Transport Header (BTH), which contains a Packet Sequence Number (PSN) to uniquely identify each packet, a QP number to identify the connection endpoint, an acknowledgment request bit and the packet type. Our implementation is capable of sending and receiving the following RoCEv2 packet types:

  • Communication Management (CM) packets for connection establishment.

  • RoCEv2 RC Send Only packets for the pre-RDMA protocol.

  • RoCEv2 WRITE packets for the actual memory transfer.

  • ACK packets sent in positive response to a packet with an active acknowledgment request.

  • NAK (negative ACK) packets to signal unsupported accesses (e.g., in datagram mode), other remote errors, or packet losses indicated by an incorrect PSN.

Each packet ends with an ICRC to ensure integrity which is calculated the same way as the Ethernet Frame Check Sequence (FCS).

In the receiving data path the BTH is first extracted and evaluated. Then the packet payload is forwarded to and handled by the specific subtype parsers. If the packet contains RDMA data, it is written to the memory data bus. The transmitting data path builds the different types of packets with the corresponding meta data and RDMA data from the memory data bus and finally the ICRC is calculated and appended.

Besides the actual packet handling, the control logic of the RoCEv2 protocol is more complex than that of the stateless UDP protocol. As we are implementing RC communication, we need to handle acknowledgments, timeouts and negative acknowledgments. For this purpose, we included a special timeout unit that signals when, for example, the remote side did not acknowledge a packet in time with an ACK. A notable feature of RC is that lost packets can be retransmitted. This is easy for communication packets but more complicated when sending data from memory, because we need to read the data from memory again. Our data bus to the memory is AXI, whose bursts must not cross a 4 kB boundary, meaning that only 4 kB can be read with one command. However, if we issued the command just before the data is needed, it would induce a lot of latency and thereby reduce performance. On the other hand, if we issue many read requests and then receive a negative acknowledgment (NAK), we need to read the memory from an earlier address again; this means waiting for the completion of the already issued commands before the earlier addresses can be read. For this reason, we decided to build an entity with a configurable number of 4 kB read requests to issue at once, which limits unnecessary memory reads in case of a NAK while avoiding latencies from the memory data bus.
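The read-request entity can be modeled behaviorally as follows. This is a sketch of the scheme, not the HDL implementation; `max_outstanding` stands in for the configurable request count, and the 4 KiB split mirrors the AXI boundary rule:

```python
def split_4k(addr, length):
    """Yield (addr, len) bursts that never cross an AXI 4 KiB boundary."""
    end = addr + length
    while addr < end:
        boundary = (addr // 4096 + 1) * 4096   # next 4 KiB boundary
        chunk = min(end, boundary) - addr
        yield addr, chunk
        addr += chunk

def issue_reads(addr, length, max_outstanding=4):
    """Issue at most `max_outstanding` bursts at a time, so a NAK only
    invalidates a bounded number of already-issued memory reads while
    still hiding the memory read latency."""
    pending = []
    for burst in split_4k(addr, length):
        pending.append(burst)
        if len(pending) == max_outstanding:
            yield list(pending)                # one batch of in-flight reads
            pending.clear()
    if pending:
        yield list(pending)
```

A transfer starting at an unaligned address is split at the first boundary, e.g. `split_4k(0x0FF0, 0x30)` yields a 16-byte and a 32-byte burst.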

4.4 ICRC Calculation

As mentioned in Section 4.1, ICRC calculation is a major problem for high-throughput implementations with large bus widths. The algorithm used by the RoCEv2 protocol is the same 32-bit check sequence as the Ethernet FCS with the polynomial 0x04C11DB7. In the Infiniband protocol, the ICRC is calculated over the Global Routing Header (GRH) and the Local Routing Header (LRH), followed by the actual payload. For the ICRC calculation of RoCEv2 packets, the GRH is replaced with 8 bytes of 0xFF, and the LRH by the IP and UDP headers, where variant fields such as Time To Live are masked with 0xFF.

A CRC is computed serially, bit by bit, with each step depending on the running CRC state. Software implementations typically process one byte per step using a 256-entry lookup table, but a table indexed by 512 or 256 bits for our bus widths is clearly infeasible. Therefore, we expanded the bitwise calculation to 512 and 256 bits, which allows us to calculate the ICRC on a streaming interface without backpressure. The expanded, bitwise calculation results in one large XOR combination for each of the 32 result bits, with the 512 or 256 bus bits and the previous ICRC bits as inputs. This takes up some resources, but it is the most resource-efficient way to absorb a 512 or 256-bit word into the ICRC in one cycle. A remaining problem is that not all packets are aligned to 512 or 256 bits. However, all RoCEv2 packets are padded to a 4-byte alignment, which means we can process the last word in 32-bit chunks. This takes longer than one cycle, but since it only affects the last word of a packet, it does not add much delay. Furthermore, we optimized for the RoCEv2 RDMA Middle packet, which always has the same length and is predominant in larger transfers. With the 512-bit bus its last word always contains 40 valid bytes and with the 256-bit bus 8 bytes, so we also expanded the ICRC calculation for 320 and 64 bits, respectively.
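The principle the wide ICRC unit relies on, namely that the serial CRC recurrence can be unrolled to consume more bits per step without changing the result, can be illustrated in software. The sketch below uses the standard reflected software convention for the 0x04C11DB7 polynomial (0xEDB88320) and omits the RoCEv2-specific header masking:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Serial reference: one message bit absorbed per inner iteration.
uint32_t crc32_bitwise(const uint8_t* data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Unrolled variant: eight bits per step via a 256-entry table, as typical
// CPU implementations do. The hardware expansion applies the same idea to
// 256/512 bits per step, but as a fixed XOR network instead of a table.
uint32_t crc32_table(const uint8_t* data, std::size_t len) {
    static uint32_t tab[256];
    static bool init = false;
    if (!init) {
        for (uint32_t i = 0; i < 256; ++i) {
            uint32_t c = i;
            for (int b = 0; b < 8; ++b)
                c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
            tab[i] = c;
        }
        init = true;
    }
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i)
        crc = (crc >> 8) ^ tab[(crc ^ data[i]) & 0xFFu];
    return ~crc;
}
```

Both routines produce identical results for any input, which is what allows the step width to be chosen freely in hardware.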

Skip 5ROCEV2 EVALUATION Section

5 ROCEV2 EVALUATION

To fully evaluate our design, we tested it for reproducibility, maximum throughput and latency and compared the resource usage of our testbed design to other implementations.


Fig. 5. Test architecture. Yellow is control data and blue is network data.

Listing 6. User API.

5.1 RDMA Software

To test our RoCEv2 stack not only between two FPGAs but also in communication with servers, we implemented user-space Linux software based on linux-rdma [3] for our testbed and the system integration. A C++ wrapper based on the Singleton design pattern allows easy integration into user programs and ensures that only one connection is used at a time. Users essentially have four functions at their disposal, as can be seen in Listing 6.

The functions write() and read() offer a simple API to initiate an RDMA connection from the server to another host (server or FPGA). In contrast, the functions acceptIncomingWrite() and acceptIncomingRead() wait for incoming connections and then execute the callback functions. Due to our design decisions (see Section 4.2), each established connection is terminated immediately after a transaction is completed so that the QP is available for another connection. Furthermore, the software automatically generates the necessary RoCEv2 RC Send Only messages for our pre-RDMA protocol.
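A structural sketch of such a Singleton wrapper is shown below. The method signatures and parameter types are illustrative assumptions; the authoritative interface is Listing 6:

```cpp
#include <cstddef>

// Structural sketch of the Singleton C++ wrapper described above.
// The Singleton ensures that all user code shares one connection
// (and thus the single QP) at a time.
class RdmaConnection {
public:
    static RdmaConnection& instance() {
        static RdmaConnection inst;  // constructed once, thread-safe in C++11
        return inst;
    }
    RdmaConnection(const RdmaConnection&) = delete;
    RdmaConnection& operator=(const RdmaConnection&) = delete;

    // Initiator side: start an RDMA WRITE/READ to or from another host
    // (bodies omitted; the real implementation wraps linux-rdma verbs).
    bool write(const char* host, const void* buf, std::size_t len);
    bool read(const char* host, void* buf, std::size_t len);

    // Target side: wait for an incoming transfer, then run the callback.
    void acceptIncomingWrite(void (*callback)(const void*, std::size_t));
    void acceptIncomingRead(void (*callback)(void*, std::size_t));

private:
    RdmaConnection() = default;  // real ctor would open the RDMA device
};
```

Because every caller goes through `instance()`, the teardown-after-transaction policy described below can be enforced in one place.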

5.2 Test Architecture

The testbed architecture is visualized in Figure 5. Our goal was a testbed design that can measure throughput and latencies on its own and report the results to persistent storage on a server. We used an Arria 10 GX 1150 development board with two QSFP modules supporting two 40GbE-CR4 connectors. Each QSFP module is connected to a 40GbE MAC IP and both QSFP connectors are connected to a 40GbE layer-3 switch [18]. The testbed design contains the main RoCEv2 stack, interconnections to the DDR3-SDRAM and the performance measurement logic. The DDR3-SDRAM on our board consists of two chips with 72 bits each at 2133 MT/s, which yields a 576-bit internal bus at 267 MHz in quarter-rate mode, of which 512 bits carry data with ECC enabled. In real memory testing, we achieved \( 128 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) for each memory for both sequential read and write with ECC enabled.

For the throughput measurements, we used a trigger system to be able to measure the total transfer time from initiation on the first testbed up until the last data word is written into the target memory. A similar setup was used for the latency tests with the difference that the second trigger fired when the first data word was written into the memory instead of the last.

Because the transceivers on our FPGAs are only capable of \( 40 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) we could only verify the \( 40 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) design over the network and the switch. We verified the \( 100 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) capability of the 512-bit bus implementation by directly connecting the streaming interfaces that would usually connect to the MAC controller.

5.3 Performance

While aiming for low complexity, our implementation should also be capable of very high throughput while providing very low latency. As a baseline for our results, we performed the same tests between two CPU servers with 40GbE network cards (MCX416A-BCAT) [17] and Mellanox OFED v4.7 drivers, connected to the same switch. We tested the throughput and latency for RDMA READ and WRITE on both FPGAs and CPUs.

Latency.

The FPGA-to-FPGA latency was measured with a previously established connection from the initiation of the RDMA transfer by an application/accelerator until the first data word was written to the external FPGA memory. The time required for the pre-RDMA protocol is included. The measured data path involves: RoCEv2 and UDP/IP of the sender, sender MAC, wire from sender to switch, packet processing on the switch, wire to the receiver, receiver MAC, UDP/IP and RoCEv2 receiver until handover to external memory. The FPGA transfers show a latency of \( 2.8 \,\mathrm{\mu }\mathrm{s} \) and \( 3.9 \,\mathrm{\mu }\mathrm{s} \) for RDMA READ and WRITE operations, respectively. Different transaction lengths have no influence on the measured latency.

To measure the latency between two servers, we performed 4-byte RDMA transactions. The measured time includes the whole RDMA operation until its completion, since the exact time of writing to the server RAM cannot be determined due to the DMA. Because of the small transaction size, the results are still comparable. The data path involves: sender network card, wire from sender to switch, packet processing on the switch, wire to the receiver, receiver network card. The CPU transfers show latencies of \( 126 \,\mathrm{\mu }\mathrm{s} \) and \( 191 \,\mathrm{\mu }\mathrm{s} \) for RDMA READ and WRITE operations, respectively.

This illustrates that the FPGA is about 45 times quicker in initiating a transfer, which speeds up smaller transfers in particular. The differences between RDMA READ and WRITE transfers can easily be explained by comparing them in Figure 4. The pre-RDMA protocol of the read transfer only consists of one RC message (including acknowledgment), containing the transfer size and memory region while the write transfer needs two messages as the memory region has to be sent by the receiving node.

We also measured the latency of just our hardware implementation without the 40GbE interfaces and MAC Controller by connecting the streaming interfaces that would usually connect to the MAC controller directly. In this test, our read transfer shows a latency of \( 1.1 \,\mathrm{\mu }\mathrm{s} \) and the write transfer \( 1.5 \,\mathrm{\mu }\mathrm{s} \) which suggests that about 60% of the latency of our “real world” setup is induced by the fabric.


Fig. 6. Throughput results for different test lengths with the difference of FPGA and CPU transfers shown dotted.

Throughput.

In Figure 6 the throughput results are visualized over the DMA transaction length. It becomes clear that for large transaction sizes, both the FPGAs and the CPUs reach the theoretical throughput maximum.

The name 40GbE refers to the total line rate for all transmitted bits, which the various headers of the protocol stack reduce to a lower usable payload rate. Of each Ethernet frame, 38 bytes are consumed by the preamble, start-of-frame delimiter, Ethernet header, FCS and interpacket gap. Furthermore, the Ethernet payload is limited to 1500 bytes [2], of which 20 bytes represent the IP header, 8 bytes the UDP header, 12 bytes the BTH and 4 bytes the ICRC. The Infiniband standard further specifies that the Maximum Transmission Unit (MTU) must be a power of 2, which means that we send Ethernet frames with 1024 bytes of RDMA payload. (1) \( \begin{equation} \begin{split} T_{theoretical} & =40\,\mathrm{G}\mathrm{b}/\mathrm{s}\cdot \frac{1024}{1024+38+20+8+12+4} \\ & \approx 37.034\,\mathrm{G}\mathrm{b}/\mathrm{s} \end{split} \end{equation} \)

Equation (1) describes the maximum theoretical throughput achievable with RoCEv2 over 40GbE hardware, of which we reached \( ^{37.031}/_{37.034}=99.99\% \). We measured the throughput including our pre-RDMA protocol, which adds the time of two or four small packets for RDMA READ and WRITE, respectively. The pre-RDMA protocol is also the reason why the RDMA read transfer is slightly faster for both the CPU and the FPGA, as only half the setup packets need to be transmitted.
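Equation (1) can be checked with plain arithmetic. The same per-payload overhead applied to a 100 Gb/s line rate also yields the 92.6 Gb/s theoretical maximum over 100GbE (function name is ours):

```cpp
#include <cassert>
#include <cmath>

// Numerical check of Equation (1). Per 1024-byte RDMA payload, one
// Ethernet frame carries 38 bytes of Ethernet-level overhead plus
// 20 (IP) + 8 (UDP) + 12 (BTH) + 4 (ICRC) bytes of protocol headers.
double roce_goodput_gbps(double line_rate_gbps) {
    const double payload  = 1024.0;
    const double overhead = 38.0 + 20.0 + 8.0 + 12.0 + 4.0;  // 82 bytes
    return line_rate_gbps * payload / (payload + overhead);
}
```

Evaluating this gives about 37.034 Gb/s at a 40 Gb/s line rate and about 92.586 Gb/s at 100 Gb/s.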

It can be seen that especially for transfers below \( 100 \,\mathrm{M}\mathrm{B} \) the FPGA throughput is significantly higher. At 50 to \( 150 \,\mathrm{k}\mathrm{B} \) the difference between the FPGA and CPU throughput is about \( 25 \,\mathrm{G}\mathrm{b}/\mathrm{s} \). For small transaction sizes, the latency of the RDMA transaction including the pre-RDMA protocol is dominant, so the communication between FPGAs achieves higher throughput due to the lower latency. This influence decreases as the transaction size increases.

To make sure that our implementation is \( 100 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) capable we tested the FPGA implementation in the same way without the QSFP connectors and MAC controller. The curve looks similar to the \( 40 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) curve and reaches \( 98 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) payload throughput which is more than the RoCEv2 protocol over 100GbE could theoretically achieve (\( 92.6 \,\mathrm{G}\mathrm{b}/\mathrm{s} \)).

Table 2. Capability-aware Resource Utilization Comparison for Proposed Designs

Name             Protocol   Throughput   FPGA                ALMs        M20K
Proposed¹        RoCEv2     100 Gb/s     Stratix10 SX 2800   57k (6%)    216 (2%)
Proposed [24]²   RoCEv2     40 Gb/s      Arria10 GX 1150     38k (9%)    164 (6%)
Proposed NAA³    RoCEv2     40 Gb/s      Arria10 GX 1150     48k (11%)   246 (9%)

ALM: Adaptive Logic Module. ¹ Design with 1 external memory (DDR4), 1 socket, with DHCP. ² Design with 1 external memory (DDR3) and 1 socket, without DHCP. ³ MobileNetV2 NAA with 2 external memories (DDR3) and 4 sockets.

5.4 Resource Utilization

Table 2 compares our proposed designs in terms of protocol, maximum throughput and resource usage; values for related work are given in Table 1 in Section 2. To provide at least some measure of comparability, all values are also given as a percentage of the total resources available on the used FPGA. The size of the hardware framework depends on the number of addressable off-chip memories and sockets, as can be seen in the comparison of proposed variants 1 and 2.

In terms of supported functions, ERNIC [34] is superior to our implementation, as it supports multiple QPs and native READ operations. StRoM [8, 25] offers a mixed picture: it provides some additional capabilities (multiple connections, RDMA READ) but does not support connection establishment through the RoCEv2 protocol. Due to the interoperability problems of proprietary protocols, we do not compare PROP [35], Mansour et al. [16] and Weerasinghe et al. [31] in detail.

It can be seen that the proposed NAA design compares favorably to state-of-the-art designs, reaching \( 40 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) and theoretically \( 100 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) while using only a small percentage of the available resources on a mid-range FPGA. Compared to ERNIC, our \( 100 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) RoCEv2 stack (excluding MAC and DDR, since Xilinx does not provide these numbers) requires 91 M20K blocks versus 142 36-kbit BRAMs. Converted to memory bits, our implementation thus saves 64% of BRAM, which corresponds to our low-complexity design approach. Both designs require 47k registers. Compared with our complete framework for \( 100 \,\mathrm{G}\mathrm{b}/\mathrm{s} \), StRoM requires 3.3 times more memory bits. This also demonstrates the success of our resource-efficient implementation, which delivers high throughput and low latency despite its reduced functionality. Furthermore, to our knowledge, our design is the only implementation that targets Intel FPGAs but can be used vendor-independently, which prevents vendor lock-in.

Skip 6USE CASE ML ACCELERATOR Section

6 USE CASE ML ACCELERATOR

To verify the energy savings of our NAA approach and test the performance of the RoCEv2 core as well as the framework, we built an image data classification demonstrator [26] using the Deep Convolutional Neural Network (DCNN) MobileNetV2 [23]. MobileNetV2 is an image classifier designed for the Imagenet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) dataset [22]. It reads in images of size \( 224\,\times \,224 \) pixels and assigns them one out of 1,000 labels. This is achieved by consecutively applying various convolution operations comprising a total of 300M multiply-accumulates.

The demonstration system as shown in Figure 7 consists of eight NAAs, a 40 Gb/s switch with passive 40GbE-CR4 cables as interconnect, and a server (Intel® Xeon® Silver 4114 CPU @ 2.20GHz with 10 cores, 48 GiB DDR4 RAM, 1TB SATA HDD, redundant 1400 W power supply) with a 40 Gb/s two-port network card running the application and the runtime resource management system. All components of the runtime resource management system run on one server, but they can easily be distributed to different systems thanks to the gRPC communication used. As pictured in Figure 8, the eight NAAs are mounted in a common enclosure with an energy-efficient, redundant 800 W power supply (with 80PLUS® Platinum certification) and cooling.

RoCEv2 is used to transfer weights and image data to the accelerators and to read back inference results. Compared to PCIe it has the advantage of being able to access all eight accelerators from a single server. Advantages of RoCEv2 include high reliability and good support in our network card, allowing us to utilize the available bandwidth at low CPU loads.


Fig. 7. Architecture of the NAA demonstration system.


Fig. 8. NAA chassis with 8 NAAs.


Fig. 9. Integration of MobileNetV2 accelerator into the NAA.

6.1 NAA Implementation

The setup of a single NAA can be seen in Figure 9. Each NAA implements two MobileNetV2 instances, which in turn are split into two sockets communicating via a shared DRAM. The main MobileNetV2 Accelerator was previously described in [14]. Compared to the original implementation, improved pipelining allows us to run both instances at 250 MHz instead of 200 MHz.

Reducing CPU Load.

Because of the way the accelerator implements the first layer of MobileNetV2, the input data needs to be provided in a special format: The \( 3\times 3 \) convolution is implemented as a \( 1\,\times \,1 \) convolution with input size \( 112\,\times \,112\,\times \,27 \). To save block RAM, the first few layers are split horizontally into four separate blocks. Each block overlaps its neighbor by two rows. Thus, these rows need to be repeated, forming a tensor of size \( (112\,+\,3 \cdot (2\,+\,2))\times 112\,\times \,27 = 124\,\times \,112\,\times \,27 \). Because the accelerator requires the number of input channels to be a multiple of 2, the 27 input channels need to be padded to 28. The accelerator operates on 14 horizontally adjacent pixels in parallel. Thus, the \( 124\times 112\times 28 \) input tensor needs to be reshaped into a \( 124\times 8\times 14\times 14\times 2 \) tensor, then transposed in dimensions two and three. Finally, the input is padded to the memory word size of 64 byte, yielding a \( 124\times 8\times 7\times 64 \) or \( 124\times 3584 \) tensor.

These reshaping operations are relatively expensive when done on a CPU, leading to higher power consumption and lower throughput, so we moved them into hardware. We decided to do this in a separate Reshape socket in order to simplify our implementation: the accelerator reads the input image in quick bursts, so if Reshape were implemented on the input of the accelerator, the reshape logic would need to process one 64-byte word per clock cycle, necessitating complex buffering and shifter/multiplexer logic. Implemented as a separate socket that writes its results back to DRAM according to the double-buffer principle, the Reshape component can operate in parallel with the main accelerator. The only constraint for this socket is that it must process each image as fast as or slightly faster than the main accelerator, which is easily achieved even when operating on just one pixel at a time. The Reshape socket also reduces the network bandwidth requirements from 124 \( \cdot \) 8 \( \cdot \) 7 \( \cdot \) 64 = 444 kByte per frame to 224 \( \cdot \) 224 \( \cdot \) 3 = 151 kByte per frame.

Bandwidth Requirements.

Since the MobileNetV2 parameters are too large to fit in the device’s block RAM, they are stored in external memory and read from there as needed. At a maximum throughput for a single MobileNetV2 socket of 658 fps, the required memory bandwidths are: (2) \( \begin{equation} \begin{split}B_{Rd} &= B_{RdOriginalImage} + B_{RdReshapedImage} + B_{RdResult} + B_{RdParameters} \\ &= (224~Pixel \cdot 224~Pixel \cdot 3~Byte + 124 \cdot 8 \cdot 7 \cdot 64~Byte + 1000~Byte + 7122816~Byte) \\ &\quad \cdot 658~fps = 5.08~GB/s \end{split} \end{equation} \) (3) \( \begin{equation} \begin{split}B_{Wr} &= B_{WrOriginalImage} + B_{WrReshapedImage} + B_{WrResult} \\ &= (224~Pixel \cdot 224~Pixel \cdot 3~Byte + 124 \cdot 8 \cdot 7 \cdot 64~Byte + 1000~Byte) \cdot 658~fps \\ &= 392~MB/s \end{split} \end{equation} \) (4) \( \begin{equation} B_{Total} = B_{Wr} + B_{Rd} = 5.47~GB/s, \end{equation} \) where the summands describe the bandwidths required for the following tasks (in the order an image passes through the FPGA):

  • Write the input image into DRAM via RDMA. (\( B_{WrOriginalImage} \))

  • Read the input image into the reshape socket. (\( B_{RdOriginalImage} \))

  • Write the reshaped image back to DRAM for later processing by the accelerator.

    (\( B_{WrReshapedImage} \))

  • Read the parameters for the accelerator. (\( B_{RdParameters} \))

  • Read the reshaped image into the accelerator. (\( B_{RdReshapedImage} \))

  • Write the classification results (probabilities for each label) to DRAM. (\( B_{WrResult} \))

  • Read the classification results back via RDMA. (\( B_{RdResult} \))

The memory bandwidth calculated in Equation (4) can be easily provided by the hardware framework as described in Section 3.1.

The required network bandwidth without the CCI traffic for accelerator management is calculated in Equation (5) as follows: (5) \( \begin{equation} \begin{split}B_{Total} = B_{ImagesToFPGA}+ B_{ResultFromFPGA} = 224~Pixel \cdot 224~Pixel \cdot 3 Byte \cdot 658 fps \\ +\, 1000 Byte \cdot 658 fps = 99.05 MB/s + 658 kB/s = 99.71 MB/s \end{split} \end{equation} \)
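The figures in Equations (2) through (5) can be reproduced with plain integer arithmetic (constant names are ours):

```cpp
#include <cassert>
#include <cstdint>

// Per-frame sizes in bytes, as used in Equations (2)-(5), at the 658 fps
// peak rate of a single MobileNetV2 socket.
constexpr int64_t kOriginalImage = 224 * 224 * 3;     // 150528 B
constexpr int64_t kReshapedImage = 124 * 8 * 7 * 64;  // 444416 B
constexpr int64_t kResult        = 1000;              // classification result
constexpr int64_t kParameters    = 7122816;           // MobileNetV2 weights
constexpr int64_t kFps           = 658;

// Equation (2): everything read from DRAM per second.
constexpr int64_t kRead =
    (kOriginalImage + kReshapedImage + kResult + kParameters) * kFps;
// Equation (3): everything written to DRAM per second.
constexpr int64_t kWrite = (kOriginalImage + kReshapedImage + kResult) * kFps;
// Equation (4): total memory bandwidth.
constexpr int64_t kMemTotal = kRead + kWrite;
// Equation (5): network bandwidth (image in via RDMA, result out).
constexpr int64_t kNetTotal = (kOriginalImage + kResult) * kFps;
```

Evaluating these constants gives about 5.08 GB/s read, 392 MB/s write, 5.47 GB/s total memory bandwidth and 99.71 MB/s network bandwidth, matching the equations.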

The inference task will be executed in a pipelined manner, i.e., RDMA, reshape and accelerator operate in parallel on different images. Thus, as long as these minimum bandwidth requirements are fulfilled, the actual available bandwidth will have no or negligible effect on the achieved throughput. The problem remains compute-bound.

For a cluster of eight NAAs running two accelerators each, the total network bandwidth required on the server side is about 1.6 GB/s, which is easily fulfilled by the 40G Ethernet links. Because of the remaining limitations in the RDMA software, the FPGAs could not be efficiently operated by a single process. Instead, two separate client processes are spawned, using separate physical ports of the network card.


Table 3. Resource Usage on Bittware 385A FPGA Board with an Intel Arria 10 GX 1150 FPGA Supporting 40 Gb/s RoCEv2

6.2 NAA Resource Usage

Table 3 displays the resource utilization when implementing our MobileNetV2 NAA, as depicted in Figure 9, with 40 Gb/s (256-bit bus width) RoCEv2 support on an Intel Arria 10 GX 1150 FPGA, which has about 400,000 ALMs and about 2,700 M20K BRAM blocks. We performed synthesis and place-and-route using Quartus 19.4 with the Superior Performance setting. In contrast to our conference paper [24], both external memories can be accessed via RoCEv2. This, as well as the use of four sockets instead of two, increases the resource utilization.

With a wide bus, the word width of a BRAM block becomes important for resource usage, since the whole bus word must sometimes be stored in a single cycle to handle backpressure. The M20K blocks in our FPGA are 40 bits wide, which means we need to combine 13 BRAM blocks to store 512 bits and 7 BRAM blocks to store 256 bits in one cycle. Our framework itself uses about 29k ALMs and 164 BRAMs, but additional controllers are needed for using the external memory and sending MAC packets. Our DDR controller has the option to enable Error-Correcting Code (ECC) capabilities. Although ECC-protected memory requires considerable additional resources, we consider reliable memory with error detection and correction to be important and have, therefore, enabled ECC for the external memory. The controllers may vary from board to board, but in our case they used 19k ALMs and 82 BRAMs with memory ECC enabled. Our framework thus leaves about 91.9% of the BRAM resources, 100% of the DSPs and about 88.8% of the ALM logic resources available to the accelerators, which is the majority of resources (R2). Other framework configurations require even fewer resources, as can be seen in Table 2. The framework with only one memory interface and one socket even leaves 91% of the ALMs and 94% of the BRAMs free for accelerator use.
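The M20K combination counts follow from a simple ceiling division (helper name is ours):

```cpp
#include <cassert>

// Storing a full bus word in one cycle needs ceil(bus_width / 40)
// parallel M20K blocks, each being 40 bits wide.
constexpr int m20kBlocksForBus(int bus_bits) {
    return (bus_bits + 39) / 40;  // integer ceiling division by 40
}
```

This yields 13 blocks for the 512-bit bus and 7 blocks for the 256-bit bus, as stated above.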

6.3 NAA Test Setup

To test our setup, we run two parallel inferences on the 50,000 images of the ILSVRC2012 validation set. The client application runs on the server in a container, which was started by our software framework (for a detailed description of the process see Section 3.2). The client reads in the images in batches of 128. Each batch is transferred via RDMA to the DRAM on one of the FPGA boards. Once this transfer has completed, the application prompts the corresponding Reshape socket to start processing the batch. The Reshape socket keeps a counter of how many images have been processed. The application periodically reads this counter and prompts the accelerator when one or more new images are ready. The accelerator also keeps track of the number of processed images. The application monitors this counter as well and reads the results back via RDMA as soon as a new batch is completed. The results are compared to the original label of the image in order to compute the prediction accuracy. The application can operate four FPGAs in parallel, interleaving the above steps in order to ensure full utilization and minimal idle time of the eight accelerators. With two parallel running applications simulating two different users, a total throughput of 10340 fps was achieved: 646 fps per accelerator socket. Compared to the 658 fps possible with a standalone accelerator, this equals a drop of less than two percent. All accelerators still achieve the MobileNetV2 accuracy of 71.74 % [14].

6.4 Energy Consumption

An important criterion for the effectiveness of ML accelerators is, besides throughput and accuracy, their energy consumption. Through optimization of energy consumption, both the CO2 emissions as well as the operating costs can be reduced. Therefore, we continuously measured the energy consumption of our NAA architecture. To measure the server, we use the internal measurement sensors, which are read out via the Baseboard Management Controller (BMC). For the switch and the NAA chassis, we use commercially available energy measurement devices. The power consumption of the individual NAAs is logged using a measurement system developed in-house. This system monitors all power supply rails of the FPGA with high temporal resolution (1 kHz) and stores the measurements in a database. For our energy measurements, we started the image classification from a launched container in a continuous run. The measured values were taken after several runs to simulate the effects of constant operation. We measured the energy consumption of the different NAAs one after the other. We have not performed any optimizations to save energy on the server.


Fig. 10. Energy consumption per frame for MobileNetV2 ILSVRC2012 classification for different architectures. GPU only shows the energy consumption of merely the GPU accelerator, where GPU includes the host server in addition to the GPU accelerators. FPGA only includes the pure energy requirements of the FPGA accelerator. For PCIe and NAA, the system components required for use are included in addition to the FPGA. For PCIe these are the host server and for NAA the control server and the switch. The GPU was a Tesla V100, the FPGA board a Bittware 385A.

Figure 10 compares the energy consumption per classified image using MobileNetV2 on GPU-based accelerators, on FPGA accelerators tightly coupled via PCIe, and on our proposed NAAs. We used the same server type for all comparisons. In the GPU system, 2 Tesla V100 GPUs are installed as classic GPU accelerators. Relative to [26], we could increase the throughput and energy efficiency for a realistic comparison. Now we achieve a throughput of 6042 fps in total on both GPUs with an energy consumption of 76 mJ per frame for the whole system including the host server; of this, the GPUs alone need 56 mJ. In the case of the PCIe-coupled FPGA accelerators, 2 FPGAs are installed in one server. This yields 2619 fps and a total energy consumption of 74.8 mJ per frame including the host server. Compared to the GPU solution, this is only a small improvement, but it should be noted that the Tesla V100 [4], manufactured in TSMC’s 12 nm FFN process, already has a technology advantage over the Arria 10 GX FPGAs [12] produced in a 20 nm process. The NAA architecture delivers the best energy efficiency with 58.2 mJ/frame (including the control server and switch) at 10340 fps, making it the most energy-efficient solution for image classification in this comparison. It beats the tightly coupled FPGAs as the closest competitor by a factor of 1.29 (1.31 towards the GPUs). The efficiency gain mainly results from the increased ratio of accelerators to servers, since a server with 100 W measured idle power already causes a high base load. This holds despite the additional 39.4 W measured for the switch; together, both components generate a static power of about 140 W.
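The quoted efficiency factors follow directly from the measured per-frame energies (a plain arithmetic check; constant names are ours):

```cpp
#include <cmath>

// Measured per-frame energies in mJ from the comparison above.
constexpr double kGpuSystemMj = 76.0;   // GPUs incl. host server
constexpr double kPcieFpgaMj  = 74.8;   // PCIe FPGAs incl. host server
constexpr double kNaaSystemMj = 58.2;   // NAAs incl. control server + switch

// Energy-efficiency factor of the NAA architecture over a competitor.
inline double efficiencyFactor(double competitor_mj) {
    return competitor_mj / kNaaSystemMj;
}
```

Rounded to two decimals, this gives 1.29 over the PCIe-coupled FPGAs and 1.31 over the GPU system.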

Skip 7CONCLUSION Section

7 CONCLUSION

In this article we present a resource-efficient FPGA implementation of the RoCEv2 protocol that enables maximum throughput on 40GbE hardware and is capable of 100GbE when using higher bandwidth Ethernet physical hardware. Likewise, we show the integration of the RoCEv2 stack into a framework and its use as an NAA. By using the presented NAA architecture as an accelerator for MobileNetV2-based image classification, it is shown that this architecture is the most energy-efficient solution. Therefore, an FPGA accelerator does not necessarily have to be attached to a host CPU via PCIe. The scalability of the architecture has proven to be excellent, as demonstrated by the use of eight NAAs. The CPU load on the server is very low and the network bandwidth is sufficient to add more NAAs to the system. Only the limitations of our RDMA software and missing hardware hinder this. Adding more NAAs will further increase system throughput and energy efficiency by sharing the system’s high static power consumption of about 140 W among more accelerators.

We have shown that the RoCEv2 core itself is resource-efficient, easy to integrate into similar frameworks and provides the same flexibility as tightly coupled interfaces. With our self-defined pre-RDMA protocol, we show that the necessary protocol parameters for an RDMA transmission can be transmitted “in-band” with Infiniband-defined packets. On the other hand, it must be stated that the network coupling of accelerators is not suitable for all acceleration problems. This is due to the nature of the DMA concept itself, whose efficiency benefits from the transfer of large data chunks, as can be seen in our experiments.

We strongly believe that by physically decoupling the accelerator from the host, the resulting architecture will lead to highly scalable, energy-efficient integration of application-specific FPGA accelerators into data centers. The NAA approach can also be used for other applications where it promises energy savings as well.

7.1 Future Work

Our next goal is to extend the MobileNetV2 accelerator into an ML training accelerator and integrate it into a distributed training system using the presented NAA architecture. Distributed training requires high communication bandwidths and low latencies. For this reason, we want to thoroughly test the \( 100 \,\mathrm{G}\mathrm{b}/\mathrm{s} \) capability of the RoCEv2 stack with real hardware, which could not be realized yet due to the lack of 100GbE infrastructure on our side. To raise the overall throughput even further, Ethernet channel bonding will be addressed as well. The RoCEv2 ICRC provides good integrity verification; however, for the use of Network-attached Accelerators, on-the-fly IP-layer encryption is needed to assure security.

During the integration of the MobileNetV2 accelerator, we identified some areas for improvement in the RoCEv2 stack that we want to address:

  • Support for multiple QPs for simultaneous connections. Our initially assumed simplification of only one QP did not prove successful due to repetitive connection setups and teardowns when using multiple sockets.

  • Support of RoCEv2 Congestion Management (RCM) to avoid congestion or even data loss with N-to-1 communication patterns between different NAAs.

  • RDMA READ support to make our IP Core more readily interoperable with existing middleware and overcome the disadvantage of CPU involvement in READ operations.

Another use case of the NAA framework is presented in [28]. We intend to continue this work and implement an NAA-based transcoding architecture to explore the NAA concept further.

REFERENCES

  [1] IEEE. 2017. IEEE Standard for Ethernet Amendment 10: Media Access Control Parameters, Physical Layers, and Management Parameters for 200 Gb/s and 400 Gb/s Operation.
  [2] IEEE. 2018. IEEE Standard for Ethernet. (Aug. 2018), 5600 pages.
  [3] 2021. Linux RDMA. https://github.com/linux-rdma/rdma-core.
  [4] 2021. NVIDIA Volta Unveiled: GV100 GPU and Tesla V100 Accelerator Announced. https://www.anandtech.com/show/11367/nvidia-volta-unveiled-gv100-gpu-and-tesla-v100-accelerator-announced.
  [5] Abel F., Weerasinghe J., Hagleitner C., Weiss B., and Paredes S. 2017. An FPGA platform for hyperscalers. In 2017 IEEE 25th Annual Symposium on High-Performance Interconnects (HOTI). 29–32.
  [6] ARM. (n.d.). AMBA AXI and ACE Protocol Specification. https://static.docs.arm.com/ihi0022/e/IHI0022E_amba_axi_and_ace_protocol_spec.pdf.
  [7] Caulfield A. M., Chung E. S., Putnam A., Angepat H., Fowers J., Haselman M., Heil S., Humphrey M., Kaur P., Kim J., Lo D., Massengill T., Ovtcharov K., Papamichael M., Woods L., Lanka S., Chiou D., and Burger D. 2016. A cloud-scale acceleration architecture. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–13.
  [8] ETH Zürich. (n.d.). Scalable Network Stack supporting TCP/IP, RoCEv2, UDP/IP at 10–100 Gbit/s. https://github.com/fpgasystems/fpga-network-stack.
  [9] InfiniBand Trade Association. 2010. Annex A16: RoCE. https://cw.infinibandta.org/document/dl/7148.
  [10] InfiniBand Trade Association. 2014. Annex A17: RoCEv2. https://cw.infinibandta.org/document/dl/7781.
  [11] InfiniBand Trade Association. 2015. Architecture Specification Volume 1, Release 1.3. https://cw.infinibandta.org/document/dl/7859.
  [12] Intel. 2018. Intel Arria 10 Device Overview. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/arria-10/a10_overview.pdf.
  [13] Kachris C. and Soudris D. 2016. A survey on reconfigurable accelerators for cloud computing. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL). 1–10.
  [14] Knapheide J., Stabernack B., and Kuhnke M. 2020. A high throughput MobileNetV2 FPGA implementation based on a flexible architecture for depthwise separable convolution. In 2020 30th International Conference on Field-Programmable Logic and Applications (FPL). 277–283.
  [15] Lockwood J. W. and Monga M. 2016. Implementing ultra-low-latency datacenter services with programmable logic. IEEE Micro 36, 4 (July 2016), 18–26.
  [16] Mansour W., Janvier N., and Fajardo P. 2019. FPGA implementation of RDMA-based data acquisition system over 100-Gb Ethernet. IEEE Transactions on Nuclear Science 66, 7 (July 2019), 1138–1143.
  [17] Mellanox Technologies. (n.d.). ConnectX-4 EN Card. https://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-4_EN_Card.pdf.
  [18] Mellanox Technologies. (n.d.). Mellanox Technologies SX1012 12 Port QSFP+ 40GbE 1U Ethernet. https://www.provantage.com/mellanox-technologies-msx1012b-2brs7MLNX1M3.htm.
  [19] University of Toronto. (n.d.). GULF-Stream. https://github.com/UofT-HPRC/GULF-Stream.
  [20] Ringlein B., Abel F., Ditter A., Weiss B., Hagleitner C., and Fey D. 2019. System architecture for network-attached FPGAs in the cloud using partial reconfiguration. In 2019 29th International Conference on Field Programmable Logic and Applications (FPL). 293–300.
  [21] Ruiz M., Sidler D., Sutter G., Alonso G., and López-Buedo S. 2019. Limago: An FPGA-based open-source 100 GbE TCP/IP stack.
  [22] Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z., Karpathy A., Khosla A., Bernstein M., Berg A. C., and Fei-Fei L. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252.
  [23] Sandler M., Howard A. G., Zhu M., Zhmoginov A., and Chen L.-C. 2018. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR abs/1801.04381 (2018). http://arxiv.org/abs/1801.04381.
  [24] Schelten N., Steinert F., Schulte A., and Stabernack B. 2020. A high-throughput, resource-efficient implementation of the RoCEv2 remote DMA protocol for network-attached hardware accelerators. In 2020 International Conference on Field-Programmable Technology (ICFPT). 241–249.
  [25] Sidler D., Wang Z., Chiosa M., Kulkarni A., and Alonso G. 2020. StRoM: Smart remote memory. In Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys ’20). Association for Computing Machinery, New York, NY, USA, Article 29, 16 pages.
  [26] Steinert F., Knapheide J., and Stabernack B. 2021. Demonstration of a distributed accelerator framework for energy-efficient ML processing. In 2021 31st International Conference on Field Programmable Logic and Applications (FPL). 386.
  [27] Steinert F., Schelten N., Schulte A., and Stabernack B. 2020. Hardware and software components towards the integration of network-attached accelerators into data centers. In 2020 23rd Euromicro Conference on Digital System Design (DSD). 149–153.
  [28] Steinert F. and Stabernack B. 2022. Architecture of a low latency H.264/AVC video codec for robust ML based image classification. Journal of Signal Processing Systems (31 Jan. 2022).
  [29] HITEK Systems. (n.d.). UDP/IP Offload Engine. https://hiteksys.com/fpga-ip-cores/udp-ip-offload-engine.
  [30] Tarafdar N., Eskandari N., Sharma V., Lo C., and Chow P. 2018. Galapagos: A full stack approach to FPGA integration in the cloud. IEEE Micro 38, 6 (2018), 18–24.
  [31] Weerasinghe J., Abel F., Hagleitner C., and Herkersdorf A. 2016. Disaggregated FPGAs: Network performance comparison against bare-metal servers, virtual machines and Linux containers. In 2016 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). 9–17.
  [32] Xilinx. 2015. UltraScale+ FPGAs Product Tables and Product Selection Guide. https://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-.
  [33] Xilinx. 2018. Xilinx Embedded Target RDMA Enabled NIC v1.1. (June 2018). https://www.xilinx.com/support/documentation/ip_documentation/etrnic/v1_1/pg294-etrnic.pdf.
  [34] Xilinx. 2021. Embedded RDMA Enabled NIC v3.1. (June 2021). https://www.xilinx.com/support/documentation/ip_documentation/ernic/v3_0/pg332-ernic.pdf.
  [35] Zang D., Cao Z., Liu X., Wang L., Wang Z., and Sun N. 2015. PROP: Using PCIe-based RDMA to accelerate rack-scale communications in data centers. In 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS). 465–472.


Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 1 (March 2023), 403 pages.
ISSN: 1936-7406, EISSN: 1936-7414, DOI: 10.1145/35733111
Editor: Deming Chen

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 6 September 2021
• Revised: 9 May 2022
• Accepted: 24 May 2022
• Published: 22 December 2022
