research-article
Open Access

VCSN: Virtual Circuit-Switching Network for Flexible and Simple-to-Operate Communication in HPC FPGA Cluster

Published: 11 March 2023


Abstract

FPGA clusters promise to play a critical role in high-performance computing (HPC) systems in the near future due to their flexibility and high power efficiency. Operating large-scale general-purpose FPGA clusters on which multiple users run diverse applications requires a flexible network topology that can be divided and reconfigured. This paper proposes the Virtual Circuit-Switching Network (VCSN), which provides an arbitrarily reconfigurable network topology and a simple-to-operate network system among FPGA nodes. With virtualization, user logic on FPGAs can communicate as if a circuit-switching network were available. This paper demonstrates that VCSN with 100 Gbps Ethernet achieves highly efficient point-to-point communication among FPGAs due to its unique and efficient communication protocol. We compare VCSN with a direct connection network (DCN) that connects FPGAs directly. We also show a concrete procedure for realizing collective communication on an FPGA cluster with VCSN. We demonstrate that the flexible virtual topology provided by VCSN can accelerate collective communication with simple operations. Furthermore, based on experimental results, we model and estimate the communication performance of DCN and VCSN in a large FPGA cluster. The result shows that VCSN has the potential to accelerate gather communication by up to about 1.97 times compared with DCN.


1 INTRODUCTION

Various alternative computational systems to the von Neumann architecture have been proposed and studied to realize high-performance computing (HPC) in the post-Moore era [17, 21, 40]. A large-scale FPGA cluster is one of the promising approaches for HPC in the near future due to its power efficiency and flexibility [6, 37]. By implementing dedicated circuits optimized for the target application and clustering FPGAs to perform large-scale parallel processing, it is possible to realize a computer system with both high power efficiency and high computational performance [1, 6]. State-of-the-art FPGA boards have high-bandwidth onboard memories such as HBM2 and high-bandwidth external communication ports. In addition, the amount of computing resources, such as on-chip logic and memory, has increased enough for high-performance computing, even on current commercial FPGA boards.

The network system connecting FPGA nodes is one of the critical factors in determining the overall performance and scalability of FPGA clusters [25], as in general cluster systems. Many studies on multi-FPGA systems aim to accelerate specific processes or applications. These studies often employ a network optimized for the communication patterns required by target applications, which is usually constructed with direct links connecting FPGAs with high-speed serial cables. We refer to this sort of network directly connecting FPGAs with cables as a direct connection network (DCN). An optimal DCN can minimize the communication latency between neighboring nodes, which helps scale the performance of target applications.

In contrast, we have been developing an FPGA cluster that accounts for its operation and management as an HPC system. This cluster development aims to create a power-efficient and flexible HPC system that takes advantage of FPGAs. In large-scale systems that execute multiple applications simultaneously, as existing HPC systems do, network scalability and flexibility are the keys to achieving high performance and operability. However, DCN, which is typically used for a specific application as mentioned above, lacks sufficient scalability and flexibility. In large-scale FPGA clusters, DCN can suffer from high communication latency due to its large network diameter. Also, remote users cannot reconfigure or divide the network because doing so requires reconnecting cables by hand. In addition, the network topology is limited by the number of external ports of an FPGA node.

In practical HPC systems, one of the solutions to such problems is to employ redundant networks with high communication capacity. As an example, the supercomputer Fugaku uses a proprietary six-dimensional torus Tofu interconnect [2] for the network. The higher-dimensional network topology keeps the diameter of the network small and solves the problem of communication latency. In addition, the high degree of freedom in selecting communication routes due to the redundant design provides a flexible network topology to meet requirements from various applications. However, applying such a specially designed network system to FPGA clusters is not realistic in terms of cost, as it requires the development of dedicated chips and the design of the nodes themselves.

Considering these points, we have adopted an indirect network of commercially available 100 Gbps Ethernet for a prototype FPGA cluster. It is relatively easy to implement the indirect network using an Ethernet IP provided by vendors. However, from the system user's point of view, the indirect network is much inferior in terms of operability compared to DCN, which can be used simply by sending data streams. For example, communication using Ethernet requires users to generate and receive Ethernet frames according to the protocol required by the Ethernet IP. In addition, if a host server controls the indirect network with its software stack, communication efficiency may be low due to the large software overhead.

Our approach to the problem is network virtualization by dedicated hardware in the FPGA. It provides a simple-to-operate virtual circuit-switching network (VCSN) in which users can handle communication as easily as with DCN. VCSN can have multiple virtual ports per physical network port on the FPGA node. VCSN allows a virtual port to be connected to another one in a different FPGA node as requested, so that users can easily change the virtual network topology by switching virtual links. Also, these ports share the network I/O bandwidth of the FPGA, and the user can assign bandwidth to each virtual port. These features allow VCSN to flexibly reconfigure the virtual network to meet a wide range of requirements.

Another advantage of this network virtualization is scalability. Since VCSN employs an indirect network, the diameter of the network can be smaller than that of DCN because a commercially available high-radix switch has more ports than those of an FPGA board. For example, the 100Gbps Ethernet switch we use has 16 ports, while no FPGA board has such a number of ports. In addition, if we increase the number of virtual ports, the degree of freedom in selecting routes for transfers is increased. These features of the indirect network and the redundant network topology are expected to achieve efficient data transfer even for operations with high communication costs, such as collective communication, in a large-scale network. Collective communication is a critical operation for data distribution and collection for the overall performance of large-scale parallel processing in HPC systems. Therefore, one of the objectives of this paper is to demonstrate that VCSN realizes efficient collective communication in the FPGA cluster.

Our prototype FPGA cluster employs a hardware platform we have developed, called AFU Shell [33], at each node that allows easy access to the interfaces on the FPGA, such as onboard DDR4 memories, PCI express, and network. AFU Shell is a shell extension accompanied by a network subsystem that can be replaced to support networks other than VCSN. Another objective of this paper is to show that VCSN has sufficient communication performance for actual operation by comparing it with other networks, such as inter-FPGA DCN and a host server’s network.

This paper describes the mechanism and performance evaluation of VCSN, which provides a flexible and simple-to-operate virtual network for large-scale FPGA clusters. We describe the hardware-based system to realize VCSN and its control method. We also show the overhead for control and efficient communication protocols and explain that VCSN provides high communication performance in actual operation. As a demonstration, we evaluate the communication performance on our prototype FPGA cluster. We compare the performance of inter-node data transfer and collective communication between VCSN and DCN and discuss the effectiveness of VCSN. Based on the results of this experiment, we evaluate the scalability of VCSN by formulating a model of communication time to estimate the time of collective communication in larger-scale networks.

The contributions of this paper are summarized below.

Proposal of the virtual circuit-switching network (VCSN) for HPC FPGA clusters.

Design and implementation of VCSN with 100 Gbps Ethernet.

Performance analysis of point-to-point communication with DCN and VCSN.

Performance comparison between DCN and VCSN for gather communication.

This paper is organized as follows. In Section 2, we list related work on FPGA networks and clusters to show the novelty of this research. Section 3 describes the structure of our developed FPGA cluster system. It also shows the hardware design in FPGAs, including the network subsystem and AFU Shell. Section 4 presents performance evaluations of inter-FPGA data transfers to compare VCSN and DCN. Section 5 describes how to achieve collective communication in the FPGA cluster. We also discuss the communication efficiency under high network load between VCSN and DCN. In Section 6, we demonstrate gather communication as an example of collectives in our FPGA cluster and compare the performance between VCSN and DCN. We also show an analysis of the communication performance and an estimation of the performance of larger systems. Finally, Section 7 presents our conclusions and avenues for future work.

This paper is based on the content of our previously published conference papers [23, 33], substantially extended with additional materials and discussion. The details of these previous papers are described at the end of Section 2.


2 RELATED WORK

The communication among nodes in the FPGA cluster is classified into two main approaches. One is to communicate via a network connecting the CPU servers equipped with the FPGA nodes [36], which requires multiple memory copies in CPU servers. The other is to use a dedicated network connecting the FPGA nodes. We will practically compare the communication in these networks in Section 4, and the conclusion is that the latter has better communication performance. Therefore, we list and describe the previous studies on dedicated FPGA networks for FPGA clusters and heterogeneous clouds to show the novelty of this study.

In many research projects and practical systems, the external network ports of commodity FPGA boards are connected to each other by cables to establish the dedicated networks of FPGA clusters. These networks can be roughly divided into two types: direct networks and indirect networks [13, 23, 35]. In a direct network, the external ports of FPGA boards are directly connected to each other by cables, and the boards communicate using serial transceivers. In contrast, in an indirect network [8, 32], all FPGA nodes are connected to network switches by cables; hence, data transfers between FPGA nodes always go through one or more switches. Each of these networks has its own characteristics, and each study has used the one that matches its purpose.

The former provides low-latency communication between directly connected neighboring nodes. However, its weaknesses are the inflexibility of its fixed topology and the lack of scalability due to the high communication latency between distant nodes in a large-scale network. Given these characteristics, direct networks are suitable for relatively small FPGA clusters specialized for specific applications, such as a convolutional neural network [38] or a three-dimensional FFT [29]. Wang et al. [38] implemented a convolutional neural network with multiple FPGAs and optimized the performance with a deep pipeline in a 1D topology. Sheng et al. [31] demonstrated a three-dimensional Fast Fourier Transform (3D FFT) on an FPGA cluster with a direct connection network. Another important work is that of Putnam et al. [28], which employs an FPGA cluster in a large-scale data center. This work employed a 6 \(\times\) 8 two-dimensional torus network for the FPGA cluster to accelerate communication and services in the data center. As these examples show, direct networks provide high communication performance when specific situations or applications limit the communication patterns.

An example of a general-purpose FPGA cluster employing a direct network is Cygnus [35], a heterogeneous cluster that consists of GPUs and FPGAs. In Cygnus, there are 64 FPGA nodes, and each node has four external ports, forming an 8 \(\times\) 8 two-dimensional torus topology. This FPGA cluster provides high performance, but partitioning the network limits the available topology. For example, when assigning FPGA nodes to multiple applications in Cygnus, the topology available to an individual application is limited to a two-dimensional mesh or a one-dimensional topology. In summary, the weaknesses of direct networks for FPGA clusters are as follows.

The number of available communication ports limits the network topologies we can adopt.

Routing functions such as routers are necessary and consume FPGA resources.

The network topology cannot be modified or divided without physically reconnecting cables.

Due to these weaknesses, it is not appropriate to directly apply the direct network to our target large-scale FPGA clusters.

On the other hand, an indirect network has a larger minimum communication latency than a direct network. However, all nodes can communicate with each other without involving other nodes, and the network scales well when switches with a sufficient number of ports are used. Because of such flexibility and scalability, indirect networks are suitable for large-scale FPGA clusters. There has been a variety of research related to indirect networks, such as FPGA-based implementations of network switches [26, 27], FPGA-based Ethernet communication [19], and UDP/IP stacks [4, 20]. These works proposed advanced, high-performance communication systems or their components. Compared with these studies, this paper focuses on large-scale communication using the entire network of an FPGA cluster, which has been lacking in previous studies.

For large-scale clusters or clouds, Tarafdar et al. [32] presented a network framework for FPGA clusters in a heterogeneous cloud data center. They proposed a communication platform for lightweight heterogeneous clouds based on OpenStack to communicate among various devices, such as CPUs, GPUs, and FPGAs. Their target is a framework for heterogeneous systems, and the performance evaluation was limited to simple communication. Caulfield et al. [8] proposed a configurable FPGA-based cloud architecture. They utilized FPGAs as communication devices in a large-scale data center to improve communication performance. De Matteis et al. [10] proposed an HLS communication interface specification called SMI for multi-FPGA systems, targeting HPC applications. SMI is an MPI-like specification and provides a seamless programming model with OpenCL-capable Intel FPGAs. Although SMI allows users to easily implement various communication functions in multi-FPGA systems, their study did not address the network system or communication mechanism. Compared with these previous studies, our approach differs in the following points.

Virtual networks that facilitate resource management through network partitioning and topology reconfiguration.

Low-overhead topology reconfiguration through hardware-based network virtualization.

Demonstration of actual collective communication assuming high-performance computation.

Our study also aims to perform collective communication in the FPGA cluster to demonstrate the performance of VCSN. There is also some research on network virtualization and collective communication with FPGA clusters. The idea of virtualizing FPGA networks has also been partially presented in [39]; unlike our work, that study virtualized FPGA networks in software and provided no practical performance evaluation. For collective communication on FPGA clusters, [30] demonstrated collective communication on a direct network, focusing on efficient routing methods and algorithms for collectives in FPGA clusters; their target is therefore quite different from ours. Gao et al. [12] showed an MPI_Barrier implementation on FPGAs. This is a very low-latency MPI_Barrier implementation for large clusters, which is useful for real-world cluster processing. However, their research implements only part of the features of collective communication and does not deal with actual data communication. He et al. [15] proposed ACCL, an FPGA-accelerated collective library designed for applications on Xilinx FPGAs. They achieved flexible and fast collectives and demonstrated them with up to eight nodes. However, their goal was a high-performance FPGA implementation of the collectives themselves, not the network system or communication mechanism. Our goal is to provide an optimal virtual topology for FPGA clusters through network virtualization.

There has been a lot of research on FPGA-based HPC systems in recent years. Noctua [11] is an FPGA cluster in which nodes are connected by an optical circuit switch (OCS). OCS can change the interconnection in the switch dynamically and provides very high communication bandwidth. The weaknesses of this system are that it is expensive to install and scale, and that the port connections need to be predefined. Mavroidis et al. [22] proposed ECOSCALE, a programming environment and runtime system for FPGA-based HPC. ECOSCALE directly maps the hierarchical structures of HPC applications to hardware resources to introduce a highly efficient heterogeneous architecture. In contrast, our approach virtualizes the FPGA cluster network itself, allowing flexible construction of various network topologies. Since there is little direct competition between their research and ours, we can rather expect more integrated achievements through collaboration in the future.

In our previous studies [23, 33], we presented the virtual circuit-switching network (VCSN) system for large-scale FPGA clusters, which realizes a configurable virtual network topology by exploiting the flexibility of FPGAs and indirect networks. This system can reconfigure its virtual network topology in less than one microsecond. We also conducted fundamental evaluations such as point-to-point data transfer. However, those papers did not describe the precise operation of FPGA nodes for complex communications such as collective communication, and the performance evaluation was limited to basic point-to-point communication. This paper substantially extends the previous work [23, 33] with more detailed communication analysis, practical performance evaluation, and performance estimation of VCSN in large-scale clusters.


3 FPGA CLUSTER

This section presents details of a prototype FPGA cluster that we developed to evaluate the performance of the VCSN implementation. It includes an overview of the cluster, network connectivity, a unique AFU Shell platform for the FPGA cluster, and a dedicated communication protocol for this system. First, we list the requirements of the FPGA cluster system to realize VCSN and efficient communication.

The first requirement is a mechanism to easily reconfigure any virtual network topology. The FPGA cluster that we target in this study requires scalable and easy-to-use network virtualization that users can handle as a circuit-switching network. As described in the previous section, an indirect network suits this purpose because of its flexibility and scalability. Based on this indirect network, we build a system that exploits the advantages of a virtual network whose topology can be reconfigured remotely, independently of the physical network connections.

The second is communication performance. Since our challenge is to develop an FPGA cluster applicable to HPC systems, low latency and high bandwidth are necessary for the network. Although VCSN has a higher latency than a direct network for communication between neighboring nodes, it can achieve high performance in actual communication by using efficient communication protocols and selecting an optimal virtual topology.

The third is to provide a simple control and efficient mechanism for various types of communication among multiple FPGAs. Our system utilizes Direct Memory Access (DMA) transfers between onboard memories in different nodes for inter-node communication. In Section 5, we will show how to effectively combine these inter-node transfers under simple control to achieve high-speed collective communication in the FPGA cluster. For this, we have designed AFU Shell, which can fully exploit the network bandwidth. Furthermore, we introduce the Message Passing Interface (MPI) on the CPU servers to operate multiple FPGA nodes in parallel.

The following subsections describe each element of the prototype system of the FPGA cluster we have developed for this study.

3.1 Overview

Figure 1 shows the structure of the prototype FPGA cluster we have developed. The cluster consists of 16 FPGA nodes and 8 CPU servers. Every CPU server connects to 2 FPGA nodes through PCI express slots. FPGA nodes have two external ports, both of which are linked to the FPGA-dedicated network by optical cables. Since optical cables have lower attenuation and interference than copper cables, we choose optical ones considering their suitability for large-scale HPC systems [7]. All CPU servers are also connected by the InfiniBand network.


Fig. 1. Structure of our developed FPGA cluster prototype.

The cluster employs Intel Programmable Acceleration Cards (PACs) D5005 as FPGA nodes. PAC is an acceleration card that contains an FPGA chip, DDR4 memory, and two QSFP28x4 ports. The PAC D5005 carries an Intel Stratix 10 GX FPGA with two regions in the FPGA: the FPGA Interface Manager (FIM) and the Accelerator Functional Unit (AFU). FIM provides fundamental functions for FPGA acceleration, including the PCIe IP core, the onboard DDR memory interface, and the network interface. AFU is a hardware accelerator implemented in FPGA logic to offload computational operations. Developers design an AFU and implement it using partial reconfiguration (PR).

3.2 Network

Figure 2 shows physical cable connections for VCSN in the prototype system. Since an FPGA node has two external ports, 32 cables are necessary for the network connection. We employ two 100 Gbps Ethernet switches (Mellanox MSN2100). As shown in the figure, we employ a dual-plane network in which each of the switches connects one of the external ports of FPGA nodes. The dual-plane network allows nodes to communicate with all other nodes. In addition, a switch with many ports can build a scalable indirect network because it keeps the number of communication hops in the network low. Therefore, the number of ports in a network switch is one of the critical factors for the performance of large-scale clusters. VCSN provides a virtual topology that does not depend on the physical constraints of the network because multiple virtual ports are implemented in one physical port of an FPGA node.


Fig. 2. Network connection for the FPGA cluster with VCSN.

We have also implemented another network, the direct connection network (DCN), to compare with VCSN for communication performance. In DCN, the external ports are connected directly without network switches. Since the FPGA nodes have two ports, DCN provides only a one-dimensional ring topology.

3.3 AFU Shell

Figure 3 shows a block diagram of our developed AFU Shell for VCSN. We have been developing our unique AFU Shell based on the Intel Acceleration Stack. It provides access to the onboard memory, PCI express, and network interfaces. As mentioned, the FPGA contains the FIM and AFU regions.


Fig. 3. AFU Shell, the hardware platform for FPGA applications in our developed cluster.

FIM, the blue area in Figure 3, provides fundamental functions to access interfaces, including the FPGA Interface Unit (FIU), the Embedded Memory InterFace (EMIF), and the High Speed Serial Interface (HSSI). FIU functions as a bridge between the PCIe and AFU interfaces. EMIF is the interface to communicate with the onboard memories. HSSI provides interfaces to the network through the QSFP28 external ports.

In the AFU region, we employ the AFU Shell we have developed [33], the green area in Figure 3, to provide a user-friendly FPGA environment for implementing applications. Users can easily access the FIM functions such as DMA and the network by connecting their application cores to the crossbar in AFU Shell. AFU Shell consists of bridges to PCIe and the onboard memories, the crossbar that routes data streams in the FPGA, and modular Scatter-Gather DMA (mSGDMA) cores for DMA read/write access to the onboard memories. These modules are connected by Avalon-ST and Avalon-MM interfaces; Avalon-ST is mainly used for data movement in AFU Shell, and Avalon-MM for accessing registers from the CPU server. AFU Shell has two mSGDMA read/write pairs, each of which can independently access its corresponding address space of the onboard memory. This enables efficient collective communication in the FPGA cluster because an FPGA node can perform two DMA transfers simultaneously.
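As a rough illustration of this structure, the crossbar can be thought of as a runtime-configurable routing table between stream endpoints. The following Python sketch models this idea; the endpoint names and the one-route-per-source restriction are illustrative assumptions, not the actual AFU Shell interface.

```python
# Toy model of the AFU Shell crossbar: DMA engines, user cores, and the
# network's virtual ports are endpoints, and configuring the crossbar means
# installing source -> sink routes for Avalon-ST-like streams.

class Crossbar:
    def __init__(self, endpoints):
        self.endpoints = set(endpoints)
        self.routes = {}                      # source -> sink

    def connect(self, source, sink):
        # Routes are reconfigurable at runtime from the CPU server.
        assert source in self.endpoints and sink in self.endpoints
        self.routes[source] = sink

    def forward(self, source, word):
        """Deliver one stream word along the configured route."""
        return self.routes[source], word

xbar = Crossbar(["dma0_rd", "dma1_rd", "user_core", "vport0", "vport1"])
# Two independent mSGDMA read channels can feed two virtual ports at once,
# which is what allows two simultaneous DMA transfers per node.
xbar.connect("dma0_rd", "vport0")
xbar.connect("dma1_rd", "vport1")
```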

3.4 Network Subsystem

The area surrounded by the red dashed line in Figure 3, which spans the AFU and FIM, is the network subsystem realizing VCSN. Figure 4 shows the detailed design of the subsystem. It includes MUXs and DMUXs, encoders and decoders, and control and status registers (CSRs), all of which we have developed from scratch (shown as orange-colored boxes in Figure 3), as well as 100 Gbps Ethernet MAC IPs provided by Intel. The input ports of MUX and the output ports of DMUX can be regarded as the virtual ports of VCSN. Users can select a virtual port by configuring the crossbar in AFU Shell. This subsystem is responsible for generating Ethernet frames for communication, configuring the virtual network topology, and allocating the available network bandwidth. As shown in the figure, a set of MUX and DMUX, an encoder and decoder, and an Ethernet IP is necessary for each external port of the FPGA node.


Fig. 4. Network subsystem for VCSN.

MUX and DMUX are the interfaces between AFU Shell and the virtual network, which provide time-division multiplexing and demultiplexing of multiple data streams from/to AFU Shell. The multiplexing basically follows a round-robin fashion. However, it is possible to assign a priority level to each input of MUX so that the virtual port with the highest priority is always selected. Therefore, depending on the configuration, MUX can output data from one virtual port continuously when a large amount of data needs to be sent from a single virtual port. MUX encodes the input port number into the output data stream as Avalon-ST channel signals to inform the encoder which port the data came from. DMUX distributes received data to the output ports to AFU Shell by following the Avalon-ST channel signals from the decoder.
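The arbitration scheme described above can be modeled as follows. This Python sketch is illustrative only: the two-level priority scheme (a single optional high-priority port) is a simplification of the per-port priority levels of the actual MUX.

```python
# Model of the MUX arbitration: virtual ports are served round-robin, but a
# port given priority is selected whenever it has data pending, letting a
# single virtual port stream continuously under heavy load.

from collections import deque

class VirtualPortMux:
    def __init__(self, num_ports, priority=None):
        self.queues = [deque() for _ in range(num_ports)]
        self.priority = priority          # index of high-priority port, or None
        self.next_rr = 0                  # round-robin pointer

    def push(self, port, word):
        self.queues[port].append(word)

    def pop(self):
        """Return (port, word) for the next word on the multiplexed stream."""
        # The high-priority port wins whenever it has pending data.
        if self.priority is not None and self.queues[self.priority]:
            return self.priority, self.queues[self.priority].popleft()
        # Otherwise scan ports round-robin from the saved pointer.
        n = len(self.queues)
        for i in range(n):
            port = (self.next_rr + i) % n
            if self.queues[port]:
                self.next_rr = (port + 1) % n
                return port, self.queues[port].popleft()
        return None                       # all queues empty
```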

The encoder generates Ethernet frames from input data streams. We employ Ethernet jumbo frames with a maximum length of 9,170 bytes for high payload efficiency. Figure 6 shows the communication protocols for VCSN and DCN. The encoder generates the Ethernet header and payload regions of each Ethernet frame; the other regions of the format follow the Ethernet protocol. In the payload region, the necessary information is encoded, such as the virtual port number and Avalon-ST's start-of-packet (SOP), end-of-packet (EOP), and channel signals. The encoder and decoder share a register called the MAC table, which contains the destination MAC addresses of all the virtual ports in the FPGA. The MAC table can be written from the CPU server to configure the virtual network topology. The encoder refers to the MAC table to generate an Ethernet header containing the destination and source MAC addresses. With these mechanisms, once the MAC table is set up, a virtual network topology is constructed in which the input/output ports of MUX/DMUX work as network ports.
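To make the framing concrete, the following sketch shows how an encoder of this kind could build a frame by looking up the destination MAC in the MAC table and embedding the virtual port number and SOP/EOP flags in the payload. The exact field layout and the EtherType are assumptions for illustration; only the overall scheme follows the description above.

```python
# Minimal sketch of the encoder's framing logic: look up the destination MAC
# for a virtual port in the MAC table, prepend an Ethernet header, and encode
# the virtual port number and Avalon-ST SOP/EOP flags into the payload.

import struct

ETHERTYPE = 0x88B5   # "local experimental" EtherType, assumed here

def encode_frame(mac_table, src_mac, vport, payload, sop, eop):
    dst_mac = mac_table[vport]                   # the topology lives in this table
    header = dst_mac + src_mac + struct.pack("!H", ETHERTYPE)
    # One byte carries the virtual-port number; another carries the SOP/EOP
    # flags (bit 0 = SOP, bit 1 = EOP) -- an assumed layout.
    meta = struct.pack("!BB", vport, (sop & 1) | ((eop & 1) << 1))
    # Pad to the 64-byte minimum handling unit mentioned in the text.
    body = meta + payload
    if len(body) < 64:
        body += b"\x00" * (64 - len(body))
    return header + body
```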

From the VCSN protocol in Figure 6, we can calculate the payload efficiency. Any transferred data is incorporated into Ethernet frames according to the protocol shown in Figure 6. In the most efficient case, i.e., when more than 9,152 bytes of data are transferred in a batch, the payload efficiency is 9152/9194 = 99.5%. In the least efficient case, it is 64/106, or approximately 60%. In our current implementation, data are handled in units of 64 bytes, so smaller data are zero-padded to 64 bytes.
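The efficiency figures above can be reproduced with a few lines of arithmetic; the 42-byte per-frame overhead is derived from the quoted numbers (9194 − 9152 and 106 − 64), not read off the frame format directly.

```python
# Reproducing the VCSN payload-efficiency figures quoted in the text.

def payload_efficiency(payload_bytes, overhead_bytes=42, min_unit=64):
    # Data smaller than the 64-byte handling unit is zero-padded up to it.
    effective = max(payload_bytes, min_unit)
    return payload_bytes / (effective + overhead_bytes)

best  = payload_efficiency(9152)   # 9152/9194, a full jumbo frame
worst = payload_efficiency(64)     # 64/106, a single 64-byte unit
```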

The current implementation of VCSN has neither a data re-transmission function nor TCP-like flow control. Although we omit such functions in this study because the evaluation is at the laboratory level, they are necessary in actual operation to avoid data loss or errors. Instead, although not shown in the figure, each output port of the DMUX has a receive buffer to prevent data loss in the current design. Our VCSN implementation also has a mechanism to intentionally reduce the output bandwidth of a source virtual port based on the status of the destination node. Since virtualization assigns multiple virtual ports to a single physical port, several sources may send data to the same physical port simultaneously. In such cases, a large amount of data arrives at a single output buffer of a network switch, and data loss due to buffer overflow may occur. To avoid this problem, the decoder has a function to detect when the received bandwidth exceeds a certain threshold. The decoder passes this information to the encoder, and the encoder then sends an indication frame to the source node by referring to the MAC table. The node that receives the indication frame reduces the output bandwidth of, or stops, the corresponding virtual port. To enable this bandwidth control feature that prevents data loss, we prohibit multiple virtual ports from connecting to a single virtual port concurrently in VCSN. However, thanks to the flexibility of VCSN, it is still possible to use one virtual port for communication with multiple ports in a time-division manner by updating the connections in the MAC table during communication.
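The receive-side bandwidth guard can be summarized as follows. This sketch models the decoder-side detection and the resulting indication to the source; the window-based rate accounting and the threshold value are illustrative assumptions.

```python
# Simplified model of the receive-side bandwidth guard: the decoder tracks
# the bytes received per time window, and when the threshold is crossed, the
# encoder sends an indication frame back to the source (modeled as a record)
# asking it to lower its output bandwidth or stop the virtual port.

class BandwidthGuard:
    def __init__(self, threshold_bytes_per_window):
        self.threshold = threshold_bytes_per_window
        self.received = 0
        self.indications = []          # frames the encoder would send back

    def on_receive(self, src_vport, nbytes):
        self.received += nbytes
        if self.received > self.threshold:
            # The encoder looks up the source in the MAC table and sends an
            # indication frame asking it to slow down.
            self.indications.append(("SLOW_DOWN", src_vport))

    def end_window(self):
        self.received = 0              # rate is measured per time window
```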

3.5 Direct Connection Network

To compare the communication performance with VCSN, we also implement a direct connection network (DCN), where FPGA nodes are directly connected by cables. The difference in FPGA implementation between VCSN and DCN is only the network subsystem region. Figure 5 shows the subsystem for DCN. In DCN, instead of Ethernet IPs, there are SerialLiteIII (SL3) IPs, Intel-provided serial transceiver modules for inter-FPGA serial links. The DCN subsystem contains a credit-based flow controller (FC) [24] and a burst gap generator, both of which we have developed. FC manages output data from the FPGA to prevent buffer overflows in the destination node. However, providing this back-pressure function for inter-FPGA communication slightly reduces the payload efficiency, as compared in Figure 6. This is one of the reasons why the throughput is lower than that of VCSN for large data sizes in the communication performance comparison, as described in [33]. There is room in the FC protocol for increasing efficiency. However, removing the unused bits in the FC header of the current DCN protocol would not have a significant effect, because the modules connected to the Avalon-ST interface in AFU Shell are designed to process data in 512-bit units every cycle. Also, for the relatively large (>4 KByte) data transfers covered in this study, the throughput improvement obtained by optimizing the DCN protocol would be only 2.36% at most. For these reasons, we use the original specification of the FC protocol [24] in this paper. The burst gap generator inserts gaps into the data streams following the SL3 specification.

Fig. 5.

Fig. 5. Network subsystem for DCN.

Fig. 6.

Fig. 6. Communication protocols in VCSN and DCN.

In this study, we compare the performance of DCN and VCSN and investigate the differences in detail. As shown in Figure 6, the communication protocol in DCN is less efficient than that of VCSN. Since SL3 also employs 64B/67B physical-layer encoding for transferred data, the payload efficiency in DCN is clearly lower than that in VCSN. Although both DCN and VCSN employ 100 Gbps links, we need to take this protocol difference into account when comparing performance.

Skip 4INTER-FPGA COMMUNICATION Section

4 INTER-FPGA COMMUNICATION

This section presents the performance of inter-FPGA node communication and the behavior of the AFU Shell and the network subsystems. A combination of these fundamental data transfers is required to realize the collective communication in the cluster described in Section 6. This section also describes how to perform an inter-FPGA data transfer using the system introduced above. As evaluations, we measure the throughput and latency of data transfers between two FPGAs in DCN and VCSN. In addition, we show the resource utilization in an FPGA for the DCN and VCSN systems.

4.1 Communication Procedures

The following describes the procedure for executing the communication between two FPGA nodes with VCSN and DCN. At the start of the procedure, it is assumed that the transmitted data exists in the on-board memory of the source node. For the communication in VCSN, the procedure involves these steps.

(1) Configuration of the virtual topology by updating MAC tables in FPGAs,

(2) Setting up the data path between the on-board memory and the external port on the communicating FPGA by configuring the crossbar,

(3) Launching the SGDMA write in the destination node,

(4) Launching the SGDMA read in the source node, and

(5) When the memory write is completed at the destination node, an interrupt request (IRQ) signal is sent to the local CPU server to complete the communication.

In this paper, we assume this series of operations as a single data transfer with VCSN. In VCSN, we can execute the communication between any two nodes following the above procedure. Although a data transfer through a large number of switches is possible for communication in a large-scale network, this section deals with transfer through only one network switch.
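The five-step procedure above can be sketched as host-side pseudocode. The function names below (update_mac_table, set_crossbar, and so on) are hypothetical stand-ins for the actual control API, used only to show the ordering of the steps.

```python
# Hypothetical host-side sketch of the five-step VCSN transfer procedure.
# The functions are illustrative stand-ins recorded into a log, not a real API.
def vcsn_transfer(src: int, dst: int) -> list:
    log = []
    log.append(f"update_mac_table({src},{dst})")  # (1) set virtual topology
    log.append(f"set_crossbar({src})")            # (2) memory-to-port path
    log.append(f"set_crossbar({dst})")            #     on both nodes
    log.append(f"sgdma_write({dst})")             # (3) destination first
    log.append(f"sgdma_read({src})")              # (4) then the source
    log.append(f"wait_irq({dst})")                # (5) IRQ ends the transfer
    return log
```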

For DCN, since the network links are fixed, step (1) in the above list is unnecessary; the other steps are the same as for VCSN. Therefore, communication in DCN requires fewer operations than in VCSN. It is possible to communicate with distant nodes when crossbars directly connect the input and output of two external ports of the FPGAs along the communication route. However, since such usage requires occupying multiple links for a single transfer, the communication efficiency decreases in an environment where many data transfers are executed simultaneously, such as the collective communication addressed in this paper.

4.2 Data Transfers between FPGA Nodes

4.2.1 Comparison between Host Network and FPGA Dedicated Network.

We measure communication time with the three types of data transfers shown in Figure 7. In the figure, (A) is the VCSN and (B) is the DCN data transfer, while (C) is a transfer passing through the CPU servers and their network. (A) and (B) use the AFU Shell and the networks described above. (C) does not use the FPGA-dedicated network but the InfiniBand network connecting the servers. The first two operate two locally connected FPGA nodes directly from a single server, whereas (C) uses Message Passing Interface (MPI) functions because it needs to control two servers simultaneously. Therefore, the total communication time of (C) includes an additional control overhead.

Fig. 7.

Fig. 7. Three types of the inter-FPGA node data transfers.

The evaluation measures the single data transfer time following the above procedure with various data sizes. Note that the onboard memory bandwidth, 19.2 GB/s, is larger than the network bandwidth, 12.5 GB/s, and thus does not affect the evaluation. Figure 8 shows the resulting data transfer time in each network. For VCSN and DCN, we measure the data transfer time between the onboard memories of two FPGA boards connected to the same CPU server, from driving the DMA read at the source to the completion of the DMA write at the destination. This measured time includes overheads such as FPGA register accesses to drive the DMA and CPU interrupts to signal the completion of the transfer. In the case of the host network, on the other hand, transfers between the FPGA and the host server are performed by DMA on the FPGA via PCIe, and transfers between hosts by MPI send/recv functions. The communication between FPGAs over the host network may therefore include overheads such as FPGA register accesses and MPI overhead [34]. The Intel Stratix 10 PCIe Gen3 x16 used in this experiment provides a theoretical maximum bandwidth of 15.75 GB/s; however, our host server-FPGA memory transfer trials showed a maximum effective bandwidth of approximately 10.8 GB/s. Figure 8 also shows, as dashed lines, the theoretical data transfer time for each network, calculated with a simplified model that assumes transfers at the theoretical maximum bandwidth. These models intentionally ignore latency and memory copy time during data transfer to show clearly the impact of the overhead.

Fig. 8.

Fig. 8. Measured and theoretical data transfer time between two FPGA nodes.

Figure 8 clearly shows that DCN is the fastest and the host network is the slowest. Since a transfer over the host network requires multiple memory copies, the host network is the slowest in the estimated time as well. The measured values of DCN and VCSN gradually approach the theoretical values with increasing data size. This indicates that the communication overhead in DCN and VCSN is constant, so its impact becomes smaller as the data transfer time increases. In contrast, the transfer time over the host network does not asymptotically approach the theoretical value even for large data sizes because the MPI overhead depends on the data size. The data transfer time of the host network was about ten times longer than that of the dedicated FPGA network. Although it should be possible to speed up communication over the host network, it is unlikely to reach the communication performance of the dedicated FPGA network any time soon. We can learn from this result that inter-FPGA communication should be done over a dedicated network, not through the CPU.

Another point to note is that the communication time of VCSN approaches that of DCN as the data size increases. For small data transfers, communication in VCSN is slower than in DCN due to the higher transfer latency and the overhead of setting the topology in the communication procedure described above. Since the setup time and the latency are fixed even when the data size increases, the ratio of these times to the total transfer time becomes smaller for large data sizes. In addition, the protocols shown in Section 3 affect this behavior: since DCN has a less efficient protocol than VCSN, the theoretical communication bandwidth is higher in VCSN. Therefore, for large data transfers, VCSN has a bandwidth advantage over DCN.

4.2.2 Comparison between DCN and VCSN.

Based on these results, we further investigate the inter-FPGA node data transfers in DCN and VCSN. Figure 9 shows the throughput for DCN and VCSN calculated from the transfer data size and the transfer time. According to the graph, when the data size is larger than approximately 10 MB, VCSN achieves better throughput than DCN. We verify this result against the theoretical maximum bandwidth derived from the protocols.

Fig. 9.

Fig. 9. Throughput of communication between two FPGA nodes.

First, the maximum bandwidth of DCN can be calculated from the protocol in Figure 6 and the 64B/67B coding of SL3 as follows: (1) \(\begin{equation} 25.78125 \times 4 \times \frac{3968}{3968+64+32} \times \frac{64}{67} = 96.18 [Gbps] = 12.02 [GB/s]. \end{equation}\) On the other hand, the maximum bandwidth of VCSN is (2) \(\begin{equation} 25.78125 \times 4 \times \frac{9152}{9152+8+14+4+4+12} \times \frac{64}{66} = 99.54 [Gbps] = 12.44 [GB/s]. \end{equation}\) The Intel 100 Gbps Ethernet IP employs 64B/66B encoding as defined in IEEE Standard 802.3bj [5].
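Equations (1) and (2) can be checked numerically; the sketch below simply re-evaluates the two formulas (four lanes at 25.78125 Gbps, scaled by payload efficiency and line coding).

```python
# Re-evaluating Equations (1) and (2): per-lane rate x 4 lanes,
# scaled by payload efficiency and physical-layer coding.
LANE_GBPS, LANES = 25.78125, 4

def link_bw_gbps(payload_bytes: int, overhead_bytes: int, coding: float) -> float:
    eff = payload_bytes / (payload_bytes + overhead_bytes)
    return LANE_GBPS * LANES * eff * coding

dcn_gbps = link_bw_gbps(3968, 64 + 32, 64 / 67)               # SL3, 64B/67B
vcsn_gbps = link_bw_gbps(9152, 8 + 14 + 4 + 4 + 12, 64 / 66)  # Ethernet, 64B/66B
# dcn_gbps  -> about 96.18 Gbps = 12.02 GB/s
# vcsn_gbps -> about 99.54 Gbps = 12.44 GB/s
```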

The maximum throughputs observed in the evaluation are 11.72 GB/s for DCN and 12.42 GB/s for VCSN. We assume VCSN comes closer to its theoretical value because it has no overhead from back-pressure control by a flow controller. The dashed line in Figure 9 shows the estimated throughput for VCSN with a flow controller. Even in this estimation, VCSN is superior to DCN in throughput, owing to the high payload efficiency of the VCSN protocol with Ethernet jumbo frames and the difference in coding between SL3 and the Ethernet IP.

Next, we measure the latency of inter-FPGA communication in VCSN and DCN. To measure the exact latency, we use loopback connections on both links and count the number of clock cycles between when data are output and when they are input again through the loopback connection. We calculate the latency from the cycle counts and the operating frequencies.

Table 1 shows the measured latencies. We counted the number of clock cycles at the AFU Shell crossbar from when data are output to the network to when they are input back from the network. According to the results, the latency of VCSN is approximately 1.7 times longer than that of DCN because a data transfer in VCSN takes a longer path through the switch. Note that all the cables used in this measurement have the same length of three meters. From these results, DCN has the advantage in latency, while VCSN has a higher maximum bandwidth than DCN.

Table 1.

Network | Path | Latency [ns]
VCSN | crossbar to crossbar | 851.1
DCN | crossbar to crossbar | 490.9

Table 1. Link Latency

4.3 Resource Utilization

The original purpose of the FPGA cluster is to implement applications and execute their processing. Therefore, we should reserve as many computational resources as possible on the FPGA for applications. For this reason, we investigate the resource usage on an FPGA for DCN and VCSN. Tables 2 and 3 show the resource utilization of both network subsystems. The numbers in the tables are per FPGA, which has two physical external ports, or four virtual ports, in this implementation. The amount of resources required for the network subsystem scales in proportion to the number of external ports used. An ALM is a logic module in Intel's FPGAs that consists of an 8-input LUT, two adders, and four registers. The tables also show the total available resources of one Intel Stratix 10 SX280 FPGA. It is clear from the results that VCSN occupies considerably more resources than DCN. Therefore, DCN may achieve higher performance because the application can use more computational resources.

Table 2.
Module | ALM | reg | M20k | DSP
AFU Shell | 41504.5 | 83071 | 479 | 0
Flow controller | 572.67 | 454 | 0 | 0
DCFIFO | 113.02 | 125 | 2 | 0
Width converter | 1285.43 | 626 | 0 | 0
Gap generator | 134.25 | 18 | 0 | 0
SL3 IP | 10568.9 | 10356 | 18 | 0
Total | 52073.4 | 93427 | 497 | 0
percentage | 1.89% | 0.85% | 4.24% | 0%
Stratix 10 SX280 | 2753000 | 11012000 | 11721 | 11520

Table 2. Resource Usage of the DCN Subsystem

Table 3.
Module | ALM | reg | M20k | DSP
AFU Shell | 100172.3 | 177405 | 3805 | 0
MUX | 8144.6 | 18241 | 0 | 0
DEMUX | 1051.3 | 3313 | 28 | 0
Encoder | 19205.2 | 21859 | 26 | 0
Decoder | 14071.0 | 21718 | 58 | 0
100G MAC IP | 15664.8 | 17984 | 0 | 0
Total | 149113.3 | 238966 | 3889 | 0
percentage | 5.42% | 2.17% | 33.18% | 0%

Table 3. Resource Usage of the VCSN Subsystem

Our previous paper [33] reported DCN and VCSN performance with an actual application. For the 2D LBM application used in that study, the maximum computational performance of a single FPGA is higher with the DCN system than with VCSN. This is not simply due to the larger amount of residual resources available with DCN, but also to the lower maximum operating frequency of VCSN caused by routing congestion on the FPGA. Therefore, DCN has an advantage in the maximum performance per FPGA. On the other hand, VCSN showed slightly higher performance than DCN when processing with multiple FPGAs at the same operating frequency in the previous study [33].

The large M20k consumption of VCSN is due to the large receive buffers required because there is no flow control function for inter-node communication. In this implementation, we employ on-chip M20K block memories for the receive buffers; we can use onboard DDR memory instead of on-chip memory if necessary. If we implement a flow controller similar to the one implemented in DCN for each virtual port, the number of consumed M20k memory blocks can be significantly reduced. In addition, although the encoders and decoders account for a large percentage of the VCSN implementation, they are implemented outside the rewritable AFU area, so apart from consuming resources they have almost no impact on performance metrics such as the operating frequency. Furthermore, neither DCN nor VCSN uses any DSPs, which are essential for floating-point operations. In conclusion, DCN requires very few FPGA resources to implement, and VCSN also consumes few computational resources, except for M20K.

4.4 Power Estimation

We also employ Intel Quartus Prime's PowerPlay Power Analyzer (Quartus version 19.2) to estimate power consumption from the compiled DCN and VCSN FPGA designs. Assuming a toggle rate of 12.5%, we estimate the power consumption per single FPGA board for the DCN and VCSN systems. Table 4 shows the estimated results. Overall, the VCSN system consumes about 9% more power than the DCN system. From Table 4, it is clear that VCSN's larger power consumption is due to its larger circuit size, as shown in Tables 2 and 3; the transceiver power consumption is also slightly higher for VCSN. We also discuss power consumption in the gather communication example based on this estimation in Section 6.

Table 4.
Item | DCN (mW) | VCSN (mW)
Total Power Dissipation | 54666.81 | 59632.8
Transceiver Standby Power Dissipation | 8609.15 | 8819.87
Transceiver Dynamic Power Dissipation | 11262.59 | 11791.31
I/O Standby Power Dissipation | 3573.02 | 3573.02
I/O Dynamic Power Dissipation | 18229.28 | 18229.10
Core Dynamic Power Dissipation | 6642.48 | 10808.22
Device Static Power Dissipation | 6350.28 | 6411.29

Table 4. Estimated Power Consumption for DCN and VCSN Systems

Skip 5COLLECTIVE COMMUNICATION IN THE FPGA CLUSTER Section

5 COLLECTIVE COMMUNICATION IN THE FPGA CLUSTER

This section shows a procedure for executing collective communication in the prototype FPGA cluster described in the previous sections. Collective communication is one of the critical operations in HPC applications, and it is suitable as a benchmark for a practical evaluation of network performance. We conduct an experimental evaluation of gather communication to demonstrate efficient communication in VCSN. In addition, we estimate the gather communication time for DCN and VCSN and show that a redundant virtual network topology built with VCSN can improve the efficiency of collective communication.

Note that our implementation has no specific module to speed up or optimize collective communication. In our experiments, we realize collective communication by combining the virtual link reconfiguration function of VCSN with point-to-point data transfers between FPGAs, without using any collective communication libraries. Our objective is to show whether VCSN or a traditional DCN is better suited for complex and heavily loaded communication patterns, such as the collective communication in our experiments, on large-scale FPGA clusters. We adopt gather communication as an example because it is commonly used for data validation, visualization, and so on, especially in HPC applications. In addition, since gather is a simple collective in which all data flow toward a single node, it lets us explain the improvement in communication efficiency by VCSN in a comprehensible way.

5.1 Execution and Control

First, we show how to execute and control the gather communication. In this paper, collective communication in the FPGA cluster repeats the data transfer between the onboard memories of the FPGA nodes described in the previous sections, mainly for the following reason. The FPGA on-chip memory is smaller than the onboard memory, so fine-grained communication control would be required to prevent data loss if we used the on-chip memory as a buffer. Such complex and detailed control often causes inefficient operation because of its overhead. For this reason, we use bulk data transfers between onboard memories for efficient collective communication.

5.1.1 DCN.

In the case of DCN, we need to set the order and combination of individual inter-node data transfers appropriately to exploit the network bandwidth sufficiently. In this implementation, it is necessary to fully exploit the receiving bandwidth of the destination node to achieve efficient gather communication. Since two external ports are available per FPGA node, the ideal situation is that both external ports of the destination node are always receiving data. The optimal procedure for gather communication in a one-dimensional ring network, which is feasible in this implementation, is described below.

Each inter-node communication follows the procedure explained in Section 3. Since DCN has no congestion control, multiple data transfers cannot share a single communication link at the same time. Figure 10(A) shows an example of gather communication in DCN with nine FPGA nodes. The black bidirectional arrows in the figure show the physical links.

Fig. 10.

Fig. 10. Data transfers for the gather communication in DCN and VCSN.

When we take FPGA 0 as the destination node of the gather communication, FPGAs 1-8 need to transfer data to FPGA 0. We divide the eight source nodes into two groups, each of which transfers data in the opposite direction, as shown by the red arrows in the figure. By dividing the nodes into two groups that forward data in opposite directions, we minimize the total distance of data transfers in the one-dimensional ring network. All inter-node data transfers are realized as communication between onboard memories; since the on-chip memory in an FPGA node is not that large (244 Mbits) and a part of it might be used for other purposes, we use the onboard memory instead.

In such implementations, we schedule data transfers in a pipelined fashion to make the best use of the network bandwidth. First, all source nodes transfer data to their adjacent nodes in the directions shown by the red arrows in parallel. At this time, each node transfers all of its data in a single batch to streamline the DMA control. FPGA 0 stores the data arriving from the two directions into memory by driving two SGDMAs. The other nodes transmit data to their adjacent nodes while receiving data from the opposite direction.

After these parallel transfers are complete, the same transfers start again on all nodes other than FPGAs 4 and 5, which hold no data at this point. By repeating these transfers, the onboard memory of FPGA 0 finally stores the data from all nodes. For the cluster with nine FPGA nodes in this example, the gather communication completes after four of the above parallel transfers in total. We use MPI_Barrier, an MPI function, for the completion process of each parallel transfer. The IRQ described in Section 3 notifies the local CPU server of the completion of an individual data transfer from the SGDMA modules. Using the MPI_Barrier function, once all the CPU servers connected to the FPGA nodes involved in the communication have confirmed the IRQ, the system moves on to the next transfer phase.

If the number of dimensions of the network topology changes with a different FPGA board, the number of implemented SGDMAs must also change to perform communication in the same fashion. For example, one FPGA node can send or receive four data transfers simultaneously in a two-dimensional torus network. In this case, we need to perform four DMA transfers simultaneously, which requires four SGDMA read/write pairs in the AFU Shell. The communication time might shorten due to the larger available bandwidth of higher-dimensional topologies; however, the control for efficient communication becomes more complicated.

Figure 11 shows the breakdown of the gather communication time in DCN, assuming that all FPGA nodes hold the same amount of data before the communication. The horizontal direction represents the elapsed time, and each row corresponds to one FPGA node and shows the processing on that node. At the beginning, we configure the crossbar in all nodes to establish the data transfer paths, shown as blue boxes. Next, we start transferring data to the destinations indicated by the pentagonal arrows in the figure. Then the synchronization process, shown as pink boxes, runs on all nodes to ensure that all transfers finish before moving to the next stage. We define a data transfer stage as the span from the crossbar setting to the completion process. For example, in the first transfer stage, the initial data of FPGA 4 are sent to FPGA 3; the data are then transferred repeatedly until they arrive at FPGA 0 in the following three stages. Thus, the gather communication of nine FPGAs with DCN consists of four transfer stages, in each of which all transfers move the same amount of data in parallel.

Fig. 11.

Fig. 11. Breakdown of the gather communication time in DCN.
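The two-direction pipelined schedule described above can be simulated in a few lines to confirm the stage count; this is a sketch of the schedule only (whole-batch forwarding toward node 0 each stage), not of the hardware.

```python
from math import ceil

def dcn_ring_gather_stages(n: int) -> int:
    """Count transfer stages for the two-direction pipelined gather on an
    n-node ring with destination node 0, forwarding whole batches per stage."""
    data = {i: {i} for i in range(1, n)}   # payloads currently held per node
    received, stages = set(), 0
    while len(received) < n - 1:
        incoming = {i: set() for i in range(n)}
        for i, held in data.items():
            # the lower half forwards via decreasing index, the upper half
            # via increasing index (mod n), both toward node 0
            nxt = i - 1 if i <= n // 2 else (i + 1) % n
            incoming[nxt] |= held
        received |= incoming[0]
        data = {i: s for i, s in incoming.items() if i != 0 and s}
        stages += 1
    return stages
# dcn_ring_gather_stages(9) -> 4, matching the example above
```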

5.1.2 VCSN.

In the case of VCSN, an appropriate order and combination of individual data transfers can achieve efficient gather communication, too. In addition, a redundant virtual topology designed with communication efficiency in mind can make the gather communication even more efficient. Since changing the virtual topology requires time to access the MAC table of each FPGA node, our strategy is to set an appropriate topology at the beginning and not to change it afterward. In this implementation, we prepare two virtual ports for each external port of the FPGA node, that is, four virtual ports per FPGA node.

The congestion control function in VCSN can handle multiple concurrent data transfers to a single external port of an FPGA node. However, we do not use multiple data transfers that cause congestion in this communication, as in DCN. When two virtual ports of one external port receive data at the same time, control and buffering in the AFU Shell are required. Since the prototype implementation of VCSN does not have the flow controller, buffering with on-chip memory is necessary to avoid data loss; however, as mentioned earlier, the available on-chip memory is limited. It is also possible to write the inputs from two virtual ports of a single external port directly to the onboard memory, but in this case a single external port occupies both SGDMAs of the FPGA node, making communication inefficient.

Figure 10(B) shows an example of the virtual network topology for the gather communication with nine FPGA nodes. The bidirectional arrows with dashed lines in the figure show the virtual links. At first, we configure the virtual topology shown in the figure by updating the MAC tables of all nodes. Then we start the data transfers following the procedure explained in Section 3.

As mentioned before, we use only two of the four virtual ports of FPGA 0 simultaneously to receive data. To utilize the bandwidth of the physical external ports sufficiently, one virtual port assigned to each external port is selected and used at a time. Therefore, FPGA 0 first receives data from FPGAs 1 and 5 (red arrows in the figure). Data transfers from FPGAs 1, 2, 5, and 6 follow the same pipelined fashion described for DCN. Meanwhile, the other FPGA nodes would leave their available bandwidth unused during these transfers. Although the virtual links FPGA 0-3 and 0-7 are not available at that time, the links FPGA 3-4 and 7-8 are. Therefore, we perform the data transfers from FPGA 4 to 3 and from 8 to 7 during the red-arrow transfers. After the transfers shown by the red arrows are complete, the data collected in FPGAs 3 and 7 are finally transferred together to FPGA 0. As a result, VCSN can reduce the number of data transfer stages compared to DCN by utilizing otherwise unused virtual links.

Figure 12 shows the breakdown of the total gather communication time in VCSN. Compared with DCN, both the number of data transfer stages and the total number of data transfers are reduced. Also, in the last transfer stage, the initial data of two FPGA nodes (FPGAs 5 and 6, or 7 and 8) are transferred together. Thus, we can reduce the communication overhead, such as the crossbar setup and completion process, by reducing the number of transfer stages. It also reduces the effect of the latency shown in gray in the figure. This approach reduces the number of stages in VCSN by approximately \(\lceil \frac{n-1}{4} \rceil - 1\) compared to DCN, where \(n\) is the number of nodes.

Fig. 12.

Fig. 12. Breakdown of the gather communication time in VCSN.

5.2 Comparison of the Gather Communications in DCN and VCSN

In the above example, the difference between DCN and VCSN is slight, but in a large-scale network, the communication efficiency provided by VCSN shows its advantage. Denoting the number of nodes by \(n\), the number of transfer stages of gather communication in DCN is \(\lceil \frac{n-1}{2} \rceil .\) In comparison, the number of stages in VCSN for the same communication is \(\lceil \frac{n-1}{4} \rceil + 1 .\)
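A quick sketch of the two closed-form stage counts above gives a feel for how the gap grows with the node count:

```python
from math import ceil

def dcn_stages(n: int) -> int:
    return ceil((n - 1) / 2)       # transfer stages for gather in DCN

def vcsn_stages(n: int) -> int:
    return ceil((n - 1) / 4) + 1   # transfer stages for gather in VCSN

# n = 9:  4 stages (DCN) vs 3 (VCSN); n = 64: 32 vs 17
```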

Since the amount of transferred data is equal, the sum of the transfer times, shown as the pentagonal arrows without latencies in Figures 11 and 12, is equal for DCN and VCSN. However, VCSN reduces the number of transfer stages, each of which includes the crossbar setting and the completion process, by \(\lceil \frac{n-1}{4} \rceil - 1\) compared to DCN. From these equations, the larger the network is, the more advantageous VCSN becomes. In addition, the larger the number of nodes, the higher the cost of one completion process by MPI_Barrier, so VCSN, which requires fewer completion processes than DCN, is also advantageous in this respect.

The above equations hold when the number of external ports per node is two. If we consider a configuration identical to the previous example except for the network subsystem, but with four external ports, the numbers of transfer stages for DCN and VCSN become \(\lceil \frac{n-1}{4} \rceil + 1\) and \(\lceil \frac{n-1}{8} \rceil + 3\), respectively. Note that we assume the AFU Shell has the proper number of mSGDMA modules for each data transfer in this case. Also, if the number of virtual ports per external port is set to four in the same configuration as the first example, the number of transfer stages for gather communication by VCSN is \(\lceil \frac{n-1}{8} \rceil + 3\) with two external ports per node. Thus, the higher-dimensional redundant virtual topology provided by VCSN can accelerate the collective communication in the cluster.

5.3 Virtual Topology

We also need to consider an appropriate virtual topology for the gather communication. The topology proposed above intends to sufficiently utilize the input bandwidth of the destination FPGA node and to minimize the overhead of configuring the virtual topology. For this purpose, the emphasis is on scheduling data transfers so that the destination node is always receiving data, and on transferring as much data as possible in batches.

Since there are other possible approaches, we select the best one from several candidates. First, we consider a one-dimensional (1-D) ring topology, the same as in DCN, which can fully exploit the input bandwidth of the destination node. Second, we consider a tree topology, for which the optimal procedure is the same as the first example given in the last subsection. Finally, we consider a method that changes the topology during communication; it can minimize the number of data transfers by reconfiguring the virtual topology in each stage so that source and destination nodes are linked directly. For these three cases, we estimate the total communication time in an n-node cluster.

The following equations represent the total communication time with \(n\) FPGA nodes in cases of the three examples listed above, a ring topology (3), a tree topology (4), and virtual link reconfiguring (5), respectively. (3) \(\begin{equation} T^{ring}_{total} = t + \left\lceil \frac{n-1}{2} \right\rceil \left(c + l + \frac{d}{b} + p \right) \end{equation}\) (4) \(\begin{equation} T^{tree}_{total} = t + \left\lceil \frac{n-1}{2} \right\rceil \frac{d}{b} + \left\lceil \frac{n-1}{4} \right\rceil (c + l + p) \end{equation}\) (5) \(\begin{equation} T^{vlink}_{total} = \left\lceil \frac{n-1}{4} \right\rceil t + \left\lceil \frac{n-1}{2} \right\rceil \left(c + l + \frac{d}{b} + p \right) \end{equation}\)

In these equations, we assume that the overheads of the communication take fixed amounts of time: \(t\) for the topology reconfiguration, \(c\) for the crossbar setting, and \(p\) for the completion process. We also assume that the latency \(l\) is a fixed value for paths with the same hop count. In addition, \(d\) and \(b\) represent the data size in each node and the bandwidth of the virtual links, respectively.
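Equations (3)-(5) translate directly into code. The parameter values in the comparison below are illustrative assumptions only (not measured values), chosen so that all overheads are nonzero.

```python
from math import ceil

def t_ring(n, t, c, l, p, d, b):   # Equation (3)
    return t + ceil((n - 1) / 2) * (c + l + d / b + p)

def t_tree(n, t, c, l, p, d, b):   # Equation (4)
    return t + ceil((n - 1) / 2) * (d / b) + ceil((n - 1) / 4) * (c + l + p)

def t_vlink(n, t, c, l, p, d, b):  # Equation (5)
    return ceil((n - 1) / 4) * t + ceil((n - 1) / 2) * (c + l + d / b + p)

# Illustrative parameters (seconds, bytes, bytes/s) -- assumed, not measured:
args = dict(n=9, t=1e-3, c=1e-4, l=1e-6, p=1e-3, d=64e6, b=12.4e9)
# With any positive overheads, the tree topology gives the smallest total time.
```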

Since the tree topology achieves the gather communication in the shortest time among these candidates, we employ it for the evaluation with VCSN. In this tree topology, the data are finally gathered at the root node, FPGA 0 in Figure 10. The root node is connected to four nodes by virtual links, and every node except the root connects to at most two lower nodes. In this paper, we build the virtual topology to minimize the number of layers in the tree structure. At every step, all nodes holding data transfer them to the node one level above. Since the current system on an FPGA can receive two data transfers simultaneously, data move up one layer in each step. However, the root node cannot receive four inputs simultaneously, so it receives all the forwarded data from two priority directions before receiving the data from the remaining two directions. While the root receives data from the priority directions, all data in the remaining two directions are gathered at the root's adjacent nodes; the root node then receives these remaining data together at the end. We use this algorithm, also shown in Figure 12, to achieve efficient gather communication with VCSN.

Skip 6PERFORMANCE EVALUATION AND ESTIMATION Section

6 PERFORMANCE EVALUATION AND ESTIMATION

This section provides an evaluation of the gather communication in the FPGA cluster prototype following the description in Section 5. We demonstrate the acceleration of the collective communication on the FPGA cluster by VCSN on the system described so far. We also estimate the performance of collective communication on a larger system using the detailed data obtained from the experiments.

6.1 Systems and Tools for the Experiments

Each CPU server in the FPGA cluster has an Intel Xeon Gold 5122 CPU consisting of four cores; all eight servers have the same configuration. Our FPGA cluster system employs the AFU Shell Class (ASC) on the CPU servers to control the locally connected FPGAs. ASC is an API library we developed on top of the Intel Open Programmable Acceleration Engine (OPAE) API to abstract the control of Intel PAC boards.

However, ASC only handles the two local FPGA nodes connected to a single server. To control all FPGA nodes concurrently from one root CPU server, we install MPICH 3.0.4 on every server to introduce MPI functions. We do not use MPI's communication functions, such as MPI_Send or MPI_Recv, but only the parallel execution facility and MPI_Barrier to remotely control multiple FPGA nodes. We also use MPI_Wtime to obtain processing times among the servers for the evaluation. The InfiniBand network of the CPU servers is used only to control the entire cluster through MPI functions.

The physical cable connection of VCSN is dual-plane, as described in Section 3. All nodes connect to two network switches with 16 ports each. Therefore, this experiment does not involve communication through multiple switches.

6.2 Experiments and Results

We evaluate the gather communication with 4, 9, and 16 FPGA nodes in the cluster. By showing the communication performance at different network scales, we demonstrate the effectiveness of VCSN in large-scale clusters. The experiment follows the processes described in the previous sections. We use MPI functions, especially MPI_Barrier and MPI_Wtime, to control the entire cluster and measure communication time. To reduce the communication overhead, each node performs the data transfer and the crossbar setup for the next transfer in Figure 12 as one unit, followed by a single synchronization with MPI_Barrier. This reduces the overhead caused by MPI_Barrier compared to synchronizing after the data transfer and the crossbar setting separately.
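The saving from combining the two sub-phases can be made concrete: one barrier per step replaces two, so for any per-call barrier cost the synchronization overhead halves. The helper below is purely illustrative, not our measurement code.

```python
def sync_overhead(steps, barrier_cost):
    # Combined scheme: the data transfer and the crossbar setup for the
    # next transfer share a single MPI_Barrier per step.
    combined = steps * barrier_cost
    # Separate scheme: one barrier after the transfer and another after
    # the crossbar setting, i.e. two barriers per step.
    separate = 2 * steps * barrier_cost
    return combined, separate
```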

Figures 13–15 show the gather communication times and throughput in the FPGA cluster with VCSN and DCN containing 4, 9, and 16 nodes, respectively. The total communication time in the graphs includes all processes and overheads, as shown in Figures 11 and 12. We calculate the throughput from the total amount of data delivered to the destination and the observed total communication time.

Fig. 13. Total communication time and throughput for the gather communication with 4 nodes in DCN and VCSN.

Fig. 14. Total communication time and throughput for the gather communication with 9 nodes in DCN and VCSN.

Fig. 15. Total communication time and throughput for the gather communication with 16 nodes in DCN and VCSN.

The graphs for 4 and 9 FPGAs also include the measured gather communication performance with 4 and 8 CPU servers for comparison. The gather by CPU servers uses the 100 Gbps InfiniBand network described above. We execute MPI_Gather of MPICH and measure the communication time with MPI_Wtime, in the same way as the experiment on the FPGA-dedicated network. Note that the gather communication by the CPU servers does not involve FPGAs; the data is provided for comparison purposes only.

The cluster with VCSN has a shorter communication time than the one with DCN in all three graphs, except for the four-node cases with small communication data sizes. For small data sizes, the short latency of DCN communication is advantageous because the impact of latency is large. However, in the results for 9 and 16 nodes, VCSN is advantageous even though DCN's communication latency is shorter. We attribute this to VCSN's reduction in the number of data transfer stages described in Section 5, which reduces the communication overhead.

It is worth noting that VCSN is more advantageous as the size of the cluster becomes larger. This is consistent with the discussion of gather communication in DCN and VCSN shown in Section 5.2.

Comparing the throughput in each graph, both VCSN and DCN reach their highest values in the 9-node case. This is because the amounts of data received by the two external ports of the destination node are unbalanced when the network has 4 or 16 nodes, due to the odd numbers of source nodes, 3 and 15, respectively. In all cases, the larger the amount of data, the more time the data transfers themselves take, resulting in a smaller percentage of overhead and higher throughput. As shown in Equations (1) and (2), the theoretical maximum bandwidths of an FPGA node are \(12.02 \times 2 = 24.04\) GB/s and \(12.44 \times 2 = 24.88\) GB/s for DCN and VCSN, respectively. From Figures 13–15, we can observe that VCSN utilizes about 98% of the theoretical bandwidth at maximum, while DCN reaches about 72%.

In the DCN results, the achieved bandwidths are much lower than the theoretical bandwidth. Although the main target of this paper is VCSN, we briefly discuss why. First, DCN employs a protocol with lower payload efficiency than VCSN, as shown in Section 4. In particular, the encoding used to control the flow controller accounts for part of the difference in communication bandwidth between DCN and VCSN. In addition, the flow controller employs store-and-forward buffering to generate transfer packets, which can further reduce communication efficiency [23]. Second, there is the effect of backpressure control by the flow controller. For the point-to-point communication shown in Section 4, backpressure control is rarely necessary. In our experiments, we schedule data transfers so that congestion does not occur on each data path, but if the interval between transfers is short, backpressure control may reduce the communication bandwidth. Third, there is the effect of the completion processes, especially MPI_Barrier. In the experiment, the total numbers of FPGA nodes actually involved in the communication differ between DCN and VCSN. For example, as shown in Figures 11 and 12, with nine nodes the number of individual data transfers in DCN is twice that in VCSN. Therefore, the total number of nodes involved in the completion process in DCN is twice that in VCSN. Although MPI_Barrier actually synchronizes CPU servers, the overhead of the completion process in DCN is clearly higher than that in VCSN, regardless of the combination of FPGA node and CPU server.

Next, we compare MPI_Gather operations over the server network with gather operations over the FPGA-dedicated networks. The server network is superior when the data size is roughly 100 KB or less, while the FPGA-dedicated networks are faster for larger data. From these results, we can say that the server network's MPI collective has an advantage in latency, while the FPGA-dedicated networks have the advantage of high effective throughput. We also compare the performance with previous studies on accelerating FPGA-based collective communication. Haghi et al. [14] proposed a hardware accelerator called MPI-FPGA for high-performance collective communication. Their paper showed that gather communication of less than 1 KB by 64 FPGAs can be achieved on the order of a few microseconds. Compared to our gather communication evaluation, which has latencies of several hundred microseconds, their collective communication has much smaller latency. Although the principal purpose of our paper is not to accelerate collective communication, we expect that VCSN could also provide high-performance collective communication when combined with their low-latency MPI-FPGA implementation.

The experiments demonstrate that VCSN provides highly efficient collective communication with high bandwidth utilization. Now we discuss the performance estimation of larger-scale FPGA clusters based on these results.

6.3 Gather Communication Model for Large-Scale FPGA Clusters with VCSN

In this section, based on the actual communication evaluation results with VCSN and DCN, we model gather communication for large-scale FPGA clusters and estimate the performance by simulation. The purpose of this simulation is not to obtain accurate communication performance with each network but to clarify which one is more suitable as a network in a large FPGA cluster.

First, we define an equation to simulate the gather communication time. As a policy, we generalize the gather communication procedure shown in Figure 12 to scales that cannot be realized in our prototype system. Then, we calculate the total communication time with different parameters based on theoretical and observed values. In this estimation, we ignore the limitation imposed by the size of the onboard memory of each FPGA node.

We assume that a fat-tree network [3] connects the FPGA nodes in the estimation; Figure 16 shows an example. The figure shows a 3-level fat-tree network that provides sufficient bandwidth even for collective communication. Our estimation accounts for the latency of data transfers based on the number of levels of the fat-tree network, since the latency of inter-node communication varies with the number of switches the data passes through in a large-scale cluster. As shown in Figure 16, the larger the cluster, the greater the number of hops for inter-node communication via virtual links. To represent this latency variation accurately, we treat the theoretical bandwidth and latency separately, instead of using measured values that include the effect of latency. We also assume that the network employs 100 Gbps network switches with 36 ports, which is realistic in terms of installation cost.
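For the estimation, the number of fat-tree levels \(f\) can be derived from the textbook end-node capacities of a k-ary fat tree built from k-port switches [3]: roughly k nodes for a single switch, k²/2 for two levels, and k³/4 for three levels. The sketch below assumes those standard formulas with k = 36; it is an illustration, not the simulator used in the paper.

```python
def fat_tree_levels(n_nodes, ports=36):
    # Textbook end-node capacities of a k-ary fat tree built from k-port
    # switches: a single switch (1 level) hosts k nodes, 2 levels host
    # k^2/2, and 3 levels host k^3/4 (11,664 for k = 36).
    capacities = [ports, ports * ports // 2, ports ** 3 // 4]
    for levels, capacity in enumerate(capacities, start=1):
        if n_nodes <= capacity:
            return levels
    raise ValueError("cluster needs more than 3 fat-tree levels")
```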

Fig. 16. An example of the fat tree network for the FPGA cluster.

Another critical aspect of the estimation is the overhead of MPI_Barrier. Since our prototype system employs MPICH, we base the assumed overhead of MPI_Barrier on its implementation. According to [16], which proposed the barrier algorithm implemented in MPICH, MPI_Barrier among \(N\) servers completes in \(\lceil \log _2 N \rceil\) synchronization rounds. Based on observations in the FPGA cluster, we assume the following approximate formula for the overhead of MPI_Barrier, \(T_{barrier}(N)\). (6) \(\begin{equation} T_{barrier}(N) = 3.34 \times 10^{-5} \lceil \log _2 N \rceil + 5.84 \times 10^{-6} \end{equation}\)
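Equation (6) translates directly into code; the coefficients are the fitted values above.

```python
import math

def t_barrier(num_servers):
    # Equation (6): approximate overhead [s] of one MPI_Barrier among
    # N servers, which completes in ceil(log2 N) synchronization rounds.
    return 3.34e-5 * math.ceil(math.log2(num_servers)) + 5.84e-6
```

For example, a barrier among the eight CPU servers of our prototype costs about \(3 \times 3.34 \times 10^{-5} + 5.84 \times 10^{-6} \approx 106\) microseconds.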

Then we simulate the total time of the gather communication in the FPGA cluster following the procedure shown in Section 5. First, VCSN requires the topology configuration, whose overhead is represented by \(t\). In this process, we also access some other registers in the FPGAs to set up several modules, so \(t\) includes these overheads. Then we set up the crossbars to establish paths between the onboard memory and the network, whose overhead is represented by \(c\). After these processes, data transfers proceed until the destination node receives all data. As mentioned in the previous subsection, we perform the data transfer and the crossbar setting for the next transfer in succession. We calculate the data transfer time from the theoretical bandwidth \(b\) in Equation (2) and the transfer data size \(d\). The order of data transfers follows Figure 12.

We also introduce MPI_Barrier to guarantee the completion of each process among all CPU servers. The processes synchronized by MPI_Barrier are the topology configuration, the initial crossbar configuration, and each data transfer including the crossbar configuration for the next step. Based on these assumptions, we use the following equation to estimate the total gather communication time, \(T_{gather}\), in the FPGA cluster with \(n\) FPGA nodes. (7) \(\begin{equation} T_{gather} = t + \left\lceil \frac{n-1}{4} \right\rceil \left(c + f l + \frac{2d}{b}\right) + \left(\left\lceil \frac{n-1}{4} \right\rceil + 2 \right) T_{barrier} \left(\left\lceil \frac{n}{2} \right\rceil \right) \end{equation}\)

Note that \(N\) above is the number of CPU servers synchronized by MPI_Barrier, which differs from \(n\), the number of FPGA nodes: \(N=\lceil \frac{n}{2} \rceil\).
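Putting Equations (6) and (7) together, the estimator can be sketched as below. The parameter defaults follow Table 5's VCSN values (with \(b\) converted to bytes per second); this is a model sketch under those values, not the full simulator.

```python
import math

def t_barrier(num_servers):
    # Equation (6): approximate MPI_Barrier overhead among N servers [s].
    return 3.34e-5 * math.ceil(math.log2(num_servers)) + 5.84e-6

def t_gather(n, d, f, t=1.24e-4, c=1.35e-6, l=8.47e-7, b=12.44e9):
    # Equation (7): estimated gather time [s] for n FPGA nodes, each
    # initially holding d bytes, on an f-level fat tree; N = ceil(n/2)
    # CPU servers take part in every MPI_Barrier.
    stages = math.ceil((n - 1) / 4)
    servers = math.ceil(n / 2)
    return (t
            + stages * (c + f * l + 2 * d / b)
            + (stages + 2) * t_barrier(servers))
```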

Table 5 lists the parameters in Equation (7). We use the theoretical value of 12.44 GB/s for \(b\) from Equation (2) and the measured value in Table 1 for \(l\). For \(t\) and \(c\), we use times measured on a single CPU server; the other values are determined by the number of nodes \(n\). To verify the accuracy of the gather communication time estimated by this equation, Figure 17 compares the estimated and observed times. In the graph, the difference between the estimated and measured values is particularly large when the data size is small. We investigate the cause of this with verification on the actual equipment.

Fig. 17. Comparison of estimated and observed gather communication times.

| parameter | explanation | value |
| \(n\) | number of FPGA nodes | |
| \(N\) | number of CPU servers | |
| \(t\) | topology reconfiguration time [s] | \(1.24 \times 10^{-4}\) |
| \(c\) | crossbar configuration time [s] | \(1.35 \times 10^{-6}\) |
| \(f\) | number of levels in fat tree | depends on the number of FPGAs |
| \(l\) | latency [s] | \(8.47 \times 10^{-7}\) for VCSN, \(4.91 \times 10^{-7}\) for DCN |
| \(d\) | initial data size in each node [Byte] | |
| \(b\) | theoretical bandwidth [GB/s] | 12.44 for VCSN, 12.02 for DCN |
| \(T_{barrier}\) | overhead of MPI_Barrier | Equation (6) |

Table 5. List of Parameters in the Gather Communication Model

Even with the synchronization by MPI_Barrier, there are still time gaps in the processing on each node. Figure 18 shows the time required for a 1 ms dummy process followed by MPI_Barrier, for 1,000 attempts in parallel on eight CPU servers. From this graph, the average over the 1,000 attempts is about 1.28 ms. However, the estimated MPI_Barrier time for this case is about 1.11 ms by Equation (6), and this 0.17 ms difference is not negligible in this estimation. The smallest of the 1,000 observed times is almost identical to the estimated value; thus, some factors randomly increase the MPI_Barrier time. We assume two factors contribute to this phenomenon: the difference in start times among the processes on the nodes, and the effect of network latency and jitter. Since such time fluctuations in parallel processing are difficult to formalize, we conduct a quantitative evaluation of the overhead on our system to find the most reliable correction value.

Fig. 18. Time fluctuation of 1ms dummy process plus MPI_Barrier for each attempt among 8 CPU servers.

Figure 19 shows the differences between the times of the processes themselves and the observed times when we run dummy processes in parallel on eight CPU servers with MPI_Barrier synchronization. The blue and red broken lines at the bottom show the time of MPI_Barrier alone and the approximation of Equation (6), respectively. From this graph, we can infer that the length of the processing time affects the extra overhead of MPI_Barrier. The observed extra times fall into two major categories, depending on whether the dummy process is longer or shorter than about 500 microseconds. Therefore, this paper uses the two average values obtained from these categories to correct the MPI_Barrier time: we add 210 microseconds of extra overhead if the processing time synchronized by MPI_Barrier is longer than 500 microseconds, and 52 microseconds if it is shorter. These corrections are averages of the differences between the MPI_Barrier-only times and the observed times shown in the graph. Note that almost all of the individual processes in our estimation take less than 80 milliseconds.
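The correction described above can be captured in a small wrapper around Equation (6); the 210 µs and 52 µs constants are the averages reported here, and the helper name is ours.

```python
import math

def t_barrier(num_servers):
    # Equation (6): baseline MPI_Barrier overhead model [s].
    return 3.34e-5 * math.ceil(math.log2(num_servers)) + 5.84e-6

def t_barrier_corrected(num_servers, synced_work):
    # Add the empirically observed extra overhead: about 210 us when the
    # work synchronized by the barrier exceeds 500 us, about 52 us otherwise.
    extra = 210e-6 if synced_work > 500e-6 else 52e-6
    return t_barrier(num_servers) + extra
```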

Fig. 19. Overheads of MPI_Barrier for synchronization of parallel processing on multiple servers. It shows how the overhead of MPI_Barrier behaves for the various processing times shown in the legend (in seconds).

Table 6 shows the error ratios between the estimated and measured results before and after the correction. Although some errors remain, the estimates for 9 and 16 nodes are accurate even for small data sizes. Therefore, we use the corrected model for the estimation.

| Data size in each node | Original model, 4 nodes | Original model, 9 nodes | Original model, 16 nodes | Modified model, 4 nodes | Modified model, 9 nodes | Modified model, 16 nodes |
| 256 MiB | 0.17% | 0.60% | 0.49% | 0.23% | 0.24% | 0.18% |
| 128 MiB | 1.94% | 1.22% | 0.88% | 0.40% | 0.43% | 0.43% |
| 64 MiB | 3.83% | 2.25% | 1.91% | 0.70% | 0.97% | 0.66% |
| 32 MiB | 6.80% | 3.98% | 3.50% | 1.74% | 2.16% | 1.45% |
| 16 MiB | 11.90% | 6.34% | 5.66% | 3.38% | 4.93% | 3.64% |
| 8 MiB | 20.15% | 9.83% | 10.41% | 10.23% | 2.94% | 5.24% |
| 4 MiB | 27.24% | 14.42% | 12.70% | 12.12% | 3.56% | 3.91% |
| 2 MiB | 30.57% | 17.19% | 16.33% | 8.80% | 1.55% | 2.90% |
| 1 MiB | 34.51% | 15.66% | 21.98% | 6.97% | 5.45% | 4.15% |
| 512 KiB | 38.92% | 17.88% | 22.38% | 7.96% | 6.66% | 0.12% |
| 256 KiB | 42.21% | 19.76% | 23.68% | 9.578% | 6.78% | 1.89% |
| 128 KiB | 43.36% | 16.91% | 27.52% | 9.43% | 12.13% | 1.20% |
| 64 KiB | 45.49% | 21.14% | 26.19% | 11.81% | 7.23% | 1.79% |
Table 6. Error Rate of Gather Communication Time Estimated by the Model Compared to the Actual Measurement

We also estimate the performance of the gather communication on DCN based on the procedure shown in Figure 11. This estimation uses theoretical parameters rather than the actual experimental results, in which DCN achieved only about 70% of the theoretical bandwidth. Figure 20 shows the results of the corrected estimation of the gather communication time on large-scale FPGA clusters with VCSN and DCN.

Fig. 20. Estimation of gather communication for the large-scale FPGA clusters based on our experiment.

In Figure 20, the color of each line indicates the data size; solid lines show the estimated results for VCSN and dashed lines for DCN. The graph shows almost no difference between VCSN and DCN when the data size is large, while VCSN becomes more advantageous as the data size shrinks. For a data size of 256 MiB, the communication time with DCN is 105% of that with VCSN, while it reaches 197% for 64 KiB. This is because VCSN improves communication efficiency not by increasing bandwidth or reducing latency, but by reducing the overhead of the configuration and completion processes. The smaller the data size, the larger the fraction of the total time such overheads account for, which makes VCSN advantageous. The estimation results show that VCSN is more suitable than DCN for large clusters due to its higher scalability.

6.4 Discussion on Performance Difference between DCN and VCSN

Based on the results presented above, we discuss the essential difference between DCN and VCSN in the performance of gather communication. The evaluations and estimates so far are affected by the implementation and protocol of each network. Here we set these factors aside and discuss the performance achievable by DCN and VCSN under exactly the same conditions.

First, we discuss control methods that minimize overhead as much as possible. For both DCN and VCSN, reducing control processing by the CPU is critical for efficient, low-overhead control. Therefore, we assume local control of data transfers with flow control and batch configuration of the transfer controls. Flow control almost eliminates the possibility of data loss in inter-FPGA communication, allowing individual FPGAs to control data transfers locally without global synchronization. Furthermore, if the scheduled transfer process is uploaded to the FPGA in advance, no CPU processing is necessary during the communication. Assuming the elimination of CPU-involved processing between data transfer stages, we consider the performance of each network.

The gather communication time under such a control scheme can be estimated by removing the control overhead between data transfer stages from Figures 11 and 12. Only the first CPU processing is needed in both the DCN and VCSN cases; neither network requires CPU control after that. Although the number of transfer stages is greater for DCN, the actual amount of data transferred is the same for VCSN and DCN. This means that the essential difference between DCN and VCSN is only the number of latency periods incurred (the gray parts in Figures 11 and 12).

Therefore, the improvement in communication performance with VCSN comes from the increased efficiency achieved by aggregating multiple data transfers to hide their overheads and latency. In conclusion, we believe that VCSN is more effective than DCN for communication patterns that, like gather communication, involve a large number of data transfers.

6.5 Comparison with Related Studies

Since the graph in Figure 20 alone is not sufficient to position the contribution of this study, we compare it with previous studies. Haghi et al. [14] proposed a unique FPGA-based approach to accelerate MPI collectives on conventional clusters and simulated the performance of collective communication with their method. In their simulation, the gather communication for 128 nodes with 1 KB of data took 17 microseconds, about 1,000 times faster than our estimation result. However, their simulation ignored completion processes such as MPI_Barrier and other configuration overheads. For gather communication of data as small as 1 KB, the communication itself takes only about 12 microseconds in VCSN, while the other overheads account for more than 99.9% of the total operation time. Also, since their simulations cover only data sizes below 1 KB, we cannot compare them properly with our estimation results. The purpose of their simulation is therefore clearly different from our goal of improving the efficiency of collective communication of large-scale data with the operation of an HPC system in mind.

Since there are few studies on collective communication on FPGA clusters, especially on gather communication comparable to our study, we also compare with studies on collective communication on conventional clusters. Kandalla et al. [18] proposed a topology-aware algorithm for the MPI_Gather operation. Their optimized approach achieved an MPI_Gather operation with a 1 MB message size under quiet conditions in about 600 milliseconds. Since the gather communication time for 512 nodes and 1 MiB in our estimated results is about 130 milliseconds, our approach is faster. However, their study was published more than 10 years ago, and the assumptions differ considerably, such as the use of InfiniBand DDR as the network.

A relatively recent study that allows an appropriate comparison with this paper is the one by Chakraborty et al. [9]. It proposes an algorithm to avoid contention in collective communication and discusses the acceleration of MPI collectives by introducing the concept of Throttle, which represents the number of processes that can access the root process at the same time. In their work, gather communication of 16 MB among 28 processes is achieved in about 20 ms. However, this result is for a Throttle of 14; for a Throttle of 2, which matches our condition, their result is about 100 ms. Our estimation shows that gather communication of 16 MiB among 32 nodes takes about 25 ms; therefore, ours is faster under the same condition. Also, since their paper targets MPI collectives between processes in a single node, gather communication over a network would be expected to increase the operation time significantly.

As discussed above, as far as we can find, there are few previous studies suitable for comparison with ours. However, in comparisons under similar conditions, our estimated times, which account for the actual communication overhead, are fast, regardless of whether the prior work involves FPGAs.

6.6 Discussion on the Performance of Other Collective Communications

The results of the experiments and estimations in this study enable us to discuss the performance of collective communications other than gather in FPGA clusters. Below, we discuss Scatter and Allgather with VCSN in the environment of this study.

6.6.1 Scatter.

In scatter communication, the root node distributes data to all nodes in the network. Assuming the same approach as the gather communication in VCSN described earlier, the directions and order of the individual data transfers in scatter communication are the opposite of those in the gather communication shown in Figure 12. Since the number of data transfers and the control overhead are expected to be the same as for gather communication, the total operation time required for scatter communication will be almost equal to that for gather communication shown in the above evaluation and estimation. In scatter communication, the network I/O bandwidth of the root node determines the overall performance, as in gather communication.

6.6.2 Allgather.

Allgather communication gathers all data in the network into every node. The simplest approach to Allgather is to gather the data from all nodes into one node and then send the gathered data to all nodes by broadcast. In our implementation, we can realize this by transferring the data collected by the gather communication to all nodes using the broadcast frame function of the Ethernet switch. This approach is effective if the entire network is connected by a single switch or if the connections between switches form a simple tree. However, if there are loop paths in the connections between switches, as in the fat tree used in this estimation, this method is impossible due to broadcast storms. In summary, in a small network such as the one used in this experiment, we can realize Allgather communication by broadcasting the gathered data from the root node after the gather communication. In a large-scale network that considers available bandwidth, such as the fat tree used in this estimation, an appropriate method to prevent broadcast storms is necessary. In VCSN, in addition to the virtual network reconfiguration function, multicast and broadcast are also available as Ethernet functions. We believe that Allgather can be realized efficiently by combining these functions; however, proposing a concrete method is a subject for future work.

7 CONCLUSIONS

This paper proposed the Virtual Circuit-Switching Network (VCSN), which provides a flexible and simple-to-operate network for large-scale HPC FPGA clusters. VCSN can serve diverse applications with multiple concurrent users by providing appropriate network partitioning and low-cost topology reconfiguration for the FPGA cluster. In addition, users of the cluster can control inter-FPGA communication with simple operations, as if a circuit-switching network were available. We designed and implemented a network subsystem that realizes VCSN as hardware on the FPGA. It allows the user to change the virtual topology freely by simply accessing registers in the FPGA, providing more user-friendly and flexible network communication than a conventional Ethernet connection between FPGAs. In addition, we employed Ethernet jumbo frames, whose high payload efficiency yields high available bandwidth for practical communication.

We evaluated the communication performance of a virtual link in VCSN and compared it with that of the direct connection network (DCN). The results showed that inter-FPGA communication over VCSN achieves performance comparable to DCN. Furthermore, we built a prototype system for performance evaluation, including the AFU Shell implementation, to verify the practical communication performance with VCSN.

To demonstrate the high scalability of VCSN, we evaluated collective communication performance, which is important for HPC systems, on the prototype FPGA cluster. First, we showed a concrete procedure for performing efficient gather communication using VCSN and compared its theoretical performance with DCN. In this context, we showed that the flexible network topology provided by VCSN improves the performance of gather communication in large-scale networks, and that our virtualization allows users to achieve complicated communication, such as collective communication, with relatively simple operations. Next, we deployed MPI on the prototype FPGA cluster and measured the gather communication time, including control overheads such as MPI_Barrier. This experiment showed that gather communication over VCSN is faster than over DCN. Finally, we estimated the gather communication time in a large FPGA cluster based on the above experimental results. As in our experiments, we estimated the total communication performance including the overhead of actual operation to demonstrate the high scalability of VCSN. The estimation results show that VCSN reduces the communication overhead of gather communication significantly compared with DCN, especially when the data size is small.

ACKNOWLEDGMENTS

This work was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers JP20H00593, JP17H01706, and JP22K17870. The authors also appreciate the support of the FS2020 project.

REFERENCES

[1] Abel Francois, Weerasinghe Jagath, Hagleitner Christoph, Weiss Beat, and Paredes Stephan. 2017. An FPGA platform for hyperscalers. In 2017 IEEE 25th Annual Symposium on High-Performance Interconnects (HOTI). 29–32.
[2] Ajima Yuichiro, Kawashima Takahiro, Okamoto Takayuki, Shida Naoyuki, Hirai Kouichi, Shimizu Toshiyuki, Hiramoto Shinya, Ikeda Yoshiro, Yoshikawa Takahide, Uchida Kenji, and Inoue Tomohiro. 2018. The Tofu interconnect D. In 2018 IEEE International Conference on Cluster Computing (CLUSTER). 646–654.
[3] Al-Fares Mohammad, Loukissas Alexander, and Vahdat Amin. 2008. A scalable, commodity data center network architecture. SIGCOMM Comput. Commun. Rev. 38, 4 (Aug. 2008), 63–74.
[4] Alachiotis Nikolaos, Berger Simon A., and Stamatakis Alexandros. 2010. Efficient PC-FPGA communication over gigabit ethernet. In 2010 10th IEEE International Conference on Computer and Information Technology. 1727–1734.
[5] IEEE Standards Association. 2022. IEEE 802.3bj-2014. https://standards.ieee.org/ieee/802.3bj/5551/. (Jan. 2022).
[6] Awad Mariette. 2009. FPGA supercomputing platforms: A survey. In 2009 International Conference on Field Programmable Logic and Applications. 564–568.
[7] Benner Alan. 2012. Optical interconnect opportunities in supercomputers and high end computing. In OFC/NFOEC. 160.
[8] Caulfield Adrian M., Chung Eric S., Putnam Andrew, Angepat Hari, Fowers Jeremy, Haselman Michael, Heil Stephen, Humphrey Matt, Kaur Puneet, Kim Joo-Young, Lo Daniel, Massengill Todd, Ovtcharov Kalin, Papamichael Michael, Woods Lisa, Lanka Sitaram, Chiou Derek, and Burger Doug. 2016. A cloud-scale acceleration architecture. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–13.
[9] Chakraborty Sourav, Subramoni Hari, and Panda Dhabaleswar K. 2017. Contention-aware kernel-assisted MPI collectives for multi-/many-core systems. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). 13–24.
[10] De Matteis Tiziano, de Fine Licht Johannes, Beránek Jakub, and Hoefler Torsten. 2019. Streaming message interface: High-performance distributed memory programming on reconfigurable hardware. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19). Association for Computing Machinery, New York, NY, USA, Article 82, 33 pages.
[11] Paderborn Center for Parallel Computing. 2022. Noctua. https://wikis.uni-paderborn.de/pc2doc/Noctua. (Jan. 2022).
[12] Gao Shanyuan, Schmidt Andrew G., and Sass Ron. 2009. Hardware implementation of MPI_Barrier on an FPGA cluster. In 2009 International Conference on Field Programmable Logic and Applications. 12–17.
[13] George Alan D., Herbordt Martin C., Lam Herman, Lawande Abhijeet G., Sheng Jiayi, and Yang Chen. 2016. Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects. In 2016 IEEE High Performance Extreme Computing Conference (HPEC). 1–7.
[14] Haghi Pouya, Guo Anqi, Xiong Qingqing, Patel Rushi, Yang Chen, Geng Tong, Broaddus Justin T., Marshall Ryan, Skjellum Anthony, and Herbordt Martin C. 2020. FPGAs in the network and novel communicator support accelerate MPI collectives. In 2020 IEEE High Performance Extreme Computing Conference (HPEC). 1–10.
[15] He Zhenhao, Parravicini Daniele, Petrica Lucian, O'Brien Kenneth, Alonso Gustavo, and Blott Michaela. 2021. ACCL: FPGA-accelerated collectives over 100 Gbps TCP-IP. In 2021 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC). 33–43.
[16] Hensgen Debra, Finkel Raphael, and Manber Udi. 1988. Two algorithms for barrier synchronization. International Journal of Parallel Programming 17, 1 (1988), 1–17.
[17] Ben Hur Rotem and Kvatinsky Shahar. 2016. Memory processing unit for in-memory processing. In 2016 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH). 171–172.
[18] Kandalla Krishna, Subramoni Hari, Vishnu Abhinav, and Panda Dhabaleswar K. 2010. Designing topology-aware collective communication algorithms for large scale InfiniBand clusters: Case studies with Scatter and Gather. In 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW). 1–8.
  19. [19] Khalilzad Nima Moghaddami, Yekeh Farahnaz, Asplund Lars, and Pordel Mostafa. 2011. FPGA implementation of real-time Ethernet communication using RMII interface. In 2011 IEEE 3rd International Conference on Communication Software and Networks. 3539. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  20. [20] Lofgren A., Lodesten L., Sjoholm S., and Hansson H.. 2005. An analysis of FPGA-based UDP/IP stack parallelism for embedded Ethernet connectivity. In 2005 NORCHIP. 9497. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  21. [21] Mariantoni Matteo, Wang H., Yamamoto T., Neeley M., Bialczak Radoslaw C., Chen Y., Lenander M., Lucero Erik, O’Connell A. D., Sank D., Weides M., Wenner J., Yin Y., Zhao J., Korotkov A. N., Cleland A. N., and Martinis John M.. 2011. Implementing the quantum von Neumann architecture with superconducting circuits. Science 334, 6052 (2011), 6165. DOI: arXiv:.Google ScholarGoogle ScholarCross RefCross Ref
  22. [22] Mavroidis Iakovos, Papaefstathiou Ioannis, Lavagno Luciano, Nikolopoulos Dimitrios S., Koch Dirk, Goodacre John, Sourdis Ioannis, Papaefstathiou Vassilis, Coppola Marcello, and Palomino Manuel. 2016. ECOSCALE: Reconfigurable computing and runtime system for future exascale systems. In 2016 Design, Automation and Test in Europe Conference and Exhibition (DATE). 696701.Google ScholarGoogle ScholarCross RefCross Ref
  23. [23] Mondigo Antoniette, Ueno Tomohiro, Sano Kentaro, and Takizawa Hiroyuki. 2020. Comparison of direct and indirect networks for high-performance FPGA clusters. In Applied Reconfigurable Computing. Architectures, Tools, and Applications, Rincón Fernando, Barba Jesús, So Hayden K. H., Diniz Pedro, and Caba Julián (Eds.). Springer International Publishing, Cham, 314329. Google ScholarGoogle Scholar
  24. [24] Mondigo Antoniette, Ueno Tomohiro, Tanaka Daichi, Sano Kentaro, and Yamamoto Satoru. 2017. Design and scalability analysis of bandwidth-compressed stream computing with multiple FPGAs. In 2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC). 18. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  25. [25] Niu Xin Yu, Tsoi Kuen Hung, and Luk Wayne. 2011. Reconfiguring distributed applications in FPGA accelerated cluster with wireless networking. In 2011 21st International Conference on Field Programmable Logic and Applications. 545550. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. [26] Papaphilippou Philippos, Meng Jiuxi, Gebara Nadeen, and Luk Wayne. 2021. Hipernetch: High-performance FPGA network switch. ACM Trans. Reconfigurable Technol. Syst. 15, 1, Article 3 (Nov.2021), 31 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. [27] Papaphilippou Philippos, Meng Jiuxi, and Luk Wayne. 2020. High-performance FPGA network switch architecture. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’20). Association for Computing Machinery, New York, NY, USA, 7685. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. [28] Putnam Andrew, Caulfield Adrian M., Chung Eric S., Chiou Derek, Constantinides Kypros, Demme John, Esmaeilzadeh Hadi, Fowers Jeremy, Gopal Gopi Prashanth, Gray Jan, Haselman Michael, Hauck Scott, Heil Stephen, Hormati Amir, Kim Joo-Young, Lanka Sitaram, Larus James, Peterson Eric, Pope Simon, Smith Aaron, Thong Jason, Xiao Phillip Yi, and Burger Doug. 2015. A reconfigurable fabric for accelerating large-scale datacenter services. IEEE Micro 35, 3 (2015), 1022. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. [29] Sheng Jiayi, Humphries Ben, Zhang Hansen, and Herbordt Martin C.. 2014. Design of 3D FFTs with FPGA clusters. In 2014 IEEE High Performance Extreme Computing Conference (HPEC). 16. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  30. [30] Sheng Jiayi, Xiong Qingqing, Yang Chen, and Herbordt Martin C.. 2017. Collective communication on FPGA clusters with static scheduling. SIGARCH Comput. Archit. News 44, 4 (Jan.2017), 27. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. [31] Sheng Jiayi, Yang Chen, Sanaullah Ahmed, Papamichael Michael, Caulfield Adrian, and Herbordt Martin C.. 2017. HPC on FPGA clouds: 3D FFTs and implications for molecular dynamics. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL). 14. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  32. [32] Tarafdar Naif, Lin Thomas, Fukuda Eric, Bannazadeh Hadi, Leon-Garcia Alberto, and Chow Paul. 2017. Enabling flexible network FPGA clusters in a heterogeneous cloud data center(FPGA’17). Association for Computing Machinery, New York, NY, USA, 237246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. [33] Ueno Tomohiro, Koshiba Atsushi, and Sano Kentaro. 2021. Virtual circuit-switching network with flexible topology for high-performance FPGA cluster. In 2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP). 4148. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  34. [34] Ueno Tomohiro, Miyajima Takaaki, and Sano Kentaro. 2022. FPGA-dedicated network vs. server network for pipelined computing with multiple FPGAs. In International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2022). Association for Computing Machinery, New York, NY, USA, 9091. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. [35] Tsukuba Center for Computational Sciences University of. 2022. Cygnus. https://www.ccs.tsukuba.ac.jp/eng/supercomputers/. (Jan.2022).Google ScholarGoogle Scholar
  36. [36] Vesper Malte, Koch Dirk, Vipin Kizheppatt, and Fahmy Suhaib A.. 2016. JetStream: An open-source high-performance PCI Express 3 streaming library for FPGA-to-Host and FPGA-to-FPGA communication. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL). 19. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  37. [37] Véstias Mário and Neto Horácio. 2014. Trends of CPU, GPU and FPGA for high-performance computing. In 2014 24th International Conference on Field Programmable Logic and Applications (FPL). 16. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  38. [38] Wang Tianqi, Geng Tong, Li Ang, Jin Xi, and Herbordt Martin. 2020. FPDeep: Scalable acceleration of CNN training on deeply-pipelined FPGA clusters. IEEE Trans. Comput. 69, 8 (2020), 11431158. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  39. [39] Weerasinghe Jagath, Abel Francois, Hagleitner Christoph, and Herkersdorf Andreas. 2015. Enabling FPGAs in hyperscale data centers. In 2015 IEEE 12th Intl. Conf. on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl. Conf. on Autonomic and Trusted Computing and 2015 IEEE 15th Intl. Conf. on Scalable Computing and Communications and its Associated Workshops (UIC-ATC-ScalCom). 10781086. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  40. [40] Zhang He, Kang Wang, Wang Lezhi, Wang Kang L., and Zhao Weisheng. 2017. Stateful reconfigurable logic via a single-voltage-gated spin hall-effect driven magnetic tunnel junction in a spintronic memory. IEEE Transactions on Electron Devices 64, 10 (2017), 42954301. DOI:Google ScholarGoogle ScholarCross RefCross Ref

Published in ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 2 (June 2023), 451 pages. ISSN: 1936-7406, EISSN: 1936-7414. DOI: 10.1145/3587031. Editor: Deming Chen.

Publisher: Association for Computing Machinery, New York, NY, United States.

Publication History:

• Received: 7 February 2022
• Revised: 29 July 2022
• Accepted: 18 December 2022
• Online AM: 13 January 2023
• Published: 11 March 2023
