Abstract
While FPGA accelerator boards and their respective high-level design tools are maturing, there is still a lack of multi-FPGA applications, libraries, and not least, benchmarks and reference implementations towards sustained HPC usage of these devices. As in the early days of GPUs in HPC, for workloads that can reasonably be decoupled into loosely coupled working sets, multi-accelerator support can be achieved by using standard communication interfaces like MPI on the host side. However, for performance and productivity, some applications can profit from a tighter coupling of the accelerators. FPGAs offer unique opportunities here when extending the dataflow characteristics to their communication interfaces.
In this work, we extend the HPCC FPGA benchmark suite by multi-FPGA support and three missing benchmarks that particularly characterize or stress inter-device communication: b_eff, PTRANS, and LINPACK. With all benchmarks implemented for current boards with Intel and Xilinx FPGAs, we established a baseline for multi-FPGA performance. Additionally, for the communication-centric benchmarks, we explored the potential of direct FPGA-to-FPGA communication with a circuit-switched inter-FPGA network that is currently only available for one of the boards. The evaluation with parallel execution on up to 26 FPGA boards makes use of one of the largest academic FPGA installations.
1 INTRODUCTION
The heterogeneity of High Performance Computing (HPC) systems has increased over the past years, and heterogeneous systems play an important role in the Top 500 list [20]. Field Programmable Gate Arrays (FPGAs) are good candidates for the acceleration of HPC systems, since current High Level Synthesis (HLS) tool flows have drastically decreased the development time while still offering high-quality results. With novel HPC systems emerging that come with a hybrid network consisting of an inter-CPU and a separate inter-FPGA network, new opportunities for scaling applications on these systems arise. Handling the required communication for the accelerator workloads over communication interfaces like the Message Passing Interface (MPI) through the inter-CPU network and updating the data on the device via PCIe comes with limitations. The workloads need to be reasonably decoupled to allow efficient, concurrent operation on multiple accelerators, which is not always possible. Some applications can profit from the higher bandwidths and shorter latencies achieved by a tighter coupling of the accelerators using direct inter-FPGA communication. This direct communication allows better utilization of the dataflow characteristics of reconfigurable execution pipelines by integrating the communication directly into the pipelines. Benchmarks are an important tool to evaluate the performance characteristics of the different inter-FPGA communication interfaces used in multi-FPGA systems.
However, for the HPC area, benchmark suites with relevant benchmark applications that allow the evaluation of these systems are rare. To overcome this shortage, we earlier proposed HPCC FPGA [15], based on the HPC Challenge benchmark suite [6] and targeting the HPC domain. In this previous work, we focused on the performance characterization of a single FPGA with regard to the memory access patterns of the applications. Some important benchmarks of the HPC Challenge are still missing in the proposed suite. These missing benchmarks, b_eff (a synthetic network bandwidth benchmark), PTRANS (a parallel matrix transposition), and LINPACK, are good candidates to scale over multiple FPGAs and stress the communication interfaces.
For inter-FPGA communication, there does not exist a standard comparable to MPI on CPUs. The serial interfaces of recent FPGAs can be used to establish circuit-switched and packet-switched networks. This includes implementations of an Ethernet core [12] that can be directly used from HLS code or application-specific protocols [18]. Another approach is the Intel-specific Open Computing Language (OpenCL) extension for point-to-point connections Intel External Channels (IEC). With SMI [3], also a publicly available library for the communication in a Circuit Switched Network (CSN) based on IEC has been proposed. This approach abstracts away the routing but is still vendor-specific.
With EasyNet [9] and ACCL [10] there exist libraries for Xilinx FPGAs that offer point-to-point and collectives communication over packet-switched networks. However, these libraries are vendor-specific and require optimized versions of the benchmarks. In consequence, the only way of communication that can be used on both—Intel and Xilinx FPGAs—without vendor- or device-specific changes in the OpenCL C code, is the data exchange via PCIe and MPI via the inter-CPU network. With this approach, only standardized OpenCL and MPI calls are used for the data transfer between FPGAs, and the same OpenCL C kernel code is executed on the FPGAs independent of the vendor or device family.
Therefore, to create a widely usable benchmark suite for multi-FPGA systems, we make the following contributions:
• We extend the benchmark suite with three new communication-focused multi-FPGA benchmarks, including LINPACK, and provide baseline implementations compatible with a wide range of FPGAs.
• We add support for multi-FPGA execution and validation for all existing benchmarks of the suite and propose improved designs that allow a well-scaling execution over dozens of FPGAs.
• We provide a vendor-specific, optimized implementation using IEC for all three benchmarks to show the easy extendability of the benchmark suite with vendor-specific communication interfaces.
• We evaluate all benchmarks on two different multi-FPGA systems with Intel and Xilinx FPGAs. The results show that distributed FPGA systems can reach HPC performance and thus require corresponding benchmarking techniques.
All communication libraries mentioned above are good candidates for vendor-specific, optimized versions of the benchmarks. Since the implementation of well-optimized benchmarks for a specific system or library is a non-trivial task, in this work we focused on the implementation of meaningful baseline versions for a broad range of devices. Additionally, we provide optimized implementations for one of the communication approaches discussed above. We made the extensions of the suite publicly available and contributed the proposed changes to the official sources of the HPCC FPGA benchmark suite.1 In the future, we plan to extend the benchmark suite to support additional communication libraries and invite everyone to submit optimized, vendor-specific versions of the benchmarks to the official repository.
2 PARALLEL IMPLEMENTATION OF HPC CHALLENGE BENCHMARKS FOR FPGA
The existing benchmark kernels of HPCC FPGA are called base implementations and are designed to provide good performance on different FPGA architectures. On the one hand, this is achieved with configuration parameters that allow scaling the benchmark kernels and, on the other hand, with code optimizations that apply to a broad range of FPGAs. This allows the creation of efficient designs without manual code changes for different FPGA architectures. The configuration parameter
Different hardware interfaces can be utilized for inter-FPGA communication with recent FPGA boards. Direct inter-FPGA communication via the serial interfaces requires vendor-specific extensions and libraries, which makes it impossible to create base implementations with this approach. However, sending the data over the host via PCIe and MPI can be implemented in vendor-independent OpenCL code. Thus, we use this communication approach in the base implementations.
Since we use the same structure for code organization and the build process as the existing benchmarks in HPCC FPGA, the new benchmarks come with support for custom kernels. This allows easy extension of the benchmarks with additional OpenCL kernels. One restriction is that the OpenCL kernels need to have the same kernel signature to work with the existing host code. This means the types of the input and output parameters of the kernels and their ordering need to be the same. For some communication schemes and optimized designs it may be required to slightly change the kernel signature to pass additional information to the kernels. We extended the host code architecture as shown in Figure 1 to also simplify the extension of the benchmarks from the host side. The
Fig. 1. Improved architecture of the benchmark host code to increase extendability with different OpenCL kernels. NetworkBenchmark is the host implementation of one of the new benchmarks. It itself contains different implementations for the execution of the OpenCL kernels on the FPGA that depend on the communication scheme. To extend a benchmark for another communication scheme, only a new execution implementation needs to be added.
For every benchmark, there has to be one MPI rank per FPGA, so the number of MPI ranks needs to match the number of used FPGAs. Before every kernel execution on the FPGA, the hosts synchronize using an MPI barrier to reduce the measurement error. For each repetition of the benchmark execution, the slowest execution time among all FPGAs is reported. The best repetition is used to calculate the derived performance metric of the benchmark.
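This reporting scheme (slowest FPGA per repetition, best repetition overall) can be summarized as a small pure function. The following Python sketch is illustrative only; the function name is ours and it is not part of the suite's host code:

```python
def reported_time(times_per_rep):
    """times_per_rep[r][f] is the measured kernel time of FPGA f in
    repetition r. Each repetition is only as fast as its slowest FPGA;
    the benchmark reports the best such repetition."""
    return min(max(rep) for rep in times_per_rep)
```

For example, for two repetitions on two FPGAs with times [[1.0, 2.0], [1.5, 1.6]], the first repetition is dominated by the 2.0 s FPGA, so the second repetition with 1.6 s is reported.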
In the following, we give a detailed description of the new benchmark implementations.
2.1 Effective Bandwidth Benchmark
In this benchmark, we use the rules of the Effective Bandwidth (b_eff) benchmark given in Reference [19]. It is a synthetic benchmark that uses the derived metric effective bandwidth to combine both—the network latency and bandwidth—into a single metric. The original benchmark sends messages of sizes \(2^0, 2^1, \dots , 2^{20}\) B to neighbor nodes in a ring topology. The effective bandwidth is calculated from the measured bandwidth for the different message sizes as shown in Equation (1): (1) \(\begin{equation} b_{\mathit {eff}} = \frac{\sum _L(\mathit {max}_{\mathit {rep}}(b(L, \mathit {rep})))}{21}, \end{equation}\) where \(L\) are the message sizes, \(\mathit {rep}\) the repetitions of the execution, and \(b(L, \mathit {rep})\) the measured bandwidth for message size \(L\) during repetition \(\mathit {rep}\).
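Equation (1) translates directly into code. The following Python sketch (function name is ours) assumes one list of repeated bandwidth measurements per message size, for the 21 sizes \(2^0\) to \(2^{20}\) B:

```python
def b_eff_metric(bandwidths):
    """bandwidths[k] holds the measured bandwidths of all repetitions
    for message size 2**k bytes (k = 0..20). Per Equation (1), the best
    repetition of each size is averaged over the 21 sizes."""
    assert len(bandwidths) == 21
    return sum(max(per_size) for per_size in bandwidths) / 21
```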
The base implementation of this benchmark does not require an FPGA kernel, because data is transferred between the FPGAs solely by the host. The optimized version for Intel FPGAs is configurable with the parameters given in Table 1. In addition to the number of kernel replications, it contains the width of the external channels in bytes.
2.1.1 Base Implementation.
The base implementation exchanges the messages between the global memory of neighboring FPGAs in the ring. Therefore, it reads a memory buffer representing the message using the OpenCL directive
The expected performance is limited by the required time to read (\(\mathit {pcie\_read}_t\)) and to write (\(\mathit {pcie\_write}_t\)) a message of the given size to the FPGA via PCIe, plus the time required to exchange the message between the nodes using MPI (\(\mathit {mpi}_t\)). All three steps need to be executed sequentially, so the expected bandwidth for a message size \(L\) can be modeled with Equation (2): (2) \(\begin{equation} b_{L} = \frac{2 \cdot L}{\mathit {pcie\_write}_t + \mathit {mpi}_t + \mathit {pcie\_read}_t} \end{equation}\)
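Because the three steps do not overlap, the model of Equation (2) is a simple function of the message size and the three step times. A minimal Python sketch (function name is ours):

```python
def pcie_mpi_bandwidth(L, pcie_write_t, mpi_t, pcie_read_t):
    """Equation (2): PCIe write, MPI exchange, and PCIe read are
    executed strictly sequentially for a message of L bytes; the
    factor 2 accounts for the exchange happening in both directions."""
    return 2 * L / (pcie_write_t + mpi_t + pcie_read_t)
```

For instance, a 1 KiB message with 1 µs PCIe write, 2 µs MPI exchange, and 1 µs PCIe read yields a modeled bandwidth of 512 MB/s.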
2.1.2 Intel External Channels Implementation.
The Intel-optimized implementation requires OpenCL kernel code and consists of two different kernel types: a send kernel and a receive kernel. During execution, they continuously send or receive data over two external channels of the width specified in
A schematic view of the channel connections for the kernel implementation is given in Figure 2. The kernels are connected to another kernel pair on a different FPGA. In the figure, the kernels form a small ring over two FPGAs and the topology can be arbitrarily scaled by adding more FPGAs.
Fig. 2. Data exchange of two kernel pairs over the external channels. The kernel pairs are executed on two different FPGAs and communicate over the bi-directional external channels. The kernel pairs are connected over an internal channel to forward the received data to the send kernel for the next iteration.
The arrows describe the path of a single data chunk through the kernels. A message chunk will be repeatedly sent over the external channel until the sum of all sent chunks matches the desired message size. Messages are sent in parallel in both directions. After a complete message is sent, the message chunk is forwarded from the receive to the send kernel over the internal channel. Only then is the next message sent, now using the message chunk received over the internal channel. The message chunk is stored in a global memory buffer after the last message is exchanged and used for validation on the host side. A single send-receive kernel pair will use two external channels in both directions.
The performance metric of the benchmark combines latency and the total bandwidth of the network. To model the performance, we need precise information about the latency and bandwidth of the external channels as they are given in Table 2. Every kernel replication can only make use of two external channels, which means that \(c_n^{\prime } = 2\) and the total number of external channels is utilized by using a replication count of 2. The execution time of a kernel pair for a given message size can then be modeled with Equation (3), where \(L\) is the used message size and \(i\) the number of messages that are sent. (3) \(\begin{equation} t_{L, i} = \frac{\lceil \frac{L}{c_n^{\prime } \cdot c_w}\rceil \cdot i}{c_f} + i \cdot c_l \end{equation}\)
Table 2. Characteristics of the Serial Channel IP of the BittWare 520N Board Taken from the Specification [1]
For the bandwidth model, we insert the values for the IP core taken from Table 2, which results in Equation (4). (4) \(\begin{equation} b_{L} = \frac{2 \cdot L}{\lceil \frac{L}{64B}\rceil \cdot 6.4ns + 520ns} \end{equation}\)
This equation models the bandwidth for a single send-receive kernel pair and is expected to scale linearly with the number of kernel pairs.
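Equations (3) and (4) can be checked numerically. The sketch below (function names are ours) uses the Table 2 values as defaults: two channels per kernel pair (\(c_n^{\prime } = 2\)), 32 B channel width, 156.25 MHz channel clock, and 520 ns message latency:

```python
from math import ceil

def iec_time(L, i, c_n=2, c_w=32, c_f=156.25e6, c_l=520e-9):
    """Equation (3): time to send i messages of L bytes over one kernel
    pair driving c_n external channels of c_w bytes at c_f Hz, with a
    per-message latency of c_l."""
    return ceil(L / (c_n * c_w)) * i / c_f + i * c_l

def iec_bandwidth(L):
    """Equation (4): bidirectional bandwidth of one send-receive kernel
    pair; multiply by the number of kernel pairs for the total."""
    return 2 * L / iec_time(L, 1)
```

With these values, a 64 B message takes one channel cycle (6.4 ns) plus the 520 ns latency, so small messages are clearly latency-dominated.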
2.2 Parallel Matrix Transposition
The parallel matrix transposition (PTRANS) benchmark computes the solution of \(C = B + A^T\) where \(A,B,C \in \mathbb {R}^{n \times n}\). The matrix \(A\) is transposed and added to another matrix \(B\). The result is stored in matrix \(C\). All matrices are divided into blocks, and the blocks are distributed over multiple FPGAs using a PQ distribution scheme shown in Figure 3.
Fig. 3. Diagonal distribution of the 16 blocks of a \(4 \times 4\) block matrix on four FPGAs with \(P=Q=2\). The original matrix is shown on the left. Colors indicate the FPGA in whose global memory the data block of the matrix will reside at the beginning of the calculation. On the right, the placement of the data on the different FPGAs is shown. The bold lines represent the borders of the memory of a single FPGA.
2.2.1 Base Implementation.
The configuration parameters for the implementation are given in Table 3. The number of kernel pairs can be defined with the
| Parameter | Description |
|---|---|
| | Size of the matrix blocks that are buffered in local memory and also distributed between the FPGAs. |
| | Width of the channels in data items. Together with the used data type, the width of the channel in bytes can be calculated. |
| | Specifies the used data type for the calculation. |
Table 3. Configuration Parameters of the PTRANS Benchmark
The implementation consists of a single OpenCL kernel that sequentially executes three pipelines for every matrix block. In the first pipeline, a block of matrix \(A\) is read from global memory and written into a buffer. The second pipeline reads the block of \(A\) transposed from the buffer, reads a block of \(B\) from global memory, adds both blocks, and stores the result in an additional buffer. The content of this buffer is written back to global memory in the last pipeline.
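The per-block computation performed by the three pipelines can be written as a short reference function. The Python sketch below (function name is ours) mirrors the pipeline structure for one \(n \times n\) block; the actual kernel streams the data rather than materializing full Python lists:

```python
def transpose_add_block(A_block, B_block):
    """Reference for one matrix block of C = B + A^T."""
    n = len(A_block)
    buf = [row[:] for row in A_block]                   # pipeline 1: buffer block of A
    C = [[B_block[i][j] + buf[j][i] for j in range(n)]  # pipeline 2: add A^T to B
         for i in range(n)]
    return C                                            # pipeline 3: write result back
```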
Every pipeline reads or writes a single block of data from or to the global memory. This approach is similar to the one used for the STREAM benchmark in the suite and leads to an efficient use of the global memory. Before the kernel can be executed, the matrix \(A\) needs to be exchanged by the host ranks using
With Equation (5), the expected execution time for a single matrix block is given. It consists of the time required to exchange the blocks via MPI and write them into the global memory of the FPGAs (\(t_\mathit {MPI}\)) and the execution time of the OpenCL kernel. The kernel execution time is based on the three pipelines that are executed sequentially, the block size \(b\), the channel width in number of values \(c_w\), and the clock frequency of the used channel \(c_f\). Depending on the number of kernel replications, the FPGA may be able to process multiple matrix blocks simultaneously without interference. (5) \(\begin{equation} t_{\mathit {PTRANS}} = t_\mathit {MPI} + 3 \cdot \frac{b^2}{c_w \cdot c_f} \end{equation}\)
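In this model, each of the three pipelines needs \(b^2/c_w\) cycles to stream one \(b \times b\) block, and the cycle count is divided by the clock frequency \(c_f\) to obtain time. A minimal Python sketch (function name is ours):

```python
def ptrans_block_time(t_mpi, b, c_w, c_f):
    """Equation (5): MPI exchange time plus three sequential pipelines,
    each streaming a b*b block at c_w values per cycle and c_f Hz."""
    return t_mpi + 3 * b * b / (c_w * c_f)
```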
For the verification of the data, the non-transposed blocks of matrix \(A\) are exchanged by the hosts using MPI. Then, each host re-calculates the result using a CPU reference implementation. The reported error is the maximum residual error between the FPGA and CPU result.
2.2.2 Intel External Channels Implementation.
The Intel-specific implementation comes with the restriction that \(P\overset{!}{=}Q\). This allows setting up a static circuit-switched network between pairs of FPGAs and exchanging the matrix blocks without additional routing. The FPGA logic is implemented in two kernels per external channel, similar to the b_eff benchmark. For this implementation, the width of the channel defined by
One of the kernels reads a block of \(A\) into local memory. The size of this local memory buffer can be defined with
The second kernel will receive chunks of a transposed block of \(A\), add a block of \(B\) to it, and store it in global memory. In consequence, no local memory is needed in this kernel. One major goal of this implementation is to continuously send and receive data over all available external channels to utilize the available network bandwidth, which is most likely the performance bottleneck. Nevertheless, the kernels may also suffer from low global memory bandwidth, because they need to concurrently read and write to three different buffers for every kernel replication.
This leads to a total required global memory bandwidth on a single FPGA given in Equation (6): (6) \(\begin{equation} b_{\mathit {global}} = 3 \cdot r \cdot c_w \cdot c_f, \end{equation}\) where \(r\) is the number of external channels per FPGA (or number of kernel replications), and \(c_f\) and \(c_w\) the frequency and the width of an external channel as defined in Table 2. This means the required global memory bandwidth is three times higher than the network bandwidth to keep the benchmark network-bandwidth-bound. As a performance metric, the number of Floating Point Operations (FLOP) per second is derived from the execution time. For the calculation, it is assumed that \(n^2\) additions are required for the computation on matrices of width \(n\). Considering the characteristics of the external channels of the used BittWare 520N boards, the maximum performance will be \(p = i \cdot r \cdot 32B \cdot 156.25MHz\) for a sufficiently large matrix, where \(i\) is the number of used FPGAs. Note that the block size is not considered in this performance model. It is used to allow larger memory bursts from global memory, which are defined by the width of the block. This leads to a higher efficiency of the global memory accesses, but, since the performance model covers the case where the network bandwidth is the bottleneck, this parameter can be neglected. However, for very small block sizes, the efficiency of the global memory may be reduced to the point that it becomes the bottleneck.
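Equation (6) can be evaluated with the channel parameters from Table 2 (32 B width, 156.25 MHz). The following Python sketch (function name is ours) shows that, e.g., four kernel replications already demand 60 GB/s of global memory bandwidth:

```python
def required_global_bandwidth(r, c_w=32, c_f=156.25e6):
    """Equation (6): every kernel replication concurrently streams
    three buffers (read A, read B, write C), each at the external
    channel rate c_w * c_f bytes per second."""
    return 3 * r * c_w * c_f
```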
2.3 High-performance LINPACK
The High-performance LINPACK benchmark solves a large equation system \(A \cdot x = b\) for \(x\), where \(A \in \mathbb {R}^{n \times n}\) and \(b,x \in \mathbb {R}^{n}\). This is done in two steps: First, the matrix \(A\) is decomposed into a lower matrix \(L\) and an upper matrix \(U\). In a second step, these matrices are used to first solve \(L \cdot y = b\) and finally \(U \cdot x = y\) to get the result for the vector \(x\). For the implementation of the benchmark on FPGA, the rule set for the HPL-AI mixed-precision benchmark [4] was adapted, which defines \(A\) to be a diagonally dominant matrix. Thus, the LU factorization does not require pivoting. In contrast to the original benchmark, it is possible to choose between single-precision and double-precision floating-point values. Since the benchmark suite is designed to only measure the FPGA performance, no additional iterative method is used to refine the result if a lower precision is used. Only the LU decomposition, which is the most compute-intensive step in this calculation, is executed on the FPGAs. The number of FLOPs for this step is defined to be \(\frac{2 \cdot n^3}{3}\) for a matrix \(A\) with width \(n\), in contrast to \(2 \cdot n^2\) for solving the equation systems for the LU-decomposed matrix. Only the performance of the LU factorization on the FPGA is reported.
2.3.1 Base Implementation.
The base implementation uses a blocked, right-looking variant of the LU factorization as described in Reference [5]. The matrix is divided into sub-blocks with a width of \(2^\mathit {BLOCK\_SIZE\_LOG}\) elements; the exact size of the blocks is defined via a configuration parameter. For the update of a single row and column of blocks, we need to perform four different operations. A single iteration of the blocked LU decomposition is shown in Figure 4. In every iteration, the LU factorization for a diagonal block of the matrix is calculated, which is marked green in the visualization. All grey-colored blocks on the left and top of this block were already updated in previous iterations and require no further processing. This approach is called right-looking because we always update the blocks to the right of the LU block. After the LU block is decomposed, the lower matrix block \(L\) is used to update all blocks on the right of the LU block. Since they are the top-most blocks that still require an update, they are in the following called top blocks. The upper matrix block \(U\) is used to update all blocks below the current LU block. These are the left-most blocks that require an update, so they are referred to as left blocks. The left and top blocks in turn are used to update all inner blocks, which can efficiently be done using matrix multiplication. The design contains a separate kernel for each of the four operations.
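The core of the factorization, applied to each diagonal block, is an LU decomposition without pivoting, which is valid here because \(A\) is diagonally dominant. The following unblocked Python reference (ours, for illustration only; the FPGA design works on blocks and uses the four separate kernels described above) shows the right-looking update of the trailing submatrix:

```python
def lu_decompose_inplace(A):
    """Right-looking LU factorization without pivoting. Afterwards, the
    strict lower triangle of A holds L (with an implied unit diagonal)
    and the upper triangle holds U."""
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                # compute column k of L
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]  # update trailing submatrix (right-looking)
    return A
```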
Fig. 4. In every iteration of the algorithm, a single block in the matrix is decomposed into a lower and upper matrix (green). The lower matrix is used to update all blocks on the right of this block (blue) and the upper matrix to update all blocks below this block (orange). The updated top and left blocks are then used to update all inner blocks (red), and the dark-red blocks need to be updated before the next communication phase can start.
Additionally, a single iteration of the LU decomposition is split into two subsequent steps in the design. In the communication phase, the LU, left, and top blocks are updated, which also involves data exchange between kernels on the same FPGA and between the FPGAs. In the update phase, the exchanged data is used to update all inner blocks locally using matrix multiplication kernels.
Both phases can overlap as shown by the timeline of kernel executions in Figure 5 based on the matrix given in Figure 4. The number of matrix multiplications required for a single iteration of the algorithm increases quadratically with the matrix size. No data dependency exists between the light-red matrix multiplications and the operations of the next communication phase, which allows overlapping of the two phases. For large matrices, this means that the performance of the implementation is limited by the aggregated performance of the matrix multiplication kernels. During the communication phase, matrix blocks are exchanged via the host using PCIe and MPI.
Fig. 5. Kernel executions over time for a single iteration of the LU decomposition in the base implementation. During the communication phase, data needs to be exchanged two times between FPGAs using MPI and PCIe. The matrix multiplication kernels are executed in the update phase. Communication and update phase of subsequent iterations can overlap, so communication latency can partially be hidden.
2.3.2 Intel External Channels Implementation.
Figure 6 shows the connections between the kernels used in the communication phase. For the execution over multiple FPGAs, the boards are arranged into a quadratic 2D torus of variable size using the point-to-point connections. Not all kernels need to be active on every FPGA within a single iteration. Instead, data can also be received over the external channels if it is computed on another FPGA. If the FPGA is in charge of calculating the LU block, the LU kernel is executed and the decomposed L and U blocks are forwarded row- and column-wise to a network kernel. The network kernel forwards the data over the external channels to neighboring FPGAs in the torus. The four possible directions are used for different types of data, as indicated by the red arrows. Moreover, the network kernel forwards the locally computed L and U blocks to the left and top kernels. The top and left kernels use the data to update a block with the L or U block and forward the updated block to the next network kernels. Here, the input data is selected either from the internal or external channels, and data is forwarded over the external channels if required. Besides that, incoming data is stored in global memory buffers for later use in the update phase. By splitting the network communication into three kernels, it is possible to establish a cycle-free data path through the torus during the communication phase. This reduces the impact of pipeline and channel latencies during this phase.
Fig. 6. Data flow through the kernels of the communication phase on a single FPGA. The kernels are connected over internal channels. Between the calculation kernels, network kernels are used to select the correct input for the next kernel from the internal or external channels. The network kernels for the top and left direction will also store incoming matrix blocks as input for the matrix multiplication. The bold arrows represent the serial channels and their direction in the 2D torus.
During the transfer from the left kernel to the network kernel, the left blocks are transposed. This allows a simplified design of the matrix multiplication for the inner blocks, since all input matrices can be processed row-wise. Figure 7 shows a part of the execution of the kernels over time for the iteration given in Figure 4. It can be seen that the LU kernel is only executed once per communication phase. The lower and upper matrices are buffered by the left and top kernel to allow the update of subsequent blocks. All network kernels are summarized under Network in the graph. In the example, two matrix multiplication kernels are used, and the blocks are distributed between the two replications. The next communication phase starts as soon as the first row and column of the inner blocks are updated, which is represented by the dark-red blocks.
Fig. 7. Kernel executions over time for a single iteration of the LU decomposition. During the communication phase, the network kernels are active, whereas during the update phase the matrix multiplication kernel is executed. Communication and update phase of subsequent iterations overlap.
A 2D torus is used to connect multiple FPGAs for the LU decomposition. Every FPGA is programmed with the same bitstream, and the host schedules the kernels in the required order and configuration. Matrix blocks are distributed between the FPGAs using a PQ-grid of the size of the torus. This allows balancing the load between the devices more evenly, since the matrix will get smaller with every iteration of the algorithm. In Figure 8, the active kernels and the data exchange between FPGAs in a \(3 \times 3\) torus are shown for a global matrix size of more than 12 blocks, so every FPGA has to update more than four blocks. In this case, only the FPGA on the top left needs to execute all four compute kernels, but every FPGA will use its matrix multiplication kernel.
Fig. 8. For the LU decomposition, the FPGAs use a 2D torus network topology to exchange data. This example shows the active kernels during a single iteration in a 3 \(\times\) 3 torus. The black boxes are the FPGAs, and the colors within the boxes indicate the active kernels. The direction and type of data that is forwarded between the FPGAs is given by the arrows. In every iteration of the algorithm this communication scheme shifts one FPGA to the bottom-right in the torus.
The base implementation of the HPL benchmark uses a two-level blocked approach similar to the GEMM benchmark described in Reference [15]. Thus, it uses two parameters to specify the block sizes of the local memory buffers and of the compute units, as described in Table 4. Additionally, it is possible to specify the data type and the number of matrix multiplication kernels using the
| Parameter | Description |
|---|---|
| | Logarithm of the size of the matrix blocks that are buffered in local memory and also distributed between the FPGAs. |
| | Logarithm of the size of the second-level matrix blocks. The kernels contain completely unrolled logic to start the computation of such a sub-block every clock cycle. |
| | Specifies the used data type for the calculation. |
Table 4. Configuration Parameters of the LINPACK Benchmark
Only the LU factorization of matrix \(A\) is calculated on the FPGAs. After this step, the equation system is solved on the CPUs using a reference implementation distributed among all MPI ranks. The input matrix is generated such that the resulting vector \(x\) is a vector of all ones. The reported error is the normalized maximum residual error calculated with \(\frac{||x||}{n \cdot ||b|| \cdot \epsilon }\), where \(n\) is the width of matrix \(A\) and \(\epsilon\) the machine epsilon.
2.4 Extend Existing Benchmarks for Multi-FPGA Execution
In addition to the new benchmarks proposed in this article, we extend the existing benchmarks of our previous work [15] for the execution in a multi-FPGA environment. An essential configuration parameter for all benchmarks is the specification of kernel replications
The RandomAccess benchmark did not scale well, because it could at best update a single data item per clock cycle, even when scaled over multiple FPGAs. This limitation stems from the way the pseudo-random numbers for the address calculation are generated. We now allow the generation of multiple pseudo-random numbers per clock cycle by replicating the Random Number Generator (RNG) to overcome this limitation. This also changed the configuration parameters of the benchmark, as given in Table 5.
| Parameter | Description |
|---|---|
| | Logarithm of the size of the data buffer that is randomly updated in number of values. |
| | Logarithm of the number of RNGs that are created per kernel replication. |
| | Distance between RNGs in the shift register. |
Table 5. Configuration Parameters of the RandomAccess Benchmark
A single replication of the improved RandomAccess kernel is given in Figure 9. The RNGs are initialized with different seeds to generate a sub-part of the random-number sequence. In consequence, the same random numbers as with the old version are generated; only the order of updates may vary. Every clock cycle, each RNG outputs a new random number. This number is placed into a shift register if two conditions hold:
Fig. 9. RandomAccess shift register used to connect the RNGs to the update logic. The RNG will only put the generated number in the shift register if no valid number is at the current position.
(1) The buffer address derived from the random number is in the range of the kernel replication.
(2) The shift register does not already contain a valid random number at the position where it should be inserted.
If the latter condition is violated, the RNG stalls until the random number can be placed into the shift register. In other words, the produced random numbers are serialized by the shift register before entering the actual update logic. This approach increases the probability that the update logic processes a valid address for high numbers of kernel replications. Since scaling over multiple FPGAs corresponds to increasing the number of kernel replications, this also improves the performance in multi-FPGA execution.
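A behavioral sketch of this insertion scheme follows. It is not the benchmark's kernel code: the RNG is a stand-in for the HPCC-defined sequence, a distance of 1 between RNG slots is assumed, and the range check is simplified:

```python
import random

def rng_stream(seed):
    # Stand-in for the hardware RNG; the real benchmark uses the
    # HPCC-defined pseudo-random sequence, not Python's PRNG.
    r = random.Random(seed)
    while True:
        yield r.getrandbits(32)

def simulate(num_rngs, local_lo, local_hi, total_size, cycles):
    """Sketch of the shift register in Figure 9: each RNG has a fixed
    insertion slot, stalls while its slot holds a valid number, and
    drops numbers whose address is outside this replication's range."""
    rngs = [rng_stream(seed) for seed in range(num_rngs)]
    pending = [next(r) for r in rngs]
    shift_reg = [None] * num_rngs   # one slot per RNG (distance 1 assumed)
    updates = []
    for _ in range(cycles):
        head = shift_reg.pop(0)     # value entering the update logic
        shift_reg.append(None)
        if head is not None:
            updates.append(head)
        for i in range(num_rngs):
            addr = pending[i] % total_size
            if not (local_lo <= addr < local_hi):
                pending[i] = next(rngs[i])   # out of range: skip, no stall
            elif shift_reg[i] is None:       # slot free: insert and advance
                shift_reg[i] = addr
                pending[i] = next(rngs[i])
            # else: the RNG stalls until its slot becomes free
    return updates

ups = simulate(num_rngs=4, local_lo=0, local_hi=512, total_size=1024, cycles=200)
print(len(ups) > 0, all(u < 512 for u in ups))
```

Every update emitted by the sketch falls into the replication's local address range, mirroring the filtering done by the hardware shift register.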
3 BENCHMARK EXECUTION AND EVALUATION
In the following, we execute the existing benchmarks and the three new benchmarks of the suite on two different multi-FPGA systems containing Xilinx or Intel FPGAs and evaluate their scaling behavior.
3.1 Evaluation Setup and Synthesis Results
For the evaluation of the benchmarks, we used version 0.5.1 of the benchmark suite cited in Reference [14], and we made all artifacts and code modifications for additional experiments publicly available [13].
We synthesized and executed the benchmarks on two multi-FPGA systems: the Noctua system of Paderborn Center for Parallel Computing (PC\(^2\)) at Paderborn University and the Xilinx FPGA evaluation system of the Systems Group at ETH Zurich. A visualization of the topology of the FPGA nodes in the Noctua system is given in Figure 10. For the benchmark execution, the data is generated on the CPUs and moved to the DDR memory on the FPGA boards. Results are copied back to the CPU memory from the DDR memory for validation after the benchmark execution.
Fig. 10. Topology of the Noctua FPGA nodes. 16 nodes are connected to a single 100 G Omni Path switch, and every node is equipped with two CPUs and two FPGA boards. All four QSFP28 ports of each FPGA are connected to the optical circuit switch that allows the configuration of arbitrary point-to-point connections between the QSFP28 ports. Every QSFP28 port offers a maximum communication bandwidth of 40 Gbps. The optical switch is configured once before the execution of the benchmarks.
The Noctua system uses Nallatech/BittWare 520N boards equipped with Intel Stratix 10 GX2800 FPGAs, where every node is a two-socket system equipped with Intel Xeon Gold 6148 CPUs, 192 GB of DDR4-2666 main memory, and two FPGAs connected via x8 PCIe 3.0. Moreover, the nodes in the cluster communicate over a hybrid network: The CPUs use an Intel Omni Path network with 100 Gbit/s per port, whereas the FPGAs can exchange data over the four serial interfaces with up to 40 Gbit/s per port. The serial interfaces are connected to a CALIENT S320 Optical Circuit Switch that allows the configuration of arbitrary full-duplex point-to-point connections between the serial interfaces of the FPGAs. This functionality allows creating the desired network topology and is used in the following to establish connections between up to 26 FPGAs to execute the optimized versions of the benchmarks. The network topology is set up before running the benchmarks and stays unchanged during execution.
Board Support Package (BSP) version 20.4.0 and Intel OpenCL Software Development Kit (SDK) for FPGA version 21.2.0 are used to synthesize all benchmark kernels. This BSP version comes in two different sub-versions with and without support for the external channels. All benchmarks were synthesized with the HPC sub-version, which offers no support for communication over the external channels. This BSP requires slightly fewer resources than the MAX sub-version with external channel support. Only the optimized versions of b_eff, PTRANS, and LINPACK are synthesized with the MAX sub-version. The host codes are compiled with GCC 8.3.0 and Intel MPI 2019 Update 6 Build 20191024. Configuration and generation of the build scripts are done using CMake 3.15.3.
Additionally, the benchmarks are synthesized and executed on Xilinx Alveo U280 boards. The Heterogeneous Accelerated Compute Cluster (HACC) system at ETH Zurich contains four of these boards. As SDK, Vitis 2020.2 is used with the shell xilinx_u280_xdma_201920_3, and XRT 2.9. The host codes are compiled with GCC 7.5.0 and OpenMPI 2.1.1. Each FPGA is controlled by an Intel Xeon Gold 6234 CPU with 108 GB of main memory. The topology of the used nodes is given in Figure 11. The visualization is simplified to only show the accelerator boards and communication interfaces that are used for the benchmark execution. Each node also contains an additional Alveo U250 board, which is not used in the experiments.
Fig. 11. Topology of the used ETH HACC system nodes. Each node is connected to the 100 G Ethernet switch via two Network Interface Controllers (NICs) and is equipped with four CPUs and two FPGA boards.
All benchmark kernels are designed to be independent of the number of used FPGAs, so only a single synthesis for every benchmark and FPGA board is required for the evaluation. With configuration parameters, it is possible to improve resource utilization and performance of the benchmark kernels for a specific FPGA board before synthesis. For the benchmarks STREAM, FFT, and GEMM, these configuration parameters were discussed in more detail in Reference [15]. Table 6 contains the used configurations for each benchmark. The configuration parameters are chosen to better utilize performance-relevant resources on the FPGA.
| Benchmark | Parameter | 520N IEC | 520N PCIe | U280 PCIe |
|---|---|---|---|---|
| STREAM | | 4 | 4 | 2 |
| | | float | float | float |
| | | 1 | 1 | 1 |
| | | 16 | 16 | 16 |
| | | 32,768 | 32,768 | 16,384 |
| RandomAccess | | 4 | 4 | 2 |
| | | 0 | 0 | 10 |
| | | 5 | 5 | 3 |
| | | 5 | 5 | 1 |
| FFT | | 2 | 2 | 1 |
| | | 17 | 17 | 9 |
| GEMM | | 5 | 5 | 3 |
| | | float | float | float |
| | | 8 | 8 | 8 |
| | | 512 | 512 | 256 |
| | | 8 | 8 | 8 |
| b_eff | | 2 | only host code required | only host code required |
| | | 8 | only host code required | only host code required |
| PTRANS | | 4 | 4 | 2 |
| | | float | float | float |
| | | 8 | 16 | 16 |
| | | 512 | 512 | 256 |
| LINPACK | | 5 | 5 | 2 |
| | | float | float | float |
| | | 9 | 9 | 8 |
| | | 3 | 3 | 3 |
Table 6. Synthesis Configurations of All Benchmarks
In Table 7, the resource usage of the synthesized benchmark kernels is given. Our updated scalable implementation of the RandomAccess benchmark now requires additional logic and BRAMs to implement the RNGs. We have chosen the number of RNGs to be the next power of two greater than or equal to the number of used FPGAs. This increases the probability that a valid number can be processed in every clock cycle by every FPGA. Further increasing the number of RNGs may lead to lower clock frequencies, which would also impact performance. With the recent SDK version, we were able to synthesize the GEMM benchmark with a much higher clock frequency for the BittWare 520N, which promises a large performance improvement. For the b_eff benchmark, only a single bitstream is synthesized. For the base implementation, no bitstream is required, since the data transfer is handled solely by the host. Resource consumption is also not an issue for this synthetic benchmark, because its performance is mainly limited by the network bandwidth and latency.
Table 7. Resource Usage of the Synthesized Benchmarks
PTRANS requires BRAM buffers that are used to transpose the matrix block-wise and store intermediate results. A large block size benefits large memory bursts, but it is also important to achieve a clock frequency close to 300 MHz to make best use of the memory bandwidth. For LINPACK it is, similar to GEMM, important to maximize the number of DSPs used for matrix multiplications. In addition, some extra BRAM is mainly required to store the matrix blocks for the kernels of the communication phase. As a result, the resource utilization is very similar to GEMM. A significant difference is visible for the Alveo U280, where only 69% of the DSPs can be utilized. We were not able to fit the communication kernels and three matrix multiplication kernels on the FPGA, because this would overutilize the DSPs. One approach to make use of the remaining DSPs would be to reduce the parallelism of the third matrix multiplication kernel via its configuration parameter.
The difference in DSPs between the base and optimized implementations for the BittWare 520N is caused by the way the multiplication of the \(8 \times 8\) matrices in registers is implemented. In the IEC version, only multiply-adds are used, consuming 512 DSPs in total per replication. In the base implementation, the compiler created the same matrix multiplication from 64 dot-products of size 8 followed by 64 additions. This slightly increases the DSP usage to 576 DSPs per replication but considerably reduces the logic and BRAM usage, as can be seen in the resource utilization.
Besides the two bitstreams for the baseline and one for the vendor-specific implementation, we also synthesized LINPACK with a block size of 256 elements for the 520N. This is the same block size that is used on the U280 and allows a better comparison of the performance results of both FPGA boards. The configuration requires considerably less logic and BRAM compared to the version with a 512-element block width and achieves a higher clock frequency.
3.2 Evaluation of the Effective Bandwidth and PTRANS
The b_eff benchmark does not only report the derived metric effective bandwidth but also the achieved bandwidth for all tested message sizes, which range from a single byte to 1 MB. A new message is only sent after the current message is received by the neighbor node.
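Assuming the derived metric averages the achieved bandwidths over all 21 tested message sizes, following the original b_eff definition, it could be computed as below (the exact weighting used by the suite may differ):

```python
def effective_bandwidth(measured_gb_s):
    """Average the achieved bandwidth over all tested message sizes,
    from 1 B (2^0) to 1 MB (2^20). `measured_gb_s` maps message size
    in bytes to the measured bandwidth in GB/s."""
    sizes = [2 ** i for i in range(21)]
    return sum(measured_gb_s[s] for s in sizes) / len(sizes)

# A flat bandwidth profile trivially yields itself as the effective bandwidth.
flat = {2 ** i: 4.0 for i in range(21)}
print(effective_bandwidth(flat))  # 4.0
```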
The base implementation of the b_eff benchmark reads the data from the FPGA board to the host using an OpenCL call, exchanges the data between the host CPUs via MPI, and writes the received data back to the memory of the FPGA board.
The measured total bandwidth over the message sizes is given in Figure 12 for two FPGAs or CPU nodes, respectively. We do not show the MPI-only performance for the Xilinx system in this plot, since it heavily overlaps with the measurements for the base implementations. The maximum theoretical bandwidths for PCIe, MPI via the Intel Omni-Path 100 Gbit/s interconnect, and IEC are given by the black dashed lines in the plot. For the base implementations, the bandwidth remains below 5 GB/s on both devices, although the theoretical bandwidths of PCIe and MPI are both much higher. The bandwidth is limited by the additional copy operations required to get the data from FPGA to CPU and back. Using Equation (2) and the peak MPI and PCIe bandwidths, the theoretical bandwidth of the baseline version is given in Equation (7):
(7) \(\begin{equation} b_{\mathit {max}} = \frac{2}{\frac{1}{12.5\,\mathit {GB/s}} + \frac{2}{8\,\mathit {GB/s}}} = 6.06\,\mathit {GB/s.} \end{equation}\)
However, the measurements with the MPI-only implementation of the benchmark show that the maximum message size of 1 MB is not sufficiently large to utilize the MPI peak performance on Noctua.
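Equation (7) combines the one MPI hop and the two PCIe hops of the baseline path harmonically; a quick check of the arithmetic:

```python
def baseline_bandwidth(b_mpi_gb_s, b_pcie_gb_s):
    """Theoretical bandwidth of the baseline path (Equation (7)):
    each exchanged message crosses PCIe twice (device-to-host and
    host-to-device) and the MPI network once."""
    return 2.0 / (1.0 / b_mpi_gb_s + 2.0 / b_pcie_gb_s)

print(round(baseline_bandwidth(12.5, 8.0), 2))  # 6.06
```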
Fig. 12. The aggregated bandwidth over different message sizes measured by the b_eff benchmark over two CPUs or FPGAs. Next to the measurements, the maximum performance for the communication between FPGAs, between CPUs, and between FPGAs and CPUs via PCIe is shown.
The optimized IEC approach shows maximum bandwidths close to the theoretical peak for 1 MB message sizes. Also, the measurements closely correlate to the model described in Section 2.1.2 in Equation (4).
The linear scaling behavior of the derived effective bandwidth metric is visualized for all four implementations in Figure 13. All implementations show nearly perfect scaling with the available network bandwidth, as indicated by the extrapolation lines for each device. Especially for the MPI-only and the MPI + PCIe versions, the performance on a single node is considerably higher than the available network bandwidth, because in these cases the data is transferred between the ranks using shared memory. We also observe a huge difference in the effective bandwidth between the MPI versions on Noctua and the Xilinx system. These measurements have to be kept in mind when comparing the results of the base implementation on the two systems, since the huge difference in MPI performance will also have an impact on the PCIe + MPI performance. The ability to add further optimized implementations of the benchmark kernels makes it possible to generate comparable results not only for FPGAs but also for CPUs or other accelerators. This only requires minor changes in the existing code base, and large parts of the code can be reused, including the handling of input parameters, input data generation and validation, calculation of derived metrics, and printing of performance summaries.
Fig. 13. The measured effective bandwidth over the number of used FPGAs and CPUs. A logarithmic scale is used for the ring size and the measured bandwidth. The colored lines represent the perfect scaling based on measured effective bandwidth over two FPGAs or CPU nodes.
For the matrix transposition, the blocks of a matrix are distributed among the FPGAs in a PQ distribution with \(P = Q\). A matrix of 32,768 elements is transposed using strong and weak scaling. The resulting speedups for the different FPGAs are given in Figure 14. In the strong scaling experiment, the base implementation on the BittWare 520N shows a better scaling behavior than the optimized version using IEC. This is because the base implementation is mainly bottlenecked by the PCIe bandwidth for the exchange of the matrices. In contrast, the optimized version shows a significant reduction of the speedup for larger numbers of FPGAs. This is caused by the compute pipeline on the FPGAs, which cannot be fully utilized with the smaller local matrix sizes.
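A block-cyclic PQ layout with \(P = Q\) and the resulting transpose exchange can be sketched as follows. The block-to-rank mapping (row-major rank ordering) is an assumption for illustration; the suite may use a different ordering:

```python
def owner(i, j, p, q):
    """Rank owning block (i, j) under a block-cyclic PQ distribution
    over a p x q rank grid (row-major rank ordering assumed)."""
    return (i % p) * q + (j % q)

def transpose_partner(i, j, p, q):
    """The rank holding block (i, j) exchanges it with the rank that
    owns the mirrored position (j, i) in the transposed matrix."""
    return owner(j, i, p, q)

# With P = Q, diagonal blocks are exchanged within the same rank.
print(transpose_partner(2, 2, 3, 3) == owner(2, 2, 3, 3))  # True
```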
Fig. 14. Speedup of the PTRANS benchmark executed with a quadratic matrix of 16,384 elements over up to 25 FPGAs in a weak and strong scaling scenario.
In the weak scaling experiment, the matrix size per FPGA stays the same, and the implementation achieves optimal speedup for up to 25 FPGAs. The base implementation shows no significant differences between strong and weak scaling on either FPGA. On the Xilinx Alveo U280, the base implementation does not scale well. This is related to the low FPGA-to-FPGA bandwidth that we also measured in the b_eff benchmark, which means that this difference is caused by the comparably low MPI performance on the Xilinx system and not by the FPGAs.
3.3 Evaluation of HPL
In a first experiment, we measure the performance on a single FPGA for different matrix sizes of up to 20,480 elements. The performance of four bitstreams on the two different systems is given in Figure 15. To allow an easier comparison of the efficiency of the design on the different platforms, the performance was normalized to a kernel frequency of 100 MHz and a single kernel replication, so the normalized performance for a given matrix size should be very similar on the different platforms. For small matrix sizes, the communication phase cannot be overlapped with the computation phase. Only for larger matrix sizes do both phases overlap for most of the computation time, and the performance converges to the matrix multiplication performance. Still, significant differences between the bitstreams can be observed; they are mainly caused by the chosen benchmark configuration parameters and compiler flags. When comparing the base version and the vendor-specific version with Intel external channels (IEC) on the BittWare 520N board, the base version using PCIe for communication shows lower performance, although the same configuration parameters are used. The base version of the benchmark failed to synthesize with memory interleaving, because additional Load Store Units (LSUs) are used for the communication and increase the complexity of the memory system. Since only a single buffer is used to store matrix \(A\), this effectively reduces the global memory bandwidth, and the stalls of the matrix multiplication kernels increase.
Fig. 15. Normalized performance of the HPL bitstreams on the target FPGAs for different matrix sizes. The base versions of the benchmark are marked with PCIe, referring to the path of communication. For small matrices, the design is limited by the communication latency until it can be efficiently hidden by matrix multiplications. Moreover, an additional execution for the BittWare 520N with a block size of 256 is given for comparison with the Xilinx Alveo U280.
On the Xilinx Alveo U280 board, the largest block size that fits on the device is 256 elements, in contrast to 512-element blocks for the 520N board. The reduced block size results in an overlap of communication and computation already for smaller matrix sizes but also reduces the peak performance, because the utilization of the matrix multiplication pipeline decreases. For comparison, we synthesized a bitstream for the 520N with a block size of 256. It shows a similar scaling behavior with regard to the matrix size but a slightly lower normalized peak performance: The bitstream for the 520N achieves a nearly 80% higher frequency, which also increases the memory bandwidth utilization and leads to more frequent pipeline stalls, eventually lowering the normalized performance.
Based on the measurements done with a single FPGA, we set the matrix size for the multi-FPGA experiments to 24,576 elements, since all bitstreams will be close to their peak performance for this size. We use this matrix size as a base for a weak scaling experiment, where the matrix size increases with the width of the FPGA torus such that the matrix size on a single FPGA remains constant. Additionally, we execute a strong scaling experiment, where the global matrix size remains constant while increasing the torus size. The measurement results for the weak scaling experiment are given in Figure 16. All three implementations of the benchmark show close to optimal scaling for up to 25 FPGAs. Considering the differences in the network bandwidth that affected the PTRANS results, this also means that the benchmark implementation is compute-bound on all FPGAs.
Fig. 16. Speedup of HPL with a matrix width of 24,576 elements over multiple FPGAs in a weak scaling scenario.
The results of the strong scaling experiment are given in Figure 17. All implementations show a much lower increase in performance for larger torus sizes. Based on the data of our single-FPGA scaling experiment shown in Figure 15, we created an extrapolation model for the strong scaling experiment. It shows that the performance per FPGA is tightly coupled to the size of the local matrices on the FPGAs. The extrapolation for the Xilinx Alveo U280 shows a better speedup in this strong scaling scenario because of the smaller block sizes. With this very simple approach, it is already possible to model the performance depending on the total matrix size and the number of FPGAs with high accuracy. The strong scaling experiment shows that, for all implementations, the overall performance in the torus is tightly coupled to the input size on a single device.
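The extrapolation idea can be sketched as: look up the single-FPGA performance at the local matrix size and scale by the number of FPGAs. The measurement tuples below are hypothetical placeholders, not values read from Figure 15:

```python
import bisect

def strong_scaling_model(single_fpga_gflops, n_total, torus_width):
    """Predict total performance from single-FPGA measurements: with a
    square torus and PQ distribution, each FPGA holds a local matrix of
    width n_total / torus_width; the total performance is torus_width^2
    times the per-FPGA performance at that local size."""
    n_local = n_total // torus_width
    sizes = [s for s, _ in single_fpga_gflops]
    perfs = [p for _, p in single_fpga_gflops]
    idx = min(bisect.bisect_left(sizes, n_local), len(sizes) - 1)
    return torus_width ** 2 * perfs[idx]

# Hypothetical single-FPGA measurements: (matrix width, GFLOP/s)
meas = [(6144, 200.0), (12288, 350.0), (24576, 400.0)]
print(strong_scaling_model(meas, 24576, 2))  # 4 FPGAs, n_local = 12288
```

A finer model would interpolate between measured sizes; the lookup above already captures why per-FPGA performance drops as the torus grows and the local matrices shrink.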
Fig. 17. Speedup of HPL with a matrix width of 24,576 elements over multiple FPGAs in a strong scaling scenario. Extrapolation models for the three different bitstreams are given as colored lines. They are based on the measured single-FPGA performance per matrix size in Figure 15.
The HPL implementation achieves a lower performance per FPGA than the existing GEMM benchmark, although both get their performance from matrix multiplication. Also, the configuration parameters are chosen similarly for both benchmarks, which results in a similar expected performance. The main difference in performance is caused by the different clock frequencies of the designs given in Table 7. The tools achieve higher clock frequencies for the base implementation of the GEMM benchmark, because the kernels of the HPL communication phase consume additional resources. Also, the matrix multiplications work on smaller matrix sizes of just a single block, which reduces the reuse of data in local memory.
Our HPL implementation achieves 14.3 TFLOP/s for the base version and 20.8 TFLOP/s for the optimized version using IEC on 25 BittWare 520N FPGAs. The scaling experiments show that the two major reasons for the performance differences are the achieved clock frequencies and the more efficient use of the global memory. Although the benchmark is computation-bound, our optimized version using IEC still achieves higher performance by reducing the number of LSUs. This allows further global memory optimizations and higher kernel frequencies, which improve the performance.
3.4 Evaluation of the Existing Benchmarks
For STREAM, FFT, and GEMM, the design did not change compared to the previous work. All benchmarks except RandomAccess are executed embarrassingly parallel, so every MPI rank computes on a local problem. MPI is only used to exchange measurement and validation results. In the case of RandomAccess, the data array is distributed among the FPGAs. Hence, only scaling to powers of two is possible, since the total size of the data array must be a power of two.
We executed the four benchmarks on up to 26 FPGAs to show their scaling performance. STREAM uses 4 GB arrays, FFT calculates 4,096 1d FFTs of \(2^{17}\) or \(2^9\) complex numbers, and GEMM uses matrices with a width of 23,040 elements per FPGA. RandomAccess is executed in a strong scaling scenario with an 8 GB data array. The normalized measurement results are given in Figure 18. For STREAM, the measurements are normalized to a single memory bank with a theoretical bandwidth of 19.2 GB/s. The benchmark shows a similar scaling behavior on both devices. For GEMM, the results are normalized to a single kernel replication running at 100 MHz with an \(8 \times 8 \times 8\) matrix multiplication in registers. This leads to a maximum theoretical performance of 102.4 GFLOP/s times the number of used FPGAs. Also here, the base implementation shows a performance close to the theoretical peak on both devices. Because of the comparably high clock frequency of our synthesized design, we achieved more than 1.2 TFLOP/s per FPGA on the BittWare 520N.
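The GEMM normalization follows directly from the register tile size: an \(8 \times 8 \times 8\) multiply performs \(8^3\) multiply-adds, i.e. \(2 \cdot 8^3 = 1024\) FLOP per clock cycle:

```python
def gemm_peak_gflops(tile_width, freq_mhz, replications):
    """Peak of a fully pipelined register-tile matrix multiply:
    tile_width^3 multiply-adds (2 FLOP each) per clock cycle."""
    flop_per_cycle = 2 * tile_width ** 3
    return flop_per_cycle * freq_mhz * 1e6 * replications / 1e9

print(gemm_peak_gflops(8, 100, 1))  # 102.4 GFLOP/s, matching the text
```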
Fig. 18. Normalized performance of the four benchmarks without inter-FPGA communication. Data is normalized to a single memory bank and 300 MHz clock for STREAM. For GEMM and RandomAccess, the results are normalized to a single kernel replication at 100 MHz to allow a better comparison of the performance efficiency of the baseline designs on the two FPGA boards.
Although we used the same benchmark code on the FPGA side as in Reference [15], we were not able to execute the FFT benchmark on the Alveo U280. The benchmark required internal channels or pipes between the kernels to forward data, but support for pipes in OpenCL kernel code was removed with XRT 2.9. This still allows synthesis of the benchmark, but no execution. On the BittWare 520N, the benchmark scaled linearly. For FFT, we show the absolute measured performance.
The RandomAccess results were also normalized to the number of memory banks and a kernel frequency of 100 MHz. Because an update of a value requires one read and one write to the memory bank, two clock cycles are needed per update, which results in a theoretical peak performance of 50 MUOP/s per FPGA. On the BittWare 520N, the base implementation gets close to this theoretical peak, whereas on the Alveo U280, we only reach roughly half of this performance. One reason for this is the difference in the configuration: For the Alveo U280, we need a small buffer to read and write multiple values successively, which partially hides the latency of memory accesses and increases performance. As a tradeoff, we see an increased error rate, because we may overwrite values that are already in the buffer. Still, this approach requires alternating between two different pipelines that fill and empty the buffer. Since these pipelines have a considerable latency because of the memory accesses, this also reduces the performance, because the pipelines have to be drained frequently. If it were possible to ignore the dependency between reads and writes, then the single-pipeline approach could be used as it is done for the BittWare 520N.
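The 50 MUOP/s figure follows from the two cycles per update; the per-bank accounting below is an assumption for illustration:

```python
def randomaccess_peak_muops(freq_mhz, banks=1, cycles_per_update=2):
    """Theoretical RandomAccess peak: each update needs one read and
    one write to the same memory bank, i.e. two clock cycles per
    update and bank."""
    return freq_mhz * banks / cycles_per_update

print(randomaccess_peak_muops(100))  # 50.0 MUOP/s at 100 MHz, one bank
```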
3.5 Overall Benchmark Results
In the previous section, we have focused on the evaluation of the performance efficiency of the proposed benchmark designs on Noctua and the HACC system. By normalizing the kernel frequency or the number of memory banks and comparing the results to simple performance models, we showed that the baseline versions of the benchmarks have a similar performance efficiency on both tested devices. The absolute performance numbers obtained with the benchmarks are given in Table 8. The first column contains the performance numbers obtained from the HACC system using four FPGAs, the second and third columns contain the numbers for 16 FPGAs on Noctua for the baseline version and the optimized version using IEC. The results for STREAM, RandomAccess, FFT, and GEMM are only given for the baseline version, because they were not changed for the optimized runs.
Table 8. Benchmark Results
In the last column, we executed HPCC 1.5.0 over 16 CPUs on Noctua. The benchmark suite was compiled with GCC 11.3.0, OpenMPI 4.1.1, and Intel MKL 2022.0.1. For basic optimizations, we enabled OpenMP support and set appropriate compiler optimization flags.
We executed the Embarrassingly Parallel (EP) versions of the HPCC benchmarks, because our FPGA versions are based on their benchmarking rules. Still, there are some differences between the CPU and FPGA benchmarks: STREAM, FFT, GEMM, PTRANS, and HPL are executed in double-precision floating-point on the CPU, whereas single precision is used on the FPGA. For STREAM and PTRANS, this should have only a minimal impact on the results, since both benchmarks are very likely not compute-bound because of their low arithmetic intensity, and the performance is reported in GB/s. FFT executes one large FFT of size \(2^{27}\) instead of a batched execution of \(4{,}096 \times 2^{17}\) FFTs per rank, so the total number of FLOPs and the arithmetic intensity differ slightly between the FPGA and CPU executions. HPL makes use of partial pivoting within the LU factorization, whereas the FPGA implementation does not use pivoting. The results for all benchmarks are reported as an average over all used FPGAs/CPUs, as is done in the original HPCC benchmark suite. Only PTRANS and HPL output the total performance of the whole system.
The performance difference in STREAM between the two FPGA baseline versions is caused by the number of DDR memory banks on the FPGAs. The U280 has two DDR memory banks, whereas the 520N has four, which is directly reflected in the measured bandwidth. The CPU used is equipped with six memory banks and shows the highest bandwidth in this comparison. However, HBM2 can be used on some FPGA boards like the Alveo U280 to increase the STREAM bandwidth considerably [16].
For RandomAccess, the FPGA implementation used on Noctua without any local memory buffers shows clear performance benefits compared to the cached version used on the ETH HACC system and on CPU.
There is a considerable performance difference between the baseline version and the optimized version of LINPACK on Noctua. As we have seen in Figure 15, the efficiency of the benchmark design is reduced for the baseline version due to differences in the memory interface. In addition, the clock frequency of the design is slightly lower than for the optimized design as given in Table 7.
For PTRANS and b_eff, we can observe the largest performance difference between the baseline and the optimized version on FPGA. For b_eff, the communication latency and bandwidth are reported for all systems. For the FPGA baseline versions, the communication latency is very high compared to the CPU and the optimized FPGA version. The data is first copied from the FPGA DDR memory to the CPU memory before it can be transferred via MPI. These additional copies, the latency of the data transfer via PCIe, and the additional latencies introduced by the FPGA runtimes have a huge impact on the overall latency of the baseline communication approach. As our own experiments in Figure 12 show, the b_eff benchmark included in the HPCC suite is not able to achieve a bandwidth close to the theoretical peak of 12.5 GB/s with the 2 MB message sizes used for this measurement. With larger message sizes, bandwidths close to the theoretical peak may be achievable.
The proposed benchmarks are capable of reflecting the advantages of direct inter-FPGA communication in terms of communication bandwidth and latency. The results also show that direct inter-FPGA communication offers new opportunities for scaling FPGA applications over multiple devices.
4 RELATED WORK
Several benchmark suites for FPGA provide benchmarks for OpenCL or HLS frameworks [2, 17, 24, 25]. However, the benchmarks often use small input sizes or come with fixed kernel designs that allow no easy scaling to larger sizes. With Spector [7], a benchmark suite exists that provides configuration parameters for each benchmark to support design space explorations. Also, HPCC FPGA [15] comes with parametrizable benchmarks and is based on the HPC Challenge benchmark suite [6] for CPU, so the benchmarks target the HPC domain. But it lacks implementations for some of the benchmarks. None of the suites supports the performance characterization of a multi-FPGA system and its inter-FPGA communication networks.
The most complex benchmark that we added to the benchmark suite is LINPACK, where the most compute-intensive part is the LU decomposition. Several scalable, double-precision blocked LU decompositions written in a Hardware Description Language (HDL) already exist. A multi-FPGA implementation of LU decomposition is proposed in Reference [8]. The implementation uses up to five Virtex-II FPGAs arranged in a star topology for the calculation on a single matrix. The FPGAs need to be reconfigured several times during computation and calculate the LU decomposition of an 8,192 \(\times\) 8,192 matrix with double-precision complex numbers using five FPGAs in 1,862.41 seconds. Wu et al. [23] propose a single-FPGA implementation for Virtex-5 FPGAs that reaches 8.5 GFLOP/s and that can be easily extended for multi-FPGA execution. Jaiswal and Chandrachoodan also propose a scalable double-precision block LU decomposition implementation for Virtex-5 FPGAs [11] written in Verilog. They report achieving more than 120 GFLOP/s when scaling over eight FPGAs.
Turkington et al. [21] implement the LINPACK 1000 benchmark in Handel-C and achieve more than 2.5 GFLOP/s on a Stratix II. Wu et al. [22] achieved more than 3.6 GFLOP/s with their HDL implementation of the same benchmark. Both implementations also include pivoting. Since we rather follow the rules proposed for HPL-AI, pivoting is not part of our implementation.
Zohouri et al. [25] implemented an OpenCL single-precision LU decomposition without pivoting within the Rodinia FPGA benchmark suite. Execution on an Intel Arria 10 FPGA with a matrix size of 8,192 elements resulted in a performance above 366.5 GFLOP/s. Our implementation requires much larger matrix sizes to achieve its peak performance but already achieves 493.7 GFLOP/s on a Stratix 10 with the given matrix size. However, since our implementation is also scalable over multiple FPGAs, the overall performance is not limited by the resources of a single FPGA.
5 CONCLUSION
In this work, we extended the HPCC FPGA benchmark suite with support for multi-FPGA systems and their inter-FPGA communication interfaces. To this end, we proposed a scalable version of the RandomAccess benchmark and extended all existing benchmarks with multi-FPGA support. Moreover, we added three new benchmarks, b_eff, PTRANS, and LINPACK, for Xilinx and Intel FPGAs that stress inter-FPGA communication and provided baseline implementations via MPI and PCIe for all of them. The baseline implementations show similar normalized performance and scaling behavior on our two evaluation systems with up to 26 BittWare 520N and four Xilinx Alveo U280 boards.
To show the extendability of the benchmark suite with support for vendor-specific communication interfaces, we also provided implementations with IEC for direct point-to-point connections between FPGAs. Evaluation of the vendor-specific and the baseline implementations revealed the advantages of direct inter-FPGA communication over communication via MPI not only for the communication-bandwidth-bound applications but also for computation-bound applications like LINPACK.
With LINPACK, we also proposed a well-scaling LU decomposition implementation on a 2D torus. The evaluation showed that the performance of the implementation is limited by the aggregated matrix multiplication performance of the devices used. With further architecture-specific optimizations that increase the clock frequency of the implementation, more than 1 TFLOP/s per FPGA on the BittWare 520N is within reach with the proposed design.
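For reference, LINPACK performance figures like the ones above are conventionally derived from the nominal operation count of an LU-based solve, commonly taken as (2/3)n³ + (3/2)n² floating-point operations, rather than from the operations a specific implementation actually executes. A small helper (the name `linpack_gflops` is our illustrative choice) shows the conversion from matrix size and runtime to GFLOP/s:

```python
def linpack_gflops(n, seconds):
    """GFLOP/s as conventionally reported for HPL-style LINPACK runs.

    Uses the nominal operation count of an LU-based solve,
    (2/3) * n^3 + (3/2) * n^2, independent of the actual algorithm.
    """
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1.0e9
```

Because the operation count grows cubically while communication volume grows only quadratically with n, larger matrices amortize communication better, which is consistent with our observation that peak per-FPGA performance requires large matrix sizes.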
We made the extended version of HPCC FPGA publicly available to facilitate active participation in its development towards a performance characterization tool for HPC multi-FPGA systems and their inter-FPGA communication interfaces. Although we provide widely applicable baseline implementations with this work, the suite does not yet contain optimized benchmark implementations that utilize the direct inter-FPGA communication capabilities of Xilinx FPGAs. A further extension of the benchmark suite with support for communication libraries such as EasyNet and ACCL is a step towards filling this gap.
ACKNOWLEDGMENTS
The authors gratefully acknowledge the support of this project through computing time provided by the Paderborn Center for Parallel Computing (PC2) and the Systems Group at ETH Zurich, as well as the AMD Heterogeneous Accelerated Compute Clusters (HACC) program for access to their FPGA evaluation system.
REFERENCES
- [1] BittWare Ltd. 2020. BittWare OpenCL S10 BSP Reference Guide. Rev. 1.3.
- [2] . 2010. The scalable heterogeneous computing (SHOC) benchmark suite. In 3rd Workshop on General-Purpose Computation on Graphics Processing Units. Association for Computing Machinery, New York, NY, 63–74.
- [3] . 2019. Streaming message interface: High-performance distributed memory programming on reconfigurable hardware. In International Conference on High Performance Computing, Networking, Storage and Analysis (SC'19). Association for Computing Machinery, New York, NY.
- [4] Jack Dongarra, Piotr Luszczek, and Yaohung Tsai. 2022. HPL-MXP Mixed-Precision Benchmark. Retrieved from https://hpl-mxp.org.
- [5] . 1997. Key concepts for parallel out-of-core LU factorization. Parallel Comput. 23, 1-2 (1997), 49–70.
- [6] . 2004. Introduction to the HPCChallenge Benchmark Suite. Technical Report. Defense Technical Information Center, Fort Belvoir, VA.
- [7] . 2016. Spector: An OpenCL FPGA benchmark suite. In International Conference on Field-Programmable Technology (FPT'16). 141–148.
- [8] . 2007. Performance of a LU decomposition on a multi-FPGA system compared to a low power commodity microprocessor system. Scal. Comput.: Pract. Exper. 8, 1 (2007).
- [9] . 2021. EasyNet: 100 Gbps network for HLS. In 31st International Conference on Field-Programmable Logic and Applications (FPL'21). IEEE Computer Society, 197–203.
- [10] . 2021. ACCL: FPGA-accelerated collectives over 100 Gbps TCP-IP. In IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'21). 33–43.
- [11] . 2012. FPGA-based high-performance and scalable block LU decomposition architecture. IEEE Trans. Comput. 61, 1 (2012), 60–72.
- [12] . 2018. OpenCL-ready high speed FPGA network for reconfigurable high performance computing. In International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia'18). Association for Computing Machinery, New York, NY, 192–201.
- [13] . 2021. HPCC FPGA Evaluation Data.
- [14] . 2021. HPCC_FPGA.
- [15] . 2020. Evaluating FPGA accelerator performance with a parameterized OpenCL adaptation of selected benchmarks of the HPCChallenge benchmark suite. In IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'20). 10–18.
- [16] . 2022. In-depth FPGA accelerator performance evaluation with single node benchmarks from the HPC challenge benchmark suite for Intel and Xilinx FPGAs using OpenCL. J. Parallel Distrib. Comput. 160 (2022), 79–89.
- [17] . 2015. CHO: Towards a benchmark suite for OpenCL FPGA accelerators. In 3rd International Workshop on OpenCL (IWOCL'15). Association for Computing Machinery, New York, NY.
- [18] . 2018. Application partitioning on FPGA clusters: Inference over decision tree ensembles. In 28th International Conference on Field Programmable Logic and Applications (FPL'18). IEEE, 295–2955.
- [19] Rolf Rabenseifner. 1999. Effective Bandwidth (b_eff) Benchmark. Retrieved from https://fs.hlrs.de/projects/par/mpi/b_eff/.
- [20] TOP500.org. 2022. TOP500 Supercomputer Sites. Retrieved from https://www.top500.org/lists/top500/2021/11/.
- [21] . 2006. FPGA based acceleration of the LINPACK benchmark: A high level code transformation approach. In International Conference on Field Programmable Logic and Applications. 1–6.
- [22] . 2009. A fine-grained pipelined implementation of the LINPACK benchmark on FPGAs. In 17th IEEE Symposium on Field Programmable Custom Computing Machines. 183–190.
- [23] . 2012. A high performance and memory efficient LU decomposer on FPGAs. IEEE Trans. Comput. 61, 3 (2012), 366–378.
- [24] . 2018. Rosetta: A realistic high-level synthesis benchmark suite for software-programmable FPGAs. In International Symposium on Field-Programmable Gate Arrays (FPGA'18).
- [25] . 2016. Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 409–420.