Abstract
The overlay architecture raises the abstraction level of hardware design and enhances the portability of hardware-accelerated applications. In the FPGA community, there is growing interest in overlay structures, as typified by many-core architectures. They work in theory; in practice, however, they are beset with serious design issues. For example, FPGAs are larger than before, which exacerbates the place-and-route problem. Moreover, with advancing packaging technology such as silicon interposers, a single FPGA is actually an aggregate of several small-to-middle-sized dies. Tightly coupled many-core designs therefore face the covert issue that the wires between these regions are severely restricted. This article proposes efficient essential processing elements, a micro-architecture design, and an interconnect architecture toward a scalable many-core overlay design. In particular, our work proposes a novel compact buffering technique that reduces memory resource utilization in tightly connected overlays while preserving computational efficiency. This technique reduces BlockRAM utilization to nearly 50% of the baseline while achieving a best-case computational efficiency of 91.93% in a three-dimensional Jacobi benchmark. In addition, the proposed enhancements led to around 2× and 3× improvements in performance and power efficiency, respectively. Moreover, the improved scalability allowed increasing compute resources and delivering around 4× better performance and power efficiency compared to the baseline Dynamically Re-programmable Architecture of Gather-scatter Overlay Nodes overlay.
1 INTRODUCTION
Circuit design optimization is a complex and lengthy process that often requires expertise in multiple aspects related to the desired goals. Field Programmable Gate Array (FPGA) overlay architectures face this issue; multiple design properties such as speed, area, power, and computational efficiency can be quite frustrating to improve, even for the most seasoned experts in the field. Frequently, performance-vs.-power and area-vs.-power dilemmas keep chip design perplexing. In fact, improving the performance of parallel computing systems is heavily prone to a surge in power consumption, due to an increase in either clock speed or compute resource density. This article shows a concrete example where these tradeoffs can be well handled by carefully analyzing the optimization goals and their underlying costs to smartly strike the best deal on each aspect. A many-core overlay architecture, called Dynamically Re-programmable Architecture of Gather-scatter Overlay Nodes (DRAGON), was presented in Reference [3], building upon our cumulative work [2, 5, 6]. It is noteworthy that DRAGON managed to achieve an 89.9% Effective to peak Performance Ratio (EPR) in complex scientific benchmarks while preserving its general-purpose versatility and software-programmable flexibility. This high EPR comes at the cost of deploying a separate communication buffer for each of the four directions of data exchange among adjacent Processing Elements (PEs) in the two-dimensional (2D) Mesh tightly connected overlay network. The separate buffers offer separate write and read channels; coupled with an efficient Very Long Instruction Word (VLIW) Instruction Set Architecture, they enable perfect overlapping of memory transfers with computations, effectively reducing the data exchange latency between PEs to zero.
While this approach is efficient for a low interconnect dimension (2D), its sustainability is questionable for higher dimensions because of the increased Block Random Access Memory (BRAM) utilization needed to implement additional communication buffers for the added connections with remote neighboring PEs. Here, we propose enhanced DRAGON (DRAGON2) and its compact-buffering variant (DRAGON2-CB), which implements a novel technique, labeled compact buffering, that remains efficient regardless of the interconnect dimension and reduces the BRAM requirement by more than 50% compared to our previous work, with little to no EPR overhead for multiple dimensions (two, three, and four) of widely used stencil benchmarks. Moreover, we provide a general model of the gains and costs of this technique for any dimension \(N\) of a given stencil benchmark in a DRAGON2-CB overlay with interconnect dimension \(N\). In this article, we re-introduce the DRAGON overlay architecture while summarizing the mainstream enhancements in DRAGON2. Then, we provide a side-by-side comparison against our previous findings. We further introduce an efficient compact buffering technique for \(N\)-dimensional interconnects and subsequently expand our study to investigate scalability over multiple overlay size configurations and multiple interconnect degrees. Ultimately, we thoroughly discuss the impact of compact buffering, compared to the regular scheme, on multiple aspects that include area, power, performance, scalability (in size and degree of interconnect), and computational efficiency (EPR). The main contributions of this work are summarized as follows:
2 BACKGROUND AND MOTIVATION
2.1 FPGA Overlays and Interconnection Networks
With the decline of Moore's law, improving the performance of computing systems shifted from increasing operating frequency to introducing parallelism into the workload by dividing it and dispatching its subsets to multiple, less power-hungry processing cores. FPGA overlays adopted this many-core concept through Coarse Grained Reconfigurable Arrays (CGRAs) to harness the parallel performance of FPGAs while reducing programming model complexity by abstracting the physical fabric hardware details. While we previously surveyed state-of-the-art FPGA overlays in Reference [3], we now shift our focus to their interconnect schemes. The work in Reference [32] surveys time-multiplexed FPGA overlays and organizes their interconnection strategies into four categories, namely, linear interconnect [8, 13], island style [7, 12, 25, 26], network-on-chip [16, 27, 28], and nearest neighbor [9, 33]; the last most closely resembles the interconnect architecture proposed in this work. Generally, the many-core approach achieves better peak performance; however, the sustained performance remains somewhat dependent on the interconnect scheme between the cores as well as the kind of application deployed. While multiple interconnect topologies and their routing components have been examined in previous works, such as Torus [17, 29], Butterfly [19, 31], and Mesh [30, 34, 42], the last remains the most common choice of topology for on-chip interconnection networks [39]. The reasons behind this choice are the low implementation cost and reduced programming complexity, as well as the perfect match with rectangular-shaped tiles of processing elements [48]. Nonetheless, applications with communication patterns that require significant data exchange rates among non-neighboring processing elements would cause a 2D Mesh topology to incur higher transfer latencies (an increased number of hops) and therefore increased power consumption.
To alleviate this issue, the Mesh interconnect degree can be scaled to the dimensionality of the problem pattern (2D Mesh for 2D applications, 3D Mesh for 3D applications, and so on). Scaling the network dimension enhances the communication degree by providing a direct link to more remote neighbors and thus reduces traffic congestion, maintaining overall better compute and power efficiency, as will be shown in the evaluation sections.
2.2 DRAGON
DRAGON is a many-core overlay architecture that embeds multiple parallel programming paradigms such as Single Instruction Multiple Data (SIMD), VLIW, and Decoupled Access Execute (DAE) [3]. In fact, DRAGON is split into two parts that operate in tandem, namely, the Controller and the Accelerator, as depicted by Figure 1. The Controller decouples access to Global Memory (GM) data from the execution of instructions by the PEs on the Accelerator side by providing a data-exchange boundary interface through each Broadcast Memory (BM). A sequencer decodes program instructions and issues a VLIW packet containing two 64-bit-wide instructions. These VLIW packets control the operations of each PE, Broadcast Memory Controller (BMC), and Direct Memory Access (DMA) engine. The PE architecture itself consists of two slots, namely the Dual Compute Slot (DCS) and the Memory Slot (MS), as depicted by Figure 2. The VLIW instruction stream controls the operation of these two slots concurrently. This allows efficient overlapping of computations with internal (between Local Memory and Register File) as well as external (between adjacent PEs) memory transfers. When DRAGON was presented in Reference [3], the focus was on the EPR, regardless of FPGA layout and physical resource details. Therefore, the implementation relied almost completely on the FPGA vendor's software tools to fully control the synthesis, placement, and routing steps. This approach unfairly capped the performance range below what the FPGA target and the overlay architecture itself were able to deliver. For example, in a 2D Jacobi benchmark, DRAGON was able to sustain 89.9% of its Theoretical Peak Performance (TPP). While this EPR is relatively high, the TPP itself (37.44 GFLOPs) is below our expectations and can be dramatically improved. Nonetheless, the cost of such improvement shall not induce a visible impact on the achieved EPR.
The TPP is still given by Equation (1) and suggests that enhancing the performance requires increasing the clock frequency (\(FREQ\)), the number of Broadcast Clusters (\(N_{BC}\)), and/or the number of PEs per BC (\(N_{PE}\)). (1) \(\begin{equation} TPP_{DRAGON} = 2 \times FREQ \times N_{PE} \times N_{BC}. \end{equation}\)
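As a quick sanity check, Equation (1) can be evaluated for the baseline configuration reported above (130 MHz, 3 \(\times\) 3 BCs of 16 PEs each). The following Python sketch is our own illustration, not code from the toolchain; the function name is hypothetical.

```python
def tpp_gflops(freq_mhz, n_pe, n_bc):
    """Theoretical Peak Performance per Equation (1):
    TPP = 2 * FREQ * N_PE * N_BC (two FLOPs per MAC per cycle)."""
    return 2 * freq_mhz * 1e6 * n_pe * n_bc / 1e9

# Baseline DRAGON: 130 MHz, 3x3 BCs of 16 PEs each (144 PEs in total).
baseline = tpp_gflops(freq_mhz=130, n_pe=16, n_bc=9)
print(f"{baseline:.2f} GFLOPs")  # 37.44 GFLOPs, matching the text
```

Doubling any single factor (frequency, PEs per BC, or number of BCs) doubles the TPP, which is why the remainder of the article attacks all three.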
Fig. 1. The DRAGON many-core overlay architecture [3].
Fig. 2. The baseline DRAGON processing element architecture [3].
Unfortunately, many obstacles related to the target FPGA as well as the DRAGON architecture blocked any increase in these parameters. The floating-point accumulation required a single clock cycle and hence created critical paths in the FPGA implementation. Moreover, relying completely on FPGA software tools such as Vitis yielded sub-optimal placement and routing of the design and subsequently a reduced Fmax (maximum operational clock speed) in the delivered bitstream. While these issues can be addressed by carefully guiding the implementation and tweaking the overlay architecture, increasing the number of PEs requires more complex considerations. In fact, two important factors of the target FPGA prohibited scaling the overlay. The first is the limited number of Super Long Lines (SLLs) connecting the Super Logic Regions (SLRs) of the multi-die FPGA to each other [49]. The second, and the most important, is the number of BRAMs, which limited scaling not only the overlay size but also the interconnect degree.
In Reference [3], DRAGON was implemented and evaluated using a 2D Mesh interconnect and a single configuration embedding a total of 144 PEs, organized as 3 \(\times\) 3 BCs, where each BC has 16 PEs (a 2D Mesh 4 \(\times\) 4 configuration). Here, we extend our implementation and evaluation to multiple configurations with different numbers of PEs as well as different degrees of interconnection (2D, 3D, and 4D). Consequently, we first dive into the enhancements implemented in DRAGON2 and later provide a comparison with the baseline DRAGON architecture to show the impact of these improvements. Ultimately, we introduce the compact buffering technique (through DRAGON2-CB) and explain its necessity for overlay scalability, both in size and degree of interconnect. Then, we conduct multiple benchmark-based experiments on the proposed DRAGON2 and DRAGON2-CB to evaluate the benefits (and costs) of the proposed compact buffering in terms of sustained performance, power efficiency, and EPR. Besides, stencil computing remains one of the most important applications of High-Performance Computing (HPC), which has led many researchers to propose numerous approaches, including specialized architectures and programming models, aimed at accelerating this kind of computation [10, 40, 43, 45, 46]. While DRAGON2 is a general-purpose architecture and not purposely built for accelerating stencil computing, we decided to evaluate it through this kind of computation because of its intensive neighbor-communication requirements, which stress the interconnect scheme and the underlying buffering system.
3 ARCHITECTURE AND MICRO-ARCHITECTURE IMPROVEMENTS
While the work in Reference [3] achieved near 90% computational efficiency, its double-precision TPP remained relatively low, primarily because it operated at 130 MHz. The performance bound had multiple causes: the large 1,024-bit-wide Advanced eXtensible Interface (AXI) data bus interface, the non-registered SLR boundary crossing, the single-cycle 32 \(\times\) 32-bit multiplication in the Arithmetic and Logic Unit (ALU), and the Multiply-Accumulate Floating-Point Unit (MAC FPU) architecture, which required an accumulation operation within a single clock cycle. This section proposes multiple enhancements to the baseline DRAGON and introduces the improved PE architecture and the compact buffering scheme.
3.1 An Enhanced PE Micro-Architecture
Figure 3 illustrates the new micro-architecture of the DRAGON2-CB PE, which benefits from the Compact Buffering (CB) scheme and deeper pipelining. The enhanced PE adds a third memory pipeline stage (MEMory3) to register the data output toward adjacent PEs or toward the BM. Moreover, to reduce critical-path issues due to data exchange between two PEs across SLR boundaries, an extra register is inserted (between two PEs) to break that path further. To compensate for the single clock-cycle delay introduced by this register, an extra write-back stage is added to the PE to ensure a synchronized exchange of data among adjacent PEs. Moreover, the integer multiplication in the ALU is now split into two stages to enhance its speed. Besides, the MAC FPU used to take one clock cycle for multiplication and an additional one for accumulation, which was a limiting factor for clock speed. While the multiplication step can be further split into multiple stages, doing so in the accumulation step complicates programming by forcing a multi-threaded approach on the programmer, as will be explained in the programming model section. Nonetheless, to extract the maximum performance, a minimum of four accumulation stages is mandatory to overcome the clock speed limitation of the underlying critical path. Subsequently, the MAC FPU has been stretched to a total of eight pipeline stages, four for the multiplication part and four for the accumulation part. This leads to a total of 15 pipeline stages in the proposed PE, against seven stages in the baseline version. Figure 4 details the new pipeline structure of the MAC FPU. To simplify the required double-precision hardware logic, truncation is set as the default rounding mode in the FPU, following the same approach used in the baseline DRAGON [3]. Ultimately, the most important enhancement here is the concept of compact buffering.
BRAMs have been the bottleneck resource for scalability, and therefore a more efficient low-latency buffer-based data communication mechanism was required. The proposed PE supports this scheme, which is explained in detail in the next subsection.
Fig. 3. Micro-architecture of the proposed deeply pipelined PE.
Fig. 4. Description of each operation on the Floating Point Unit at every pipeline stage.
3.2 Compact Buffering: Lower Memory Footprint with Near-Identical EPR
The compact buffering technique fits multiple First In First Out (FIFO)-based communication buffers into a single BRAM, as depicted by Figure 5, instead of implementing each FIFO-based communication buffer in its own separate BRAM. It also scales efficiently along the interconnect dimension: all communication FIFOs of a 2D interconnect are allocated in a single BRAM, then a single BRAM is gradually added for each dimension added to the interconnect, together with the control logic for managing incoming and outgoing data. The work in Reference [53] proposed an approach to implement multiple full-featured FIFO channels in one BRAM; however, simply fitting all communication buffers following this approach would degrade the sustained performance, in particular for a high degree of interconnect. In contrast, our approach implements standard circular buffers for neighbor communication, which simplifies the control logic by removing unnecessary checks on FIFO-full status flags. Besides, the proposed compact buffering gradually increases the number of BRAMs in 3D and 4D interconnects to provide more bandwidth and maintain a high sustained performance, as confirmed in the experimental evaluation section. This gradual increase means that 2D, 3D, and 4D interconnects require one, two, and three BRAMs, respectively. In contrast, the baseline version requires four, six, and eight BRAMs in 2D, 3D, and 4D interconnects, respectively.
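The BRAM counts stated above can be condensed into a small model. The following Python sketch (our own summary of the counts in the text, with hypothetical function names) contrasts the regular and compact schemes for interconnect dimensions two to four:

```python
def brams_regular(dim):
    """Regular buffering: one BRAM per neighbor direction
    (two directions per interconnect dimension)."""
    return 2 * dim

def brams_compact(dim):
    """Compact buffering: all four 2D sub-buffers share one BRAM,
    plus one extra BRAM per dimension beyond 2D (dim >= 2)."""
    return dim - 1

for dim in (2, 3, 4):
    print(f"{dim}D: regular={brams_regular(dim)}, "
          f"compact={brams_compact(dim)}")
# 2D: regular=4, compact=1
# 3D: regular=6, compact=2
# 4D: regular=8, compact=3
```

Per PE, the saving is thus three BRAMs in 2D and grows to five in 4D.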
Fig. 5. Compact Buffering using a single BRAM that embeds four FIFO-based sub-buffers, one for each incoming data direction among four adjacent PEs, in the case of a 2D interconnect and a write operation.
For example, in the 2D Mesh DRAGON overlay [3], every PE exchanges data with four adjacent neighbors and has a dedicated buffer for transmitting data in each direction (North (N), South (S), East (E), and West (W)), as illustrated by the right side of Figure 6. This requires four BRAMs (one per direction) and imposes the use of a 64-bit five-input multiplexer (one input from the RegFile plus four from the FIFOs, in the MS part of the PE) as well as a 64-bit seven-input multiplexer (in the DCS part of the PE). Clearly, this may not be the most optimal solution for exchanging data with multiple neighbors. In fact, the communication interface was designed with the aim of simplifying the control logic and maximizing the inter-PE bandwidth. Since every BRAM can be used in Simple Dual Port mode to allow simultaneous write and read operations on its separate A and B ports with a 64-bit data width [51], a total of four BRAMs is necessary to allow a same-cycle continuous scatter and gather flow of incoming and outgoing data in an all-to-all manner. Along with this bandwidth maximization, the buffer can use all of the available BRAM locations (512 different addresses). Nonetheless, this bandwidth-maximization approach, as well as the use of the full BRAM size, may be over-provisioned for certain applications, namely stencil-based computations. For example, in 2D Jacobi stencils, the Local Memory (LM) of the PE adopts a single Ultra Random Access Memory (UltraRAM) resource that can hold up to 4,096 different 64-bit words; therefore, a PE can hold a small partition (of a larger 2D problem) with a chunk of data equal to 64 \(\times\) 64. In this situation, the PE has to exchange the halo points on the perimeter of the partition with its neighbors. Consequently, a 64-bit buffer with 128 locations (or with 64 locations in the case of a \(32 \times 32\) partition) is enough to handle such a situation.
Furthermore, the halo data are transferred in a single direction at every clock cycle, and only the corner points of such a partition would be scattered toward different neighbors in the same clock cycle. DRAGON allows overlapping the computations with the scattering of data toward neighbor PEs, in the same clock cycle, using a single VLIW instruction. With regular buffering, this same instruction can send, for example, a corner point of a 2D stencil tile toward two different PE directions. In contrast, compact buffering imposes a single BRAM, and thus a single write port is available. In this situation (and in other applications with traffic patterns where there are more incoming data than available write ports), the solution is to defer scattering the corner data to additional clock cycles using extra scattering instructions. Thus, it is possible to use a single communication buffer (a single write port) at the cost of spending four extra clock cycles (at every stencil iteration) to write the data of these corner points into the communication buffer. By fitting the four communication buffers into a single BRAM, we save the remaining three BRAMs of the baseline DRAGON PE. Other resources, such as Look Up Tables (LUTs), are saved as well. In fact, since the read port of the resulting BRAM provides a single output, the downstream multiplexer logic is simplified, as depicted by the left side of Figure 6: the 64-bit five-input bottom multiplexer now requires only two 64-bit inputs, and the 64-bit seven-input upper multiplexer requires only four 64-bit inputs, effectively resulting in lower logic resource utilization. Figure 6 also depicts how this simplification extends when the interconnect degree is increased to 3D or 4D, through the added dashed lines corresponding to the outputs of the extra BRAMs used for higher interconnect dimensions.
Fig. 6. The impact of compact buffers versus regular buffers on the PE multiplexing logic and the required number of BRAM-based buffers used for the communication with adjacent PE neighbors.
While the write port of each BRAM has three unique inputs, namely the write address (wrADDr), data, and control, each FIFO-based sub-buffer virtually has its own read and write addresses, data input signal, and write enable signal. This abstraction allows multiple sub-buffers to fit inside a single BRAM, provided there is enough memory space. Figure 5 depicts this abstraction and shows how a single write port interface can control multiple virtual sub-buffer write interfaces. In fact, when a write command is requested on a certain sub-buffer, the related write pointer is assigned to wrADDr and the associated data input is assigned to the write port data input. A write command request on any of the available sub-buffers sets the write enable of the BRAM write port. Besides, every sub-buffer has its own address range, defined by its own write pointer. These address ranges may or may not be contiguous; however, they should never overlap, to maintain data integrity. Moreover, each write pointer is incremented following its related write request, and its corresponding range is managed upfront. Obviously, the sum of the address ranges of all sub-buffers should be less than or equal to the total storage capacity of the hosting BRAM. While not shown in Figure 5, the read port of the BRAM is handled in a very similar way to the write port. In fact, it consists of a single input, the BRAM read address, and a single output, the read data. The read pointers of each sub-buffer are managed in a similar manner to the write pointers: they are exclusively assigned to the read port address input, and then incremented, following a read request from the instruction stream.
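The virtual sub-buffer abstraction described above can be sketched behaviorally. The following Python model is our own illustration (class and method names are hypothetical, and it ignores cycle-level timing and write-port arbitration); it shows per-sub-buffer read and write pointers indexing disjoint address ranges of one shared memory:

```python
class CompactBuffer:
    """Behavioral model of several circular sub-buffers sharing one
    memory array, mimicking the virtual-FIFO abstraction: each
    sub-buffer owns a disjoint address range plus private read and
    write pointers, while the physical memory has a single write port."""

    def __init__(self, ranges):
        # ranges: {name: (base, size)}, non-overlapping by construction
        depth = max(base + size for base, size in ranges.values())
        self.mem = [None] * depth
        self.base = {n: b for n, (b, _) in ranges.items()}
        self.size = {n: s for n, (_, s) in ranges.items()}
        self.wr = {n: 0 for n in ranges}  # per-sub-buffer write pointer
        self.rd = {n: 0 for n in ranges}  # per-sub-buffer read pointer

    def write(self, name, data):
        # One write per clock cycle: the selected sub-buffer's pointer
        # drives wrADDr and its data input drives the write port.
        self.mem[self.base[name] + self.wr[name]] = data
        self.wr[name] = (self.wr[name] + 1) % self.size[name]

    def read(self, name):
        data = self.mem[self.base[name] + self.rd[name]]
        self.rd[name] = (self.rd[name] + 1) % self.size[name]
        return data

# Four directional sub-buffers (N, S, E, W) sharing one 512-entry BRAM.
buf = CompactBuffer({"N": (0, 128), "S": (128, 128),
                     "E": (256, 128), "W": (384, 128)})
buf.write("S", 3.14)            # halo point gathered from the South
assert buf.read("S") == 3.14    # FIFO order preserved per sub-buffer
```

Because the ranges are disjoint, a write to one sub-buffer can never corrupt another, which is the data-integrity condition stated above.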
In an \(N\)-dimensional interconnect with \(N=3\), for example, the total number of connected PE neighbors is equal to six. Four of these neighbors reside in the same 2D plane, whereas the other two reside in the third dimension. While it is entirely possible to fit each of the six FIFO communication buffers inside the same BRAM by splitting the storage space into equal or even different sub-buffer spaces (every sub-buffer handling incoming and outgoing data from a single direction), this would unavoidably harm the EPR, because an important portion of the overall execution time would then be dedicated solely to transferring all the extra-dimension halo points (increasingly more than just four corner points, depending on the stencil tile size in each PE). The baseline DRAGON overlay achieves a relatively high EPR (89.9%), notably because it hides the cost of inter-PE data transfers by fusing them with the effective computation at the instruction level. The goal of the compact buffering proposed here is to reduce the BRAM requirement for any \(N\)-dimensional interconnect while preserving, to the best extent, the previously achieved EPR values. Hence, the sub-buffers of the 2D plane are grouped together, and any sub-buffer handling a direction of the \(N\)-dimensional interconnect beyond 2D should reside in the same BRAM as the other sub-buffers handling directions belonging to the same dimension. Consequently, every extra dimension added to the interconnect requires adding an extra BRAM to handle concurrent transfers in the different dimensions, keeping the extra-cycle cost at four per stencil computation iteration. This effectively saves more BRAMs, allows better scalability, and leads to overall improved performance while maintaining nearly the same EPR.
Figure 7 illustrates an example of the virtual relative placement of communication buffers as well as some of the possible dataflows between adjacent PEs in the case of a 2D interconnect. Compared to our previous work, any data transfer must flow strictly in a single direction. While every PE data output is still connected to an input of each of the four neighbors, the single BRAM that contains the four FIFO-based sub-buffers (N, S, E, and W FIFOs) can only allow a single write operation per clock cycle and therefore accepts a single incoming datum from among all connected directions.
Fig. 7. Virtual relative placement of communication buffers and directions of data transfers for scatter and gather operations in the case of a 2D Mesh interconnect.
The scatter and gather operations can be fulfilled using R-type instructions (computational register-based instructions) or the N-type class of instructions, such as NPASS and Neighbor-Scatter-Gather (NSG) [3]. The dataflow across adjacent PEs is depicted by Figure 7 and can be described as follows: given the position indexes \((i,j)\) of a certain PE on the 2D grid, scattering data to its North PE neighbor means that PE\(_{i,j}\) scatters the same data to all of its four PE neighbors; nonetheless, these data will only be stored in the "S FIFO" communication buffer of PE\(_{i-1,j}\). In the same clock cycle, PE\(_{i,j}\) stores into its own "S FIFO" buffer the gathered input data from its South neighbor PE\(_{i+1,j}\). The concept is similar for all directions of transfer. In fact, scattering data in one direction means that these data are written into the opposite-side communication buffer. For example, when PE\(_{i,j}\) scatters data to the East, PE\(_{i,j+1}\) stores it in its "W FIFO." Subsequently, PE\(_{i,j}\) gathers data from PE\(_{i,j-1}\) into its own "W FIFO." Ultimately, in 3D and 4D interconnects and beyond, the dataflow concept remains the same. The only difference is that new remote connections have to be established, which means new possible directions for the scattered data. Nonetheless, the proposed compact buffering technique requires adding an extra BRAM for each extra dimension of the interconnect to maintain the same 2D overhead of just four extra clock cycles when data have to be scattered to two different-direction recipients at each data tile corner, with the overall goal of keeping the impact on the EPR barely noticeable.
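The scatter-to-opposite-buffer rule can be captured in a few lines. The following Python sketch is our own illustration (the function names and the row-grows-southward convention are assumptions based on the PE indexing used in the text); it maps a scatter direction to the sub-buffer used by the receiving neighbor:

```python
# When a PE scatters toward a direction, the targeted neighbor stores
# the data in the sub-buffer named after the OPPOSITE direction.
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}

def receiving_buffer(scatter_dir):
    """FIFO in which the targeted neighbor stores the scattered data."""
    return OPPOSITE[scatter_dir] + " FIFO"

def neighbor_index(i, j, scatter_dir):
    """Grid index of the neighbor targeted by the scatter (hypothetical
    helper; the row index grows southward, matching PE(i+1,j) = South)."""
    di, dj = {"N": (-1, 0), "S": (1, 0),
              "E": (0, 1), "W": (0, -1)}[scatter_dir]
    return i + di, j + dj

# PE(i,j) scatters East: PE(i,j+1) stores the data in its "W FIFO".
assert receiving_buffer("E") == "W FIFO"
assert neighbor_index(2, 2, "E") == (2, 3)
```

The same table extends to 3D/4D by adding an opposite pair per extra dimension.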
4 FPGA-RELATED OPTIMIZATIONS
4.1 Enhancing Clock Speed through Deeper and Smarter Pipelining
While FPGA compilation tools have seen significant improvements during the past decade, they still rely on user expertise regarding the available synthesis and implementation strategies. The TPP that a design can achieve is directly tied to its maximum clock frequency. From this viewpoint, improving the performance involves breaking critical paths into smaller, equal-length portions. This can be done at the Register Transfer Level (RTL) description as well as during the implementation phase. The former suggests better pipelining of the longest paths that limit the clock speed. We conducted a timing analysis of our design and found multiple culprit sections amenable to improvement. These include the MAC FPU (addressed in Section 3.1), the High-Bandwidth Memory (HBM)-to-BC AXI bus, and the direct interconnect link between PEs along an SLR crossing. The AXI bus that connects the HBM2 banks to their corresponding BCs had the longest critical path of all. All HBM2 banks reside solely in the FPGA bottom SLR, while BCs are spread across all SLRs, meaning that in the worst case one AXI bus has to cross two SLR boundaries. Pipelining an AXI bus is not a straightforward process and requires careful consideration of the underlying protocol, which forbids any combinatorial path between input and output signals [4]. Therefore, we used a skid buffer, also known as a Carloni buffer [1]. This buffer is basically the smallest possible FIFO, because it has only two storage locations, and thus can be implemented with a very small footprint using registers instead of BRAMs or LUTmems. In fact, this kind of buffer decouples both sides of a ready/valid handshake interface without inserting any combinatorial path while still allowing back-to-back transfers, hence effectively inserting a pipeline stage into the AXI bus.
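To make the skid-buffer behavior concrete, the following cycle-level Python model (our own behavioral sketch, not the RTL) shows the two storage locations and the key property that the upstream ready signal depends only on registered state, so no combinatorial path links the downstream ready back to the upstream interface:

```python
class SkidBuffer:
    """Behavioral model of a two-entry skid (Carloni) buffer.

    in_ready is derived only from registered state (the skid slot),
    so there is no combinatorial path from out_ready back to the
    upstream side -- the property required to legally pipeline a
    ready/valid (AXI-style) channel."""

    def __init__(self):
        self.main = None  # primary output register
        self.skid = None  # overflow register filled during a stall

    def in_ready(self):
        return self.skid is None  # can absorb one more word

    def out_valid(self):
        return self.main is not None

    def step(self, in_valid, in_data, out_ready):
        """Advance one clock edge; returns (accepted, popped_data)."""
        accepted = in_valid and self.in_ready()
        popped = None
        if self.out_valid() and out_ready:       # downstream takes a word
            popped = self.main
            self.main, self.skid = self.skid, None
        if accepted:                             # upstream word lands in
            if self.main is None:                # the free slot; back-to-
                self.main = in_data              # back transfers possible
            else:
                self.skid = in_data
        return accepted, popped
```

When the downstream side stalls, the skid slot catches the word already in flight, so nothing is lost, and throughput returns to one word per cycle once the downstream ready is re-asserted.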
In total, we inserted six pipeline stages in the data and control paths of each AXI bus: one stage near each HBM2 bank, one stage near each BC, and two around every SLR crossing. While not all paths cross SLR boundaries, the extra pipeline stages are kept to maintain synchronization among all paths. In addition, to reduce the critical path created by the output of each PE that crosses an SLR boundary, we added two extra pipeline registers. One is embedded into the PEs by adding a third memory stage, while the other is external to the PE to allow more placement flexibility for the compilation tools. To compensate for the additional single clock-cycle latency created by the latter, an additional write-back stage is added to the PE, as depicted by Figure 3. Ultimately, the instruction stream was the most important high-fan-out critical-path signal. Since it was connected to all the PEs across multiple SLRs, it forcibly crossed SLR boundaries. Therefore, we manually replicated this signal to reduce its fan-out and added multiple pipeline stages, considerably improving routability and the resulting timing.
4.2 Layout-aware Floorplanning
To assist the placement effort, RTL design files are tagged with Xilinx pragmas (RLOC) that mark the relative locations of hardware instances on a virtual grid with \(X\) and \(Y\) axes. These pragmas help improve placement time and quality by providing valuable information to the compilation tool about the relative positioning of multiple components, such as AXI pipeline buffers, BCs, and the locations of PEs inside a BC. In contrast, using absolute locations, despite being more deterministic, can lead to sub-optimal results because of the lack of flexibility.
Vivado is arguably a near-excellent tool that allows extensive control over every step of the design process. In contrast, Vitis provides less direct control and implements a higher abstraction layer, because it tries to attract more software-oriented users. Consequently, enabling Vivado-level control incurs extra effort and less Graphical User Interface–based usage. To allow more placement control, a <pre opt_design> step script has been included in the Vivado configuration file used during compilation. In fact, Vitis creates bounded regions called pblocks that can be used to refine the placement. However, these pblocks are currently limited to the SLR level. Our placement script profits from these established pblock descriptions and uses regular expressions to map module instances, and even signals, to the desired SLR locations. By combining both methods (RLOC pragmas and the <pre opt_design> script), we managed to obtain the desired placement outcome. An example of a portion of this script is given in Listing 1, which shows how BCs (BRCLUSTER instances) are mapped to the corresponding SLR region in a 3 \(\times\) 3 BC configuration.
Listing 1. An example of some contents of a pre-opt_design step placement guidance script.
4.3 Reducing the Amount of Required SLLs in SLR Boundary Crossing
The work in Reference [18] proposes an automated floorplanning framework for High-Level Synthesis (HLS)-based designs on multi-die FPGAs that aims to improve the operating clock speed. Nonetheless, it does not discuss an important aspect of such devices: the limited wiring that exists between the dies. Here, we try to quantify this limit, in particular for DRAGON2 and the target hardware. Our target FPGA in the Alveo U280 acceleration card contains three dies known as SLRs [49], which are interconnected through special wires called SLLs [49]. The base SLR region (SLR0) shown in Figure 8 contains all 32 HBM2 banks; therefore, the DRAGON2 Controller is placed on this SLR to keep the DMA engines as close as possible to these banks. The DRAGON2 Accelerator, however, spans all three SLR regions. To maintain a balanced resource distribution over these regions, the number of BCs should be identical in every SLR. These BCs are organized in a configuration of NROW rows by NCOL columns, where columns lie in the same direction as the SLRs and do not cross their boundaries. As such, the number of BCs, NBC, is the product of NROW and NCOL. Each BC embeds an AXI Bridge that exchanges data between its BM and its corresponding HBM2 bank using the AXI4 protocol. Figure 8 shows that a balanced resource distribution across SLRs causes an unbalanced SLL requirement to connect the HBM2 banks and their BCs. In fact, the number of SLLs required to connect BCs to their HBM2 banks between SLR0 and SLR1 is double that between SLR1 and SLR2, which limits the overlay’s scalability.
Fig. 8. Interconnect limitation in SLR boundaries due to the unbalanced requirement on the number of SLLs.
The number of SLLs available at an SLR boundary crossing is given by Equation (2), where NLagColSLL is the number of SLLs per Laguna column (here 1,440), NLagCol is the number of Laguna columns per clock region (here 2), and NClkReg is the number of clock regions in one SLR near the boundary crossing (here 8). Consequently, NSLL_available evaluates to 23,040. (2) \(\begin{equation} N_{SLL\_available} = N_{LagColSLL} \times N_{LagCol} \times N_{ClkReg}. \end{equation}\)
The Controller’s DMA moves the data back and forth between the HBM2 banks and their related BMs through WAXIBus wires for read operations and an equal number of wires for write operations. Assuming that a bus width is equal to its data width (ignoring control signal widths), a bidirectional (read and write) 1,024-bit AXI data bus means that WAXIBus is equal to 1,024; hence, 2,048 wires are required for data transfer in both directions. Besides, BCs exchange data across SLR boundaries, and the total bandwidth required depends on the interconnect dimension and topology. For example, a 2D Mesh network has to connect WRow-based-InterconnectBus wires between every two BCs that reside on either side of an SLR-to-SLR boundary. Each of these two sides contains four PEs, each with a 64-bit data output and a 64-bit data input at every clock cycle. Therefore, WRow-based-InterconnectBus is equal to 512 (2 \(\times\) 64 \(\times\) 4). A 4D Mesh network, for example, would quadruple the value of WRow-based-InterconnectBus, as depicted by Figure 9. Assuming a balanced distribution of BCs across the three SLR regions, the data movement bottleneck resides in the boundary crossing between the base SLR0 and SLR1, which requires a number of SLLs NSLL_required defined by Equation (3), where NSLR is the number of SLRs (here 3), (3) \(\begin{equation} N_{SLL\_required} \ge N_{ROW} \times \left[ W_{Row-based-InterconnectBus} + (N_{SLR}-1) \times \left(2 \times W_{AXIBus} \times \frac{N_{COL}}{N_{SLR}}\right)\right]\!. \end{equation}\)
Fig. 9. Wiring between processing elements in a 2D or a 3D or a 4D Mesh network.
Rewriting Equation (3) leads to the formula given by Equation (4), which sets the upper bound on the AXI data bus width for a given number of dies (NSLR), a given cluster configuration (NROW and NCOL) and a given interconnect (WRow-based-InterconnectBus), (4) \(\begin{equation} W_{AXIBus} \le \left[ \frac{1}{2} \right] \times \left[ \frac{N_{SLR}}{N_{SLR}-1} \right] \times \left[ \frac{N_{SLL\_available}-(N_{ROW} \times W_{Row-based-InterconnectBus})}{N_{ROW} \times N_{COL}} \right]\!. \end{equation}\)
By analyzing Equation (4), we found that reducing each HBM2 AXI data bus width from the original 1,024-bit width in the baseline DRAGON to just 256-bit width would allow scaling DRAGON2 both in size and degree of interconnect with respect to the available resources, as will be shown in the scalability evaluation section.
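These bounds are easy to check numerically. The sketch below (function and variable names are ours, not from the DRAGON2 sources) evaluates Equations (2) and (4); the test assumes a 3 \(\times\) 6 BC configuration for a 288-PE overlay, which caps the bus width well below 1,024 bits and motivates the 256-bit choice.

```c
/* Available SLLs at an SLR boundary, per Equation (2). */
int sll_available(int n_lagcol_sll, int n_lagcol, int n_clkreg) {
    return n_lagcol_sll * n_lagcol * n_clkreg;
}

/* Upper bound on the AXI data bus width, per Equation (4).
 * The multiplication is done before the division to avoid truncation. */
int axi_bus_width_bound(int n_sll_available, int n_slr, int n_row, int n_col,
                        int w_interconnect) {
    long num = (long)(n_sll_available - n_row * w_interconnect) * n_slr;
    long den = 2L * (n_slr - 1) * n_row * n_col;
    return (int)(num / den);
}
```

For the baseline 3 \(\times\) 3, 2D configuration the bound still admits a 1,024-bit bus, but growing the overlay or the interconnect dimension quickly pushes it below 1,024, while 256 bits keeps headroom in every configuration considered here.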
4.4 Layout-aware Interconnect Generation
Overlay architectures often consist of multiple PEs operating in a concurrent manner and interconnected through a certain scheme. Here DRAGON2 adopts a tightly connected Mesh-based network that allows data exchange between PEs using FIFO-based communication buffers. Other topologies such as Torus may be adopted as well. Besides, the interconnect dimension is defined by the number of adjacent neighbors connected to each PE. Figure 9 depicts an example overlay with 144 PEs and illustrates the neighborhood wiring of the DRAGON2 PEs in a 2D, 3D, or 4D Mesh configuration.
In 2D, the BCs are organized as 3 \(\times\) 3 (12 \(\times\) 12 PEs), while in 3D and 4D configurations, they are organized as 1 \(\times\) 3 \(\times\) 3 (4 \(\times\) 12 \(\times\) 3 PEs) and 1 \(\times\) 1 \(\times\) 3 \(\times\) 3 (4 \(\times\) 4 \(\times\) 3 \(\times\) 3 PEs), respectively. For example, in a 2D interconnect, all BCs and all PEs belong to a single plane where every PE can communicate with four adjacent neighbors, marked in Figure 9 by N, S, E, W. In contrast, a 3D interconnect adds an extra dimension and reorganizes the BCs into three planes, where each plane contains 1 \(\times\) 3 BCs (4 \(\times\) 12 PEs). In this interconnect dimension, a single plane spans the three SLRs depicted in Figure 9, while a single SLR contains three BCs, each belonging to a different plane. Therefore, the original 2D interconnect between planes is removed and replaced by a remote horizontal connection between PEs that adds two remote neighbors, marked in Figure 9 by RN (Remote North) and RS (Remote South). This remote connection is intentionally kept in the horizontal direction inside the SLR to avoid SLR boundary crossings, which reduces SLL utilization and simplifies the implementation. Ultimately, a 4D interconnect adds a fourth dimension: it removes the direct link between each pair of PEs across SLR boundaries and adds instead, for all PEs, two remote neighbor connections across SLRs, marked in Figure 9 by RE (Remote East) and RW (Remote West).
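The neighbor relations above can be sketched as index arithmetic on a flattened 4D grid; the coordinate order and names are our illustration (a 2D overlay is the special case W = Z = 1, and the RN/RS and RE/RW links correspond to steps along the Z and W axes).

```c
/* Hypothetical flattening of a PE position (w, z, y, x) in a 4D Mesh. */
typedef struct { int w, z, y, x; } pe_pos;

int pe_index(pe_pos p, int W, int Z, int Y, int X) {
    (void)W;  /* the outermost extent does not enter the flat index */
    return ((p.w * Z + p.z) * Y + p.y) * X + p.x;
}

/* Flat index of the neighbor one step along `dim`
 * (0 = X for E/W, 1 = Y for N/S, 2 = Z for RN/RS, 3 = W for RE/RW),
 * or -1 when the step leaves the grid (no wrap-around is assumed). */
int pe_neighbor(pe_pos p, int dim, int step, int W, int Z, int Y, int X) {
    int *coord[4] = { &p.x, &p.y, &p.z, &p.w };  /* p is a local copy */
    int limit[4] = { X, Y, Z, W };
    *coord[dim] += step;
    if (*coord[dim] < 0 || *coord[dim] >= limit[dim]) return -1;
    return pe_index(p, W, Z, Y, X);
}
```

In a 2D 12 \(\times\) 12 overlay, for instance, the East neighbor of PE 0 is PE 1 and stepping off the grid edge yields no neighbor.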
5 EXAMPLE APPLICATION: STENCIL COMPUTING
Iterative Stencil Loop (ISL) is a class of algorithms in which a problem is structured as an \(N\)-dimensional grid of points that depend on each other following a fixed pattern called a stencil. This class of algorithms is widely found in scientific computing, with applications covering areas such as tsunami simulation [35], weather forecasting, computational fluid dynamics [14], and image processing [38]. The method consists of iteratively updating the grid points of a given stencil problem by applying certain coefficients to their surrounding points, following a certain stencil shape. For example, the Jacobi and Laplace equations are among the stencil-based computations whose computational models are represented through iterative calculations and whose patterns (in their 2D versions) have a star-like shape, as illustrated in Figure 10(a). The equations representing the time-iterative update of this kind of stencil are shown in Table 2 for the 2D, 3D, and 4D versions. These equations are implemented as assembly programs that are compiled for different grid sizes, to be used later in the evaluation of the proposed compact buffering technique. Figure 10 illustrates some of the concepts of stencil computing. Figure 10(a) shows how a certain 2D problem is organized as a 2D grid of stencil points. Figure 10(b) shows how this grid can be decomposed further into multiple tiles. Figure 10(c) shows how a parallel computation can be achieved by mapping these tiles onto different PEs. It also shows the dependence between the stencil points in the neighboring borders of each PE; these points are usually called ghost/halo points [44]. Ultimately, Figure 10(d) illustrates an example where these PEs exchange the ghost points with their neighbors and store them locally into dedicated communication buffers.
Fig. 10. Concepts of problem decomposition, ghost points and halo exchange through buffers in stencil computations.
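For reference, a plain-C sketch of one 2D Laplace-style update over a tile whose border rows and columns hold the ghost points received from neighbors (the layout and names are illustrative, not the DRAGON2 memory map):

```c
/* One update sweep over the interior of an n-by-n tile: each point becomes
 * the average of its four star-stencil neighbors. Row/column 0 and n-1 hold
 * ghost points, as they would after a halo exchange. */
void laplace2d_step(int n, double in[n][n], double out[n][n]) {
    for (int y = 1; y < n - 1; y++)
        for (int x = 1; x < n - 1; x++)
            out[y][x] = 0.25 * (in[y - 1][x] + in[y + 1][x] +
                                in[y][x - 1] + in[y][x + 1]);
}
```

An iterative solver then swaps `in` and `out` and repeats for the desired number of iterations, exchanging the halos between tiles before each sweep.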
6 PROGRAMMING MODEL
The DRAGON2 overlay is implemented as an RTL kernel and is abstracted as a software function that is callable from a host as a task in an OpenCL program. The arguments of this function consist of the program reload switch, the program size, and the HBM2 memory bank pointers. The program reload switch decides whether to reload the program instructions on the FPGA or to re-use the previously loaded ones. The program size argument provides the size (in Bytes) of the overlay program instructions. Besides, memory pointer arguments connect each BC to a corresponding HBM2 bank in a peer-to-peer manner. An OpenCL host program populates the arguments of the DRAGON2 OpenCL task (the function that abstracts the DRAGON2 overlay and provides a data and control interface to it). Then, the host adds this task to its OpenCL queue and schedules it for execution. When the host sets the program reload switch argument of the DRAGON2 task, the overlay enters a boot state where it fills its Instruction Memory (IM) with the program instructions that the host stored in a dedicated HBM2 bank. Otherwise, the boot state is bypassed and the overlay executes whatever instructions were previously stored in its IM. During program execution, when a STOP instruction is encountered by the DRAGON2 overlay, its control unit notifies the host, which consequently reads back the updated contents of each HBM2 bank. The stencil benchmarks move the input/output data from/to the HBM2 banks only at the start/end of the computation.
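The reload semantics can be summarized with a toy host-side model (plain C, not the actual OpenCL calls; the structure and names are ours): the boot state is entered only when the reload switch is set, and a run without any loaded program is an error.

```c
typedef struct {
    int im_loaded;  /* instructions resident in the overlay IM */
    int ran;        /* a STOP instruction has been reached */
} dragon2_model;

/* Models one task invocation: when `reload` is set, the IM is (re)filled
 * from the dedicated HBM2 bank; otherwise the previously loaded program
 * runs. Returns 0 on success, -1 on an invalid request. */
int dragon2_run(dragon2_model *d, int reload, int prog_size_bytes) {
    if (reload) {
        if (prog_size_bytes <= 0) return -1;  /* nothing to load */
        d->im_loaded = 1;                     /* boot state: fill the IM */
    }
    if (!d->im_loaded) return -1;             /* no program ever loaded */
    d->ran = 1;                               /* execute until STOP, notify host */
    return 0;
}
```

A second invocation with the reload switch cleared then bypasses the boot state and re-executes the resident program, as described above.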
Listing 2. Using VLIW instructions to implement a sliding stencil cyclic buffer in the Register File of the baseline DRAGON in Reference [3].
Listing 3. Using VLIW instructions to implement a sliding stencil cyclic buffer in the Register File of the DRAGON2-CB.
One of the most attractive aspects of DRAGON2 (and DRAGON) is its ability to overcome the Von Neumann memory bottleneck at both the PE and the system level. The first is achieved by overlapping data transfers between the LM and the Register File (and/or data exchanges between PEs) with effective computations. The second is achieved by decoupling Global Memory accesses from instruction execution through a custom-designed Decoupled Access-Execute (DAE) approach. At the PE level, a single VLIW instruction can fulfill up to three memory transfers and two different computations (multiplication and accumulation). This key capability is harnessed to emulate a cyclic buffer that implements a sliding-window technique similar to that used in image processing applications [15]. An example of this cyclic buffer in a 2D benchmark is depicted by the active region area in Figure 11. This region consists of three active lines, where the central line contains the stencil points to be updated, and the upper and lower lines provide the North and South neighbor points. The fourth line below the stencil is used to prefetch new stencil points from LM concurrently with the update of those in the central line. Listing 3 shows a portion of the main loop in the new version of the 2D Laplace benchmark (on the DRAGON2-CB), while Listing 2 shows the equivalent program portion in the baseline DRAGON from Reference [3]. Both versions compute the average of the surrounding points of a stencil (2D Laplace) and efficiently overlap memory transfers (such as load, store, or scatter/gather to/from PE neighbors) with the computations, thanks to the power of the VLIW approach. While the proposed enhancements substantially improve scalability, performance, and power efficiency, they introduce a higher level of programming difficulty, as is noticeable when comparing Listings 2 and 3.
For instance, the compact buffering scheme permits only a single write access into each communication buffer at each clock cycle; therefore, scattering the halo corner points illustrated in Figure 10(b), for example, from a PE toward the North (Up) and East (Right) directions cannot be achieved in the same clock cycle. Consequently, sending data toward two different directions is done in two separate clock cycles using two additional NSG instructions (lines 27 to 33 of Listing 3). Moreover, since the accumulation step in the enhanced MAC FPU architecture is split into four stages, a multi-threading approach is adopted to fully populate the four-stage pipelined accumulator. In fact, instead of fully updating the stencil points one after another, every four contiguous points are grouped and handled together as a set. Besides, the index (i) in Listing 3 represents the global position of the active point in the stencil tile, while the index (j) defines the current pass, or iteration, over the set of four points subject to update. Previously, one multiplication was followed by three multiply-accumulate operations to update a single stencil point. Now, four multiplications are executed on the first pass (j=0) over four successive points to fill the accumulator pipeline, and then on each of the three remaining passes (j=1, 2, or 3), four consecutive multiply-accumulate steps update the stencil points for a given neighbor direction.
Fig. 11. Emulating a sliding window cyclic buffer. Multiple operations using a single VLIW instruction (e.g., compute&store to LM + scatter/gather to/from PE neighbors + load from LM).
In lines 2 to 6 of Listing 3, we start by multiplying the Left neighbor of each of the four points in the current set by a constant coefficient from the Register File (address RSRC1). The index (i) is incremented and the operation is repeated for each of these points. At each of the required four clock cycles, a new point is loaded from LM into the Register File just below the active region of the cyclic buffer, as depicted in Figure 11. Here, (LMADDRLD) in Listing 3 corresponds to the address of the purple point in LM, whereas (RDSTLD) corresponds to the address of the purple point in the Register File. The FMUL and LD operations are packed inside slot1 and slot2 of the same VLIW instruction. Depending on the current position of the stencil point being updated, the Left neighbor is either located in an internal register (RSRC2L) or incoming from another PE and thus stored in the local West sub-buffer (accessed by OPSRCL). (OPSRCL) is pre-computed based on the current index (i) to select the second operand of the multiply operation FMUL. Using this index, (RSRC2L) is also pre-computed for each point being updated, to locate the register address containing its Left neighbor. Subsequently, this is followed by a multiply-and-accumulate pass over the East neighbors of each point in the set (lines 8 to 10 of Listing 3), then over all their South neighbors (lines 12 to 14 of Listing 3), and, finally, over all their North neighbors (lines 17 to 26 of Listing 3), which marks the update completion of all points in the set. The (MODE) field in the FMACCA operations allows storing updated points into the LM address (LMADDR). This address is then incremented and loops back to its origin to start a new stencil iteration after all the tile points have been updated for the current one (line 34 of Listing 3).
Nonetheless, at every step, if the position of the next stencil point (i+1) is still inside the current four-point chunk (line 42 of Listing 3), then the same operation defined by index (j) would be applied to it. Otherwise, when the end of this chunk is reached (line 37 of Listing 3), there are two possible scenarios. First, if all the operations on all the neighbors have been completed (line 41 of Listing 3), then the point position (i) is incremented and a new four-iteration update of the next four-point chunk would start. Otherwise (line 38 of Listing 3), the point index (i) is set on the first point of the chunk to start a multiply-and-accumulate pass (increment j) with respect to the new neighbor direction coefficients.
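The schedule described above can be modeled in plain C: the j = 0 pass issues four multiplications to fill the four-stage accumulator, and each later pass accumulates one more neighbor direction across the same four points (the array layout and names are ours, not the Listing 3 register assignment).

```c
/* Software model of the four-point-set schedule: neigh[p][j] is the j-th
 * neighbor (Left, East, South, North) of point p, and coef[j] is the
 * corresponding stencil coefficient. */
void update_set(double neigh[4][4], const double coef[4], double out[4]) {
    for (int j = 0; j < 4; j++)      /* one pass per neighbor direction */
        for (int p = 0; p < 4; p++)  /* keeps the 4-stage accumulator full */
            out[p] = (j == 0) ? coef[0] * neigh[p][0]
                              : out[p] + coef[j] * neigh[p][j];
}
```

With all coefficients equal to 0.25, each output is the average of the four neighbors, matching the 2D Laplace update.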
7 EXPERIMENTAL ENVIRONMENT
7.1 Experimental Setup
Our experiments were conducted on a Xilinx Alveo U280 acceleration card, which features a custom Ultrascale+ FPGA named XCU280 (equivalent to the XCVU37P) that embeds 8 GB of HBM2 on-chip memory split into two stacks of 4 GB, each with 16 memory banks. DRAGON2 is packed into an RTL kernel using Xilinx Vitis. This design choice allows FPGA runtime control through the OpenCL Application Programming Interface (API). The OpenCL-based host transfers program instructions and data back and forth between the FPGA card and the host through the memory pointers of the enqueued OpenCL task function. Details of the setup used in our experiments are given in Table 1. All DRAGON2 and DRAGON2-CB variations are compiled through Vitis, which uses Vivado 2020.2 in the background for synthesis and implementation. The implementation strategies focus mostly on timing and performance optimization rather than on area saving or power consumption reduction.
| Item | Configuration | | | |
|---|---|---|---|---|
| CPU (FPGA host) | Intel Core i9-9900K CPU 3.60 GHz (8 cores), 64 GB DDR4 RAM | | | |
| Operating System (PC host) | Ubuntu 18.04.1 LTS | | | |
| Accelerator Card | Alveo U280 Data Center Accelerator Card [50] | | | |
| FPGA Framework/Compiler | Xilinx Vitis 2020.2 (64-bit)/Vivado 2020.2 (64-bit) | | | |
| Vivado Steps | Opt_design | Place_design | Route_design | Physical_Opt_design |
| Strategies | ExploreWithRemap | ExtraTimingOpt | NoTimingRelaxation | ExploreWithAggressiveHoldFix |
Table 1. Environment Setup Used in the Experiments
7.2 Evaluation
Our evaluation aims to quantify the impact of the introduced enhancements over the baseline version of DRAGON and to investigate the overlay’s power requirements and performance scalability, as well as its computational efficiency (EPR). Our evaluation approach is divided into three incremental steps. The first step studies the impact of the introduced enhancements on DRAGON2 while maintaining the same buffering scheme used in the baseline DRAGON overlay. The second step focuses on the costs and benefits of further deploying the compact buffering scheme in the specialized DRAGON2-CB version. The first two steps evaluate the same overlay size (144 PEs) interconnected through a 2D Mesh topology, from the viewpoints of performance, power efficiency, and resource utilization. This overlay size and interconnect dimension match the configuration of the baseline DRAGON presented in Reference [3], for a fair comparison. Ultimately, the last evaluation step covers an extensive comparative study of performance, power, resource utilization, computational efficiency (EPR), scalability, and code size between the DRAGON2 (regular buffering) and DRAGON2-CB (compact buffering) overlays to assess the costs and benefits of the proposed buffering scheme. Besides, since a compiler is still in the works, programs are written in C and use an assembly API that abstracts the instruction set opcodes into high-level functions (as can be seen in Listing 3). Therefore, we manually prepared and optimized complex programs that solve the stencil benchmarks presented in Table 2.
Table 2. Benchmarks Used in the Experimental Evaluation
The FPGA compilation is run with dynamic profiling enabled and generates execution times and power consumption reports that can be accessed by the Vitis Analyzer tool. Subsequently, the collected results are further exploited to compute the sustained performance, power efficiency, and EPR for each benchmark. The sustained performance is computed through the division of the total number of operations (on all points of a given problem size) by the execution time. This number of operations can be expressed as Equation (5) where Niters is the number of all the iterations in the stencil computation, SizeTile is the size of a problem partition stored into each LM and NPE is the total number of PEs. Finally, #OPSB is the number of operations required to update a single stencil point in a given benchmark (one multiplication followed by a certain number of multiply-accumulate operations, where each single multiply-accumulate counts as two operations). The number of operations #OPSB for each benchmark in each dimension is given in Table 2. The power efficiency is obtained by dividing the sustained performance by the power consumption, whereas the EPR is computed as the ratio of the sustained to peak performance (the peak performance can be computed by Equation (1)), (5) \(\begin{equation} \#OPS_{DRAGON} = N_{iters} \times \#OPS_{B} \times Size_{Tile} \times N_{PE}. \end{equation}\)
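These metrics reduce to a few one-liners; the sketch below (our names) follows Equation (5) and the definitions of sustained performance, EPR, and power efficiency. The #OPSB = 7 used in the test corresponds to a 2D star stencil (one multiplication plus three multiply-accumulates, each MAC counting as two operations).

```c
/* Equation (5): total operations over all iterations, points, and PEs. */
double ops_total(double n_iters, double ops_b, double size_tile, double n_pe) {
    return n_iters * ops_b * size_tile * n_pe;
}

/* Sustained performance in GFLOPS from an operation count and a runtime. */
double sustained_gflops(double ops, double seconds) {
    return ops / seconds / 1e9;
}

/* Effective-to-peak Performance Ratio and power efficiency. */
double epr(double sustained, double peak) { return sustained / peak; }
double power_eff(double sustained_gflops, double watts) {
    return sustained_gflops / watts;
}
```

For example, 10,000 iterations over a 1,024-point tile on 144 PEs with #OPSB = 7 yields about 1.03 \(\times\) 10^10 operations.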
8 COMPARISON BETWEEN THE PROPOSED DRAGON2 AND THE BASELINE DRAGON OVERLAYS
This section discusses the proposed architecture and technology-related enhancements introduced to DRAGON2, excluding the compact buffering scheme. For a fair comparison with the baseline version presented in Reference [3], the size of the overlay (144 PEs) as well as the dimension of the interconnect (2D Mesh) are kept the same.
8.1 Impact of the Proposed Enhancements on Resource Utilization
Table 3 summarizes the differences in resource utilization between both versions. The results show an increase in BRAMs. In fact, the enhanced DRAGON2 uses two BRAMs for each Register File, while the baseline version uses a mix of a single BRAM coupled to distributed RAMs (LUTmems). This design choice provides a more compact area for each PE. Besides, the communication buffers are still implemented using a single BRAM each; thus, four BRAMs are used to store incoming data from the adjacent PEs in each direction (North, South, East, and West). The proposed version saves more LUTmems in the implementation of the Register File; however, this gain is not clearly visible because more LUTmems and registers are used to absorb the long shift registers that implement the pipelined data and control signals in the new PE, which requires 15 pipeline stages compared to just seven in the baseline version. Besides, the same amounts of Ultra Random Access Memory (URAM) and Digital Signal Processing (DSP) blocks are used in both versions at all hierarchy levels, which suggests that the splitting of the MAC unit, in particular the extra pipelining of the multiplication function, makes better use of the internal structure of the DSP48E2 multipliers and thus leads the PE to use around 32% fewer LUTs than the baseline DRAGON PE.
| | PE | | BC (16 PEs) | | OVERLAY (9 BCs) | | OVERLAY+FPGA shell | |
|---|---|---|---|---|---|---|---|---|
| Resource | DRAGON2 | DRAGON | DRAGON2 | DRAGON | DRAGON2 | DRAGON | DRAGON2 | DRAGON |
| LUT | 2,231 | 3,297 | 36,631 | 54,134 | 352,161 | 505,202 | 514,406 | 657,622 |
| LUTmem | 251 | 296 | 4,075 | 4,736 | 36,774 | 58,835 | 53,688 | 86,205 |
| REG | 3,020 | 1,815 | 55,449 | 29,574 | 546,802 | 281,765 | 737,745 | 531,152 |
| BRAM | 6 | 5 | 96 | 80 | 864 | 720 | 1,070 | 940 |
| URAM | 1 | 1 | 32 | 32 | 304 | 304 | 304 | 304 |
| DSP | 13 | 13 | 208 | 208 | 1,872 | 1,872 | 1,876 | 1,876 |
Table 3. Comparison of the Resource Utilization between the Proposed DRAGON2 and the Baseline DRAGON from Reference [3]
8.2 Performance and Power Efficiency Gains
Figure 12 depicts a comparison between the double-precision performance and power efficiency of the proposed enhanced DRAGON2 and the baseline version. The comparison includes multiple problem sizes and numbers of iterations for the 2D Laplace and Jacobi benchmarks. An optimized implementation of both benchmarks on the Core i9-9900K (using OpenMP and AVX256 instructions) is borrowed from Reference [3] to serve as a reference. The introduced enhancements allowed us to achieve a better clock speed (276 MHz vs. 130 MHz), which increased the sustained performance by around 100%, as depicted by the left side of Figure 12. Despite the higher clock speed, the power efficiency actually improved (nearly 3 times compared to the baseline DRAGON), as shown by the right side of Figure 12. This is mostly a result of the reduction of the AXI data bus width from 1,024 bits to just 256 bits, which reduced the related logic and wiring across all SLRs.
Fig. 12. Double-precision performance and power-efficiency of the proposed DRAGON2 as compared to the baseline DRAGON in Reference [3].
9 COMPARISON BETWEEN THE PROPOSED DRAGON2-CB AND DRAGON2 OVERLAYS
This section introduces the compact buffering scheme and discusses its impact on the enhanced DRAGON2-CB. The experimental comparison between DRAGON2 and DRAGON2-CB is again based on the same overlay size (144 PEs) with the same interconnect degree and topology (2D Mesh).
9.1 Impact of Compact Buffering on Resource Utilization
Compact buffering provides a significant reduction in the number of BRAMs required to implement the communication buffers. In this scheme, data incoming from the four neighbors is stored in four different partitions of a single BRAM, as compared to four separate BRAMs in the non-compact version. Two extra BRAMs are used by each PE to implement the Register File component. Therefore, the total number of BRAMs in a PE drops from six to just three in the compact buffering scheme, which cuts the BRAM utilization in the overall DRAGON2-CB overlay by half, as reported in Table 4. This reduction also leads to a slight improvement in the utilization of registers (flip-flops), because fewer buffer outputs are registered before being passed to the next pipeline stage (Decode/Execute1), as can be seen in Figures 3 and 6. Besides, while the multiplexing logic in the Execute1 pipeline stage of the PE is simplified, the number of LUTs remains nearly the same, mostly because the equivalent multiplexing logic is moved to the upstream side of the communication buffers.
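A minimal sketch of the partitioned addressing, assuming the partition is selected by the upper address bits (the depth and names are illustrative, not the DRAGON2-CB RTL):

```c
/* Four directional sub-buffers share one BRAM: the direction selects the
 * partition base and the slot selects the word inside it. */
enum dir { NORTH = 0, SOUTH = 1, EAST = 2, WEST = 3 };

unsigned buf_addr(enum dir d, unsigned slot, unsigned part_depth) {
    return (unsigned)d * part_depth + slot;  /* partition base + offset */
}
```

With a power-of-two partition depth, this reduces in hardware to concatenating the two direction bits above the slot address, so the partition select costs no arithmetic.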
| | PE | | BC (16 PEs) | | OVERLAY (9 BCs) | | OVERLAY+FPGA shell | |
|---|---|---|---|---|---|---|---|---|
| Buffering type: | DRAGON2 | DRAGON2-CB | DRAGON2 | DRAGON2-CB | DRAGON2 | DRAGON2-CB | DRAGON2 | DRAGON2-CB |
| LUT | 2,231 | 2,235 | 36,631 | 37,824 | 352,161 | 361,361 | 514,406 | 533,626 |
| LUTmem | 251 | 243 | 4,075 | 3,947 | 36,774 | 35,615 | 53,688 | 52,529 |
| REG | 3,020 | 2,787 | 55,449 | 51,714 | 546,802 | 514,158 | 737,745 | 732,245 |
| BRAM | 6 | 3 | 96 | 48 | 864 | 432 | 1,070 | 638 |
| URAM | 1 | 1 | 32 | 32 | 304 | 304 | 304 | 304 |
| DSP | 13 | 13 | 208 | 208 | 1,872 | 1,872 | 1,876 | 1,876 |
Table 4. Resource Utilization of the DRAGON2-CB Overlay That Adopts the Proposed Compact Buffering Scheme as Compared to the DRAGON2 Overlay That Uses the Regular Buffering Scheme
9.2 Impact of Compact Buffering on Performance and Power Efficiency
The DRAGON2-CB overlay suffers a small drop in performance (less than 1% for the largest data size and the largest number of iterations) compared to DRAGON2, for both the 2D Laplace and Jacobi benchmarks, as depicted by the left side of Figure 13. Here, the compact buffering uses a single BRAM to store incoming data from every neighbor direction of each PE. Therefore, when a PE needs to send the corner-point data of its local stencil tile to two different neighbors (for example, North and West), it has to split the transfer into two clock cycles, as opposed to the single-clock-cycle transfer that is possible when a separate BRAM is associated with each adjacent PE in the regular buffering scheme. Nonetheless, the number of corner points in a 2D tile is always equal to four (see the example in Figure 21); therefore, the performance gap shrinks further as the stencil tile in each PE grows larger, as depicted by the left side of Figure 13. In contrast, the power efficiency improved with compact buffering, as depicted by the right side of Figure 13. This is mostly thanks to the reduction in the number of BRAMs that implement the communication buffers in every PE from four to just one (the overall FPGA BRAM utilization is reduced from 1,070 to 638, as reported in Table 4).
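The shrinking gap can be modeled roughly: with a single write port, each of the four 2D tile corners costs one extra cycle per stencil iteration, so the relative overhead decays with the tile size (the cycle counts below are illustrative, not measured DRAGON2-CB latencies).

```c
/* Toy model of the compact-buffering penalty: the fraction of extra
 * cycles per iteration added by serializing the four corner transfers. */
double corner_overhead(int tile_points, int cycles_per_point) {
    const int extra = 4;  /* one serialized transfer per 2D tile corner */
    double base = (double)tile_points * cycles_per_point;
    return extra / (base + extra);
}
```

Under this model, a 32 \(\times\) 32 tile at four cycles per point loses well under 0.1% of its cycles, which is consistent with the sub-1% gap observed for the largest sizes.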
Fig. 13. Double-precision performance and power efficiency of the proposed DRAGON2-CB as compared to the DRAGON2 overlay.
10 A STUDY OF SCALABILITY
This section provides an in-depth experimental comparative study about resource utilization, performance, power efficiency, and EPR between DRAGON2 and DRAGON2-CB.
10.1 The Impact on Operating Clock Frequency
Since the compact buffering scheme moves most of the multiplexing logic from the downstream to the upstream side of the communication buffers, we examined the subsequent impact on the overlay operating frequency. Figure 14 depicts the operating frequency of DRAGON2 and DRAGON2-CB (for different numbers of PEs and interconnect dimensions) and suggests that designs with compact buffering (DRAGON2-CB) tend to operate at frequencies comparable to those that do not profit from this technique (DRAGON2). The slight differences in clock speed are due to the nondeterministic behavior of the Xilinx FPGA implementation tools. While these reported frequencies are obtained after bitstream generation for each design, we noticed that the Xilinx FPGA runtime tools further adjust each frequency down to the greatest multiple of 5 MHz that is less than or equal to the reported bitstream operating frequency. For example, a bitstream with a 300-MHz clock operates on the FPGA at that same speed, while a bitstream with a 294-MHz clock is adjusted down to 290 MHz during runtime, as we found in the dynamic profiling reports.
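This runtime adjustment is a simple floor to a multiple of 5 MHz:

```c
/* Floors a bitstream clock to the greatest multiple of 5 MHz that does
 * not exceed it, matching the runtime behavior described above. */
int runtime_freq_mhz(int bitstream_freq_mhz) {
    return (bitstream_freq_mhz / 5) * 5;
}
```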
Fig. 14. Achieved clock speed of DRAGON2 and DRAGON2-CB for 2D, 3D, and 4D interconnects with varied overlay size configurations.
10.2 Area (Hardware Resource Utilization) and Scalability
An ideal resource utilization for any design would maintain an equal ratio (percentage) across all of the available resources. Any hardware resource that is relatively over-utilized shows a high utilization ratio and consequently becomes a bottleneck for scalability. In the case of DRAGON (and DRAGON2), this bottleneck is BRAM. In fact, the communication buffers that exchange data between PEs occupy four BRAMs per PE in 2D, six in 3D, and eight in 4D interconnects. This bottleneck is visible as the steeper slope in the utilization ratio of BRAMs (with regular buffering) shown in Figure 15 (this ratio is based on the user-available resources after subtracting the FPGA shell resources). The lack of sufficient BRAM resources limits the number of PEs that can be deployed in 3D/4D designs without compact buffering. The 2D version requires exactly the number of BRAMs available to fit 288 PEs; however, enabling FPGA profiling claims some of these resources. We therefore omitted a 2D regular-buffering implementation with 288 PEs, because otherwise we could not extract performance and power reports and thus could not include it in our experimental evaluation.
Fig. 15. Percentage of resource utilization of the proposed DRAGON2-CB (with COMPACT buffering) as compared to DRAGON2 (with REGULAR buffering), for multiple overlay size configurations and with varied dimensions (2D, 3D, and 4D) of the Mesh interconnect.
However, the compact buffering scheme overcomes this bottleneck and enhances the overlay scalability, as can be seen on the left side of Figure 15. In fact, the scalability enhancement is not limited to the overlay size; it also extends to higher interconnect dimensions, where BRAM utilization grows even higher. The absolute resource utilization numbers are illustrated in Figure 16. The linear growth trend across all hardware resources again demonstrates the better scalability brought by the compact buffering scheme. The utilization of each resource scales with the overlay size in terms of the total number of PEs; a steeper slope indicates that the underlying resource would become a bottleneck for size scaling sooner than the other kinds of hardware resources.
Fig. 16. Resource utilization of the proposed DRAGON2-CB (with COMPACT buffering) as compared to DRAGON2 (with REGULAR buffering), for multiple overlay size configurations and with varied dimensions (2D, 3D, and 4D) of the Mesh interconnect.
10.3 Performance, EPR, and Scalability
Table 5 summarizes the configuration parameters for BCs and PEs in each dimension of interconnect, as well as the total problem size (number of stencil points) that fits in every configuration. The total size is obtained by multiplying the tile size in each dimension by the corresponding number of PEs in the same dimension (given by the PE configuration). All stencil benchmarks are run for 10,000 iterations, and the tile size in each PE is fixed for each benchmark dimension. The total problem size is increased proportionally with the number of PEs to analyze the system scalability.
Table 5. Overlay Configurations and Related Stencil Sizes Used in the Scalability Analysis
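As a quick illustration of how the total sizes in Table 5 are derived, the sketch below multiplies the per-PE tile size by the PE count along each dimension. The 8 × 8 × 12 tile is the 3D tile size quoted in Section 10.6; the 4 × 4 × 2 PE grid is a hypothetical example, not a configuration taken from Table 5.

```python
from math import prod

def total_stencil_points(tile, pes):
    """Total problem size: per-dimension tile size multiplied by the
    number of PEs along the same dimension, over all dimensions."""
    assert len(tile) == len(pes)
    return prod(t * p for t, p in zip(tile, pes))

# 3D tile of 8 x 8 x 12 points per PE (Section 10.6) on a hypothetical
# 4 x 4 x 2 grid of PEs: (8*4) * (8*4) * (12*2) points in total.
print(total_stencil_points((8, 8, 12), (4, 4, 2)))  # 24576
```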
The performance grows almost linearly with the size of the overlay, both for the Laplace and Jacobi benchmarks in their 2D, 3D, and 4D versions, as depicted by Figures 17 and 18. The minor deviation from linearity is due to the change in runtime clock frequency depending on the overlay size, as depicted by Figure 14. The primary goal of the proposed compact buffering technique is to reduce BRAM resource utilization, to simplify the PE implementation, and to allow a better scalability of the overall overlay. Nonetheless, an equally important goal is to keep the same, or at least a similar, performance level as the regular buffering scheme. Our experiments demonstrate that the EPR was maintained almost constant (barely noticeable variation within precision range beyond the two decimal digits reported in Figures 17 and 18). The steady EPR means that scaling compute resources (along with the problem size) maintains the computational efficiency of the overlay. Overall, the compact-buffering DRAGON2-CB has a slightly lower EPR than its counterpart because of the extra clock cycles required to scatter a tile's corner data into two different directions at each iteration of the stencil computation, due to the single write port on each communication buffer. Nonetheless, compact buffering requires significantly fewer BRAMs than the regular buffering scheme; thus, it allows deploying a larger number of PEs, which in turn leads to a significant increase in the overall performance. However, this depends on the clock frequency, which may decrease significantly. For example, the required amount of SLLs in a 4D interconnect with 288 PEs limits the clock speed to just 200 MHz (achieved after 25 hours of routing and a total implementation time of 30 hours), which negatively impacted the performance, as can be seen in Figures 17 and 18.
Fig. 17. Double-precision floating-point performance scalability and Effective to peak Performance Ratio, using 2D, 3D, and 4D Jacobi benchmarks. A side-by-side comparison between DRAGON2-CB (COMPACT buffering) and DRAGON2 (REGULAR buffering).
Fig. 18. Double-precision floating-point performance scalability and Effective to peak Performance Ratio, using 2D, 3D, and 4D Laplace benchmarks. A side-by-side comparison between DRAGON2-CB (COMPACT buffering) and DRAGON2 (REGULAR buffering).
10.4 Power and Scalability
The power efficiency is obtained by dividing the performance of the overlay by its power consumption (which can be found in the Vitis dynamic profiling reports). Thanks to the FPGA-layout-aware placement and interconnection, the 2D and 3D DRAGON2 and DRAGON2-CB report nearly similar power efficiencies, with a slight advantage for the 2D version. In fact, both designs exhibit close ranges of performance and power consumption because they use similar amounts of SLLs across SLR boundaries; therefore, the reduced power efficiency in the 3D version is only due to the added long wires between remote PEs that belong to different BCs within the same SLRs, as illustrated in Figure 9. Besides, while the 4D overlay preserves these remote connections, it also adds long wires between remote PEs across the fourth dimension, which crosses SLR boundaries and therefore makes heavy use of SLLs. Consequently, this increases the overall power consumption. Furthermore, it reduces the system performance by limiting the operating clock speed, as a result of the increased difficulty of meeting timing requirements during the implementation phase. Nevertheless, Figure 19 shows that large 4D overlays (more than 192 PEs) tend to stagnate around 3.5 GFLOPS/W, while overlays with 2D or 3D interconnects appear able to scale further, beyond 4.2 GFLOPS/W. As expected, the compact-buffering DRAGON2-CB is more power-efficient than DRAGON2 (its regular-buffering counterpart). The gap in power efficiency increases with the increase in size and/or interconnect dimension, thanks to the reduced BRAM utilization, which lowers the overall power consumption. Note that these overlays were not optimized for power reduction. Specialized power-oriented optimizations may further increase their power efficiency.
Nevertheless, the current power efficiency is still on par with, or even higher than, that of related works such as Reference [40], which reports 3.33 GFLOPS/W (single-precision) for the 3D Jacobi, whereas DRAGON2-CB achieves 4.24 GFLOPS/W for the same benchmark despite using the double-precision format.
Fig. 19. Impact of compact buffering on power efficiency with 2D, 3D, and 4D Laplace and Jacobi benchmarks.
10.5 Bandwidth and Scalability
Figure 20 illustrates the average bandwidth of each single HBM2 bank as well as the combined total bandwidth for read and write operations, depending on how many BCs are in the overlay. The high number of HBM2 banks provides a linear aspect to the overall bandwidth, which further enhances the scalability of the overlay. The HBM2 memory access is explicitly managed by the Controller; thus, the buffering scheme and the interconnect degree have no impact on it. Therefore, bandwidth dynamic profiling and related experimental data collection have been enabled only on the 2D compact-buffering overlays, to avoid the timing overhead of the extra logic resource insertion on higher-dimensional interconnects. All accesses use bursts of 32-Byte beats (256-bit AXI data bus) with 128 beats at most (4,096 Bytes per burst). Accesses adopt a peer-to-peer model (data residing in an HBM2 bank is only accessed through that bank), as each BC is connected to a unique HBM2 bank. The Vitis RTL kernel flow inserts an HBM Memory Sub-System that connects the overlay AXI interface to the hardened AXI switches near the HBM2 banks. This extra layer uses FIFOs to ensure clock-domain-crossing adaptation from the HBM2 clock operating at 450 MHz to the overlay clock operating at less than 300 MHz, depending on the configuration. Consequently, this introduces extra latency for read (638 and 684 ns for the reported minimum and maximum, respectively) and write (546 and 583 ns for the reported minimum and maximum, respectively) operations. While the obtained bandwidths are on par with those reported in Reference [47] for the same burst size, they can still be improved by using a larger burst size (64 Bytes, for example) or a 512-bit AXI interface, in which case more SLLs would be used, which may hinder the overlay scalability. Besides, the external HBM2 bandwidth has little impact on performance, because the data are loaded and stored back only at the start and end of the computation.
A large number of stencil iterations hides the initial and final HBM2 access overhead, as can be seen in Figure 12, where a minimum of 100 iterations helps to achieve around 80% or more of the maximum sustained performance.
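The maximum burst size follows directly from the AXI bus width and the beat count; a one-line check:

```python
def max_burst_bytes(axi_bits, beats):
    """Bytes moved by one AXI burst: bus width in bytes times beat count."""
    return (axi_bits // 8) * beats

# 256-bit AXI data bus -> 32-Byte beats; at most 128 beats per burst.
print(max_burst_bytes(256, 128))  # 4096 Bytes, matching the text
```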
Fig. 20. HBM2 bandwidth for read and write operations.
10.6 Impact on Executable Code Size
Table 6 shows a small gap between the sizes of the executables generated from programs targeting DRAGON2 and DRAGON2-CB. As expected, this size gap grows alongside the increase in interconnect dimension, because of the added neighbor-communication instructions that result from the addition of new remote neighboring PEs. Nonetheless, for a given dimension, the difference between DRAGON2 and DRAGON2-CB remains constant, as depicted by Table 6. These constant gaps in code size come from the extra VLIW instructions that handle the scattering of the corner points in each stencil tile local to each PE. These points have to be scattered to the West or East PE neighbor in extra clock cycles when the same data has to be scattered toward the North or South PE as well (lines 29 to 32 in Listing 3 show these added VLIW instructions in a 2D Laplace benchmark). Figure 21 depicts an example of these corner points in 2D and 3D tiles. The extra point transmission is necessary due to the presence of a single write port on each BRAM in compact buffering, which constrains writing data into a single sub-buffer at each clock cycle. Besides, a single VLIW instruction consists of two 64-bit slots; both slots, packed together, total 128 bits or 16 Bytes. Since four corner points require four extra VLIW instructions, the origin of the 64-Byte gap in the 2D benchmarks becomes obvious. However, the four corner points in every 2D plane are replicated across the third dimension in the case of a 3D tile (in 3D benchmarks), as depicted by Figure 21. Since the 3D tile is 8 \(\times\) 8 \(\times\) 12, the 64 Bytes are replicated 12 times, and hence the gap becomes 768 Bytes. Similarly, 4D benchmarks require an extra 1,920 Bytes, which results from multiplying 64 by the sizes of the added third and fourth dimensions (3 and 10, respectively, since the 4D stencil tile is 4 \(\times\) 4 \(\times\) 3 \(\times\) 10).
Note that the program size remains unchanged when increasing compute resources (more BCs) and only depends on the tile size inside a single PE.
Fig. 21. The 2D (left) and 3D (right) stencil tiles inside a local memory of a PE and their corner points that require extra clock cycles to be exchanged with adjacent PEs.
Table 6. Cost of Compact Buffering on Size (in Bytes) of the Generated Binary Code for 2D, 3D, and 4D Jacobi and Laplace Benchmarks
10.7 A Costs and Benefits Model for N-Dimensional Interconnects
A stencil shape can be 1D, 2D, 3D, 4D, or even of a higher dimension [41]. Table 7 summarizes the costs and benefits of the proposed compact buffering technique for 2D, 3D, and 4D interconnects, and then it extrapolates the outcomes to beyond 4D, hence generalizing to \(N\)D (\(N\)-dimensional) interconnects. This generalization can be explained by carefully analyzing the local grid partition (stencil tiles) in 2D and 3D and their corner points, which are depicted by Figure 21.
Table 7. Number of Required BRAMS in Each Interconnect Dimension and the Related Clock Cycle and Code Size Overhead
In fact, in a 2D stencil computation, the tile is decomposed as an \(a_{1} \times a_{2}\) plane in which the number of corner points is equal to 4. These points require extra clock cycles to be sent to neighbors, because they need to be scattered toward two different directions, whereas the scatter operation is limited to a single direction because of the unique write port of the compact buffer. Furthermore, the number of clock cycles required to scatter these corner points in a 3D computation (a 3D Mesh interconnect is required) is multiplied by the width of the extra dimension, that is, \(a_{3}\) times, provided that an extra BRAM is added (as required by the proposed technique) to handle the exchange of the extra halo regions (halo regions become 2D planes and not just lines) across the added dimension. Hence, every added dimension further replicates the original number of 2D-space corner points, which leads to a general model for the execution clock cycle cost (for every stencil compute iteration) defined by \(4\times \prod _{i=3}^{N} a_{i}\), where \(N\) is the dimension of the stencil benchmark (and the interconnect). Besides, since every corner point to be scattered requires one VLIW instruction (16 Bytes wide), every four corner points in each 2D plane require an extra 64 Bytes in the binary program. Thus, the generalization is valid for \(N\)D benchmarks (mapped to \(N\)D overlays), and the cost in binary program size is \(64\times \prod _{i=3}^{N} a_{i}\).
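The two cost formulas can be checked against the figures reported in Section 10.6: the 64-, 768-, and 1,920-Byte code-size gaps for the 8 × 8, 8 × 8 × 12, and 4 × 4 × 3 × 10 tiles, respectively.

```python
from math import prod

def corner_scatter_cycles(tile):
    """Extra clock cycles per stencil iteration to scatter the four 2D
    corner points, replicated across every dimension beyond the second:
    4 * a_3 * ... * a_N (the product over an empty range is 1 in 2D)."""
    return 4 * prod(tile[2:])

def code_size_overhead_bytes(tile):
    """Extra binary size: one 16-Byte VLIW instruction per corner point,
    i.e., 64 * a_3 * ... * a_N Bytes."""
    return 16 * corner_scatter_cycles(tile)

print(code_size_overhead_bytes((8, 8)))         # 2D tile -> 64 Bytes
print(code_size_overhead_bytes((8, 8, 12)))     # 3D tile -> 768 Bytes
print(code_size_overhead_bytes((4, 4, 3, 10)))  # 4D tile -> 1920 Bytes
```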
11 COMPARISON WITH RELATED WORKS
While our current work targets double-precision computations, we also implemented a single-precision variant of DRAGON2-CB for a fair comparison against other works. This variant was implemented by using a SIMD-packed dual-thread path with two 32-bit FPUs and ALUs instead of the single 64-bit FPU and ALU. The rest of the overlay architecture remains unchanged. With 288 single-precision PEs, the achieved clock speeds are 275, 270, and 240 MHz with the 2D, 3D, and 4D interconnect, respectively. Tables 8 and 9 summarize some of the most relevant works and report performance, power efficiency, and EPR figures for double- and single-precision implementations, respectively, for 2D, 3D, and 4D stencil computations. The reported results correspond to the DRAGON2-CB version. While some of these works are implemented on multiple FPGAs (or a Central Processing Unit (CPU)), we only considered the results obtained on a single chip, aiming for a fair comparison. For example, the work in Reference [43] proposes a multi-FPGA domain-specific programmable streaming architecture that targets Jacobi calculations; however, Tables 8 and 9 report only the single-FPGA results. The PE proposed in Reference [43] contains a constant memory that stores stencil coefficients and an addressable buffer memory that stores input data, acts as a cyclic buffer for the current iteration's calculation, and serves as a communication buffer that scatters/gathers data to/from neighboring PEs. The PE datapath is split into eight pipeline stages, of which five are dedicated to execution through the FMAC unit. This FMAC motivated our MAC FPU architecture. The authors require two independent accumulations to fully use their FMAC. In contrast, our MAC FPU is split into four stages (to meet timing requirements for 64-bit precision) and thus requires four independent accumulations for full usage.
| Benchmark | Metric | [45] | [45] | [46] | [20] | Ours\(^{\mathrm{OMP}}\) | Ours |
|---|---|---|---|---|---|---|---|
| | Year | 2016 | 2016 | 2019 | 2012 | 2022 | 2022 |
| | Type | FPGA | FPGA | FPGA | GPU | CPU | FPGA |
| | Device | DE5 | 395-D8 | Nallatech385 | GTX580 | Core i9 9900K | Alveo U280 |
| 2D Jacobi | Perf. [GFLOPS] | 27.3 | 40.9 | 104 | 49.5 | 66.87 | 139.72 |
| | P.Eff [GFLOPS/W] | — | — | — | — | 0.7 | 4.33 |
| | EPR [%] | — | — | — | — | 29.02 | 89.84 |
| 2D Laplace | Perf. [GFLOPS] | — | — | 115 | — | 50.91 | 135.79 |
| | P.Eff [GFLOPS/W] | — | — | — | — | 0.53 | 4.2 |
| | EPR [%] | — | — | — | — | 22.09 | 87.31 |
| 3D Jacobi | Perf. [GFLOPS] | 27.2 | 40.7 | 74 | 50 | 43.4 | 145.62 |
| | P.Eff [GFLOPS/W] | — | — | — | — | 0.45 | 4.24 |
| | EPR [%] | — | — | — | — | 18.83 | 91.93 |
| 4D Jacobi | Perf. [GFLOPS] | — | — | — | — | 50.2 | 105.77 |
| | P.Eff [GFLOPS/W] | — | — | — | — | 0.52 | 3.6 |
| | EPR [%] | — | — | — | — | 21.78 | 91.81 |
Table 8. Comparison of the Double-Precision Sustained Performance, Power Efficiency, and the EPR with Other Related Works
| Benchmark | Metric | [43] | [10] | [45] | [45] | [11] | [46] | [40] | [45] | [41] | [45] | [52] | [52] | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Year | 2014 | 2015 | 2016 | 2016 | 2018 | 2019 | 2021 | 2016 | 2018 | 2016 | 2018 | 2018 | 2022 |
| | Type | FPGA | FPGA | FPGA | FPGA | FPGA | FPGA | FPGA | CPU | CPU | GPU | GPU | GPU | FPGA |
| | Device | EP3SL150 | XC7VX485T | DE5 | 395-D8 | ADM-PCIE-KU3 | Nallatech385 | XC7VX485T | Core i7-4960X | Xeon E5-2630 | GTX960 | Tesla P100 | Tesla V100 | Alveo U280 |
| 2D Jacobi | Perf. [GFLOPS] | 34 | 23.6 | 133.3 | 237.8 | 90.04 | 763 | 160.81 | 65.9 | — | 164.1 | — | — | 284.62 |
| | P.Eff [GFLOPS/W] | 0.8 | — | — | — | — | 15.1 | 7.1 | — | — | — | — | — | 8.82 |
| | EPR [%] | 87.4 | — | 68 | 15.8 | — | 55.8 | — | 38.1\(^{\mathrm{C}}\) | — | 7.1 | — | — | 89.84 |
| 2D Laplace | Perf. [GFLOPS] | — | — | 181.9 | 175.7 | — | 659 | — | 31.7 | — | 73.8 | — | — | 276.61 |
| | P.Eff [GFLOPS/W] | — | — | — | — | — | 13.1 | — | — | — | — | — | — | 8.77 |
| | EPR [%] | — | — | 92.8 | 11.7 | — | 48.2 | — | 18.3\(^{\mathrm{C}}\) | — | 3.2 | — | — | 87.31 |
| 3D Jacobi | Perf. [GFLOPS] | 31.9 | 2.63 | 111.3 | 193.3 | 83.98 | 628 | 66.06 | — | — | — | 1205.3 | 2111 | 285.96 |
| | P.Eff [GFLOPS/W] | 0.71 | — | — | — | — | 11 | 3.33 | — | — | — | 6.4 | 8.1 | 8.5 |
| | EPR [%] | 83.9 | — | 59.3 | 12.8 | — | 45.9 | — | — | — | — | 12.9\(^{\mathrm{P}}\) | 13.4\(^{\mathrm{V}}\) | 91.93 |
| 4D Jacobi | Perf. [GFLOPS] | — | — | — | — | — | — | — | — | 70.65 | — | — | — | 253.85 |
| | P.Eff [GFLOPS/W] | — | — | — | — | — | — | — | — | 0.74 | — | — | — | 7.79 |
| | EPR [%] | — | — | — | — | — | — | — | — | 64.22\(^{\mathrm{X}}\) | — | — | — | 91.81 |
\(^{\mathrm{C}}\)We recalculated and updated the EPR value given in Reference [45] based on the official Intel data [23] that provides the peak performance of the Core i7-4960X. \(^{\mathrm{X}}\)EPR is computed based on the official Intel data [24] that provides the peak performance of the Xeon E5-2630. \(^{\mathrm{P}}\)EPR is computed based on the official Nvidia data [36] that provides the peak performance of the Tesla P100 PCI-E. \(^{\mathrm{V}}\)EPR is computed based on the official Nvidia data [37] that provides the peak performance of the Tesla V100 SXM2.
Table 9. Comparison of the Single-Precision Sustained Performance, Power Efficiency, and the EPR with Other Related Works
The authors of Reference [10] proposed a streaming-based approach to accelerate iterative stencil calculations. The proposed method consists of a queue of computational blocks, named Streaming Stencil Timesteps (SSTs), each of which performs a single Iterative Stencil Loop (ISL) timestep. The described method achieves an almost linear speedup when increasing the number of SSTs from 1 to 8 in the 3D Jacobi benchmark and from 1 to 48 in the 2D Jacobi benchmark. Nonetheless, the number of deployable SSTs appears to be constrained by the dimension of the benchmark, which limits its scalability (and its performance). In the case of a 3D Jacobi computation, the resource utilization reported in Reference [10] demonstrates that BlockRAM is the most consumed resource and thus can be a bottleneck for scalability. In contrast, the compact buffering scheme in DRAGON2-CB uses BlockRAMs efficiently, and both the 2D and 3D Jacobi benchmarks exhibit similar performance trends, while the 3D version achieves a slightly better EPR because of the increased arithmetic intensity that leads to a more efficient use of the MAC FPU.
The work in Reference [45] proposed an OpenCL-based streaming architecture for accelerating stencil computations. This architecture contains multiple Pipelined Computing Modules (PCMs) containing, in turn, multiple PEs that are coupled with shift registers, providing them with input data in a chained manner. The number of stencil iterations computed in parallel is equal to the number of PCMs. The Dynamic Random Access Memory (DRAM) is equivalent to our GM and feeds input data into the first PCM only. The other PCMs are cascaded one after another, and only the last PCM writes the result of the computation back to the DRAM. This implementation achieved a performance reaching at best 237.8 GFLOP/s (single-precision) for a 2D Jacobi benchmark, as well as a peak EPR of 92.8% for a 2D Laplace benchmark (single-precision). These results dramatically decreased for computations in double-precision, as shown in Table 8. In fact, the authors of Reference [45] state that multiplications are implemented using hardened DSP multipliers while additions are implemented using Adaptive Logic Modules (ALMs), which become a bottleneck due to the increased complexity of double-precision, which requires significantly more ALMs. Furthermore, the Stratix V FPGAs on the DE5 and 395-D8 boards [22] provide different numbers of DSPs with variable-precision multipliers that support up to 27 \(\times\) 27 multiplications (256 and 1,963 for the DE5 and 395-D8 boards, respectively). Thus, a single DSP can implement a full multiplication of two single-precision mantissas (23 \(\times\) 23 bits), while a double-precision mantissa multiplication (52 \(\times\) 52 bits) would require multiple DSPs (possibly four) or multiple clock cycles, which may explain the decrease in double-precision performance.
A seemingly more recent update of Reference [45] is presented in Reference [46], which further explores the baseline architecture and proposes some algorithmic enhancements. This work has been implemented on a larger, newer-generation FPGA (Arria 10 GX1150 [21]). The authors make use of the newly available hardened single-precision floating-point multipliers/adders, which provide a peak of 1,366 GFLOP/s. This led to a jump in sustained performance to 763 GFLOP/s (single-precision) for a 2D Jacobi benchmark. Nonetheless, the lack of dedicated double-precision DSPs in this FPGA has kept the double-precision performance relatively low, at 104 GFLOP/s for the same benchmark.
The work in Reference [11] proposes an automated framework, code-named SODA, for implementing stencil computing models on FPGAs. SODA takes as input high-level descriptions of the stencil computation parameters and generates specific, optimized dataflow-based architectures. SODA generates an HLS-based C++ kernel on the FPGA side and an OpenCL API on the host side. Our system-level programming model (OpenCL host + FPGA RTL kernel) is similar to SODA's, with the difference that our kernel is fully implemented in a Hardware Description Language (HDL) (SystemVerilog) and painstakingly optimized through manual floorplanning and guided placement, as well as lengthy iterations of experimenting with multiple implementation strategies. DRAGON2 and DRAGON2-CB are also more flexible, because they can accept new program instructions (for different computation models) from the OpenCL host, which removes the need for bitstream regeneration.
The work in Reference [40] proposes a library of customizable HDL components for the efficient implementation of stencil computations on an FPGA, while also providing a multi-FPGA proof of concept. The proposed stencil accelerator design consists of serially chained computational blocks (named SSTs). One drawback is that increasing the number of stencil iterations appears to increase resource utilization. In contrast, DRAGON2 allows changing the number of iterations of a given stencil computation while maintaining the same resource utilization, because the hardware is statically configured while the loop count limit, as well as the overall behavior, can be dynamically changed through software instructions. Besides, the work in Reference [40] provides performance normalized against a multi-threaded CPU. To extract absolute performance values, we combined the normalized performance values given in this work with the absolute performance given in Reference [43], and we report the results for the single-FPGA implementation in Table 9.
The work in Reference [41] proposed a DSL that translates high-level stencil abstractions into high-performance optimized stencil code for distributed memory architectures. This work examined the performance of such generated code for multi-dimension stencil benchmarks. For a 4D Jacobi computation, the single-node CPU performance is approximately 7.065 Giga stencil/s (extracted from a graph). The authors report that their update of each stencil point uses 2 multiplications and 8 additions (single-precision), which translates to the performance expressed in GFLOP/s reported in Table 9.
DRAGON2-CB outperforms older-generation Graphics Processing Units (GPUs) such as the GTX580 and GTX960; the latter has a peak double-precision performance of just 72.12 GFLOPS (double-precision cores run at 1/32 the rate of single-precision cores, and therefore the single-precision peak of 2,308.1 GFLOPS [45] has to be divided by 32). However, the number of cores in more recent GPUs keeps growing at a steady rate, which has led to unprecedented levels of performance for both double- and single-precision. The work in Reference [52] reports the single-precision performance and power efficiency for 3D Jacobi on Tesla P100 and V100 GPUs. The peak double-precision performance of these two GPUs is nearly 50% of the single-precision peak. Assuming a perfect scenario where the same EPR could be maintained in double-precision, the reported performance (and power efficiency) for the P100 and V100 GPUs would be halved, reaching 602.65 GFLOPS and 1,055.5 GFLOPS (3.2 GFLOPS/W and 4.05 GFLOPS/W), respectively. Even then, both GPUs may provide considerably higher performance than DRAGON2-CB, while the latter can achieve better power efficiency.
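The 1/32-rate and halving arguments above reduce to a few lines of arithmetic (figures taken from the text and from References [45, 52]):

```python
# Checking the GPU peak-performance arithmetic quoted in the text.
gtx960_sp_peak = 2308.1  # GFLOPS, single-precision peak from Reference [45]
gtx960_dp_peak = gtx960_sp_peak / 32  # DP cores run at 1/32 the SP rate
print(round(gtx960_dp_peak, 1))  # 72.1

# Tesla P100 / V100: DP peak is ~half the SP peak, so under an unchanged
# EPR the single-precision figures from Reference [52] roughly halve.
for sp_perf, sp_peff in ((1205.3, 6.4), (2111, 8.1)):
    print(sp_perf / 2, round(sp_peff / 2, 2))
```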
12 CONCLUSION
In this article, we present multiple architecture- and technology-related enhancements to the baseline DRAGON overlay, including a novel compact buffering technique that leads to around 2\(\times\) and 3\(\times\) improvements in performance and power efficiency, respectively, as compared to the baseline DRAGON overlay with a similar number of PEs and interconnect dimension. The current implementation targets an HBM2-enabled multi-die FPGA (Alveo U280) where all the 32 memory banks reside on a single SLR region. Therefore, we propose an implementation model with a balanced PE distribution as well as efficient usage of the scarce inter-die wires called SLLs. Furthermore, we propose a mathematical formulation for computing the maximum possible AXI data bus width for a given (rows, columns) combination of BCs with respect to the number of SLR regions, the amount of available SLLs, and the dimension of the interconnect. Using this formulation, we were able to implement 288 PEs connected through a 4D Mesh interconnect while using a 256-bit AXI interface, simply by switching the (rows \(\times\) columns) pair from (6 \(\times\) 3) to (3 \(\times\) 6). We implemented the enhanced overlay with and without the proposed compact buffering and conducted an extensive study of the impact on performance, power, area, code size, and scalability with 2D, 3D, and 4D Mesh interconnects. The in-depth evaluation is based on experimental results of 2D, 3D, and 4D Laplace and Jacobi stencil benchmarks. The enhanced DRAGON2-CB with the compact buffering scheme outperforms previous double-precision implementations and achieves 139.72, 145.62, and 105.77 GFLOPS in 2D, 3D, and 4D Jacobi computations, respectively. The corresponding EPRs remarkably reach 89.84%, 91.93%, and 91.81%, respectively.
The results show that BRAM utilization can be reduced to nearly 50%, which enhances the scalability and allows increasing compute resources along with the problem size, consequently achieving around 4\(\times\) improvements in performance and power efficiency over the baseline version. Ultimately, we provide a mathematical model of the costs and benefits of the compact buffering scheme for any interconnect dimension. This model quantifies the number of necessary BRAMs for a given \(N\)-dimensional interconnect (and stencil benchmark) and assesses the upper limits on both scalability and code size.
- [1] . 2018. Latency insensitive design styles for FPGAs. In Proceedings of the 28th International Conference on Field Programmable Logic and Applications (FPL’18). IEEE, Los Alamitos, CA, 360–3607. Google Scholar
Cross Ref
- [2] . 2020. Condensing an overload of parallel computing ingredients into a single architecture recipe. In Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors. IEEE, 25–28. Google Scholar
Cross Ref
- [3] . 2021. A highly-efficient and tightly-connected many-core overlay architecture. IEEE Access 9 (2021), 65277–65292. Google Scholar
Cross Ref
- [4] AXIprotocol. 2013. AMBA AXI and ACE Protocol Specification. Retrieved from https://developer.arm.com/documentation/ihi0022/e/AMBA-AXI3-and-AXI4-Protocol-Specification/Single-Interface-Requirements/Basic-read-and-write-transactions/Handshake-process?lang=en.Google Scholar
- [5] . 2019. MITRACA: A next-gen heterogeneous architecture. In Proceedings of the IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip. IEEE, 304–311. Google Scholar
Cross Ref
- [6] . 2019. MITRACA: Manycore interlinked torus reconfigurable accelerator architecture. In Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors, Vol. 2160-052X. IEEE, 38. Google Scholar
Cross Ref
- [7] . 2012. Design, integration and implementation of the DySER hardware accelerator into OpenSPARC. In Proceedings of the IEEE International Symposium on High-Performance Comp Architecture. IEEE, 1–12. Google Scholar
Digital Library
- [8] . 2011. Towards synthesis-free JIT compilation to commodity FPGAs. In Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines. IEEE, 202–205. Google Scholar
Digital Library
- [9] . 2013. A high-performance overlay architecture for pipelined execution of data flow graphs. In Proceedings of the International Conference on Field programmable Logic and Applications. IEEE, 1–8. Google Scholar
Cross Ref
- [10] . 2015. On how to accelerate iterative stencil loops: A scalable streaming-based approach. ACM Trans. Archit. Code Optim. 12, 4, Article
53 (December 2015), 26 pages. Google ScholarDigital Library
- [11] . 2018. SODA: Stencil with optimized dataflow architecture. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’18). Association for Computing Machinery, New York, NY. Google Scholar
Digital Library
- [12] . 2010. Intermediate fabrics: Virtual architectures for circuit portability and fast placement and routing. In Proceedings of the IEEE/ACM/ IFIP International Conference on Hardware/Software Codesign and System Synthesis. ACM, 13–22. Google Scholar
Digital Library
- [13] . 2015. Adjustable-cost overlays for runtime compilation. In Proceedings of the IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, New York, NY, 21–24. Google Scholar
Digital Library
- [14] . 2020. FPGA-based computational fluid dynamics simulation architecture via high-level synthesis design method. In Applied Reconfigurable Computing. Architectures, Tools, and Applications, et al. (Ed.). Springer International Publishing, 232–246.Google Scholar
- [15] . 2012. A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 47–56. Google Scholar
Digital Library
- [16] . 2016. GRVI phalanx: A massively parallel RISC-V FPGA accelerator accelerator. In Proceedings of the IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’16). IEEE, New York, NY, 17–20. Google Scholar
Cross Ref
- [17] . 2019. 2GRVI Phalanx: A 1332-Core RISC-V RV64I Processor Cluster Array with an HBM2 High Bandwidth Memory System, and an OpenCL-like Programming Model, in a Xilinx VU37P FPGA. WIP Report.Google Scholar
- [18] 2021. AutoBridge: Coupling coarse-grained floorplanning and pipelining for high-frequency HLS design on multi-die FPGAs. ACM, 81–92.
- [19] 2010. Design and implementation of MPSoC single chip with butterfly network. In Proceedings of the 18th IEEE/IFIP International Conference on VLSI and System-on-Chip. IEEE, New York, NY, 143–148.
- [20] 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the ACM International Conference on Supercomputing. ACM, 311–320.
- [21] Intel 2020. Intel Arria 10 product table. Intel. Retrieved December 23, 2021 from https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/pt/arria-10-product-table.pdf.
- [22] Intel 2020. Stratix V Device Overview. Intel. Retrieved December 23, 2021 from https://www.mouser.com/datasheet/2/612/stx5_51001-1099064.pdf.
- [23] Intel 2021. APP metrics for Intel microprocessors. Intel. Retrieved December 23, 2021 from https://www.intel.com/content/dam/support/us/en/documents/processors/APP-for-Intel-Core-Processors.pdf.
- [24] Intel 2021. APP metrics for Intel microprocessors. Intel. Retrieved December 23, 2021 from https://www.intel.com/content/dam/support/us/en/documents/processors/APP-for-Intel-Xeon-Processors.pdf.
- [25] 2015. Efficient overlay architecture based on DSP blocks. In Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines. IEEE, 25–28.
- [26] 2016. Throughput oriented FPGA overlays using DSP blocks. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition. IEEE, 1628–1633.
- [27] 2006. Packet switched vs. time multiplexed FPGA overlay networks. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines. IEEE, 205–216.
- [28] 2015. Hoplite: Building austere overlay NoCs for FPGAs. In Proceedings of the 25th International Conference on Field Programmable Logic and Applications (FPL’15). IEEE, New York, NY, 1–8.
- [29] 2017. Hoplite: A deflection-routed directional torus NoC for FPGAs. ACM Trans. Reconfig. Technol. Syst. 10, 2, Article 14 (March 2017), 24 pages.
- [30] 2011. Heracles: Fully synthesizable parameterized MIPS-based multicore system. In Proceedings of the International Conference on Field Programmable Logic and Applications. IEEE, 356–362.
- [31] 2020. Exploring the impact of switch arity on butterfly fat tree FPGA NoCs. In Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines. IEEE, 70–74.
- [32] 2019. Time-multiplexed FPGA overlay architectures: A survey. ACM Trans. Des. Autom. Electron. Syst. 24, 5, Article 54 (July 2019), 19 pages.
- [33] 2015. QuickDough: A rapid FPGA loop accelerator design framework using soft CGRA overlay. In Proceedings of the International Conference on Field Programmable Technology. IEEE, 56–63.
- [34] 1998. REMARC (Abstract): Reconfigurable multimedia array coprocessor. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 261.
- [35] 2019. Scalability analysis of deeply pipelined tsunami simulation with multiple FPGAs. IEICE Trans. Inf. Syst. E102.D (May 2019), 1029–1036.
- [36] 2016. NVIDIA Tesla P100 GPU accelerator. Retrieved December 23, 2021 from https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-p100/pdf/nvidia-tesla-p100-PCIe-datasheet.pdf.
- [37] 2020. NVIDIA V100 Tensor Core GPU. Retrieved December 23, 2021 from https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf.
- [38] 2022. Scalable communication for high-order stencil computations using CUDA-aware MPI. Parallel Computing 111 (2022), 102904.
- [39] 2020. A survey on coarse-grained reconfigurable architectures from a performance perspective. IEEE Access 8 (2020), 146719–146743.
- [40] 2021. Enhancing the scalability of multi-FPGA stencil computations via highly optimized HDL components. ACM Trans. Reconfig. Technol. Syst. 14, 3, Article 15 (August 2021), 33 pages.
- [41] 2018. Automatic Code Generation and Optimization of Multi-dimensional Stencil Computations on Distributed-memory Architectures. Ph.D. Dissertation. University of Strasbourg, Strasbourg, France.
- [42] 2010. FPGA-array with bandwidth-reduction mechanism for scalable and power-efficient numerical simulations based on finite difference methods. ACM Trans. Reconfig. Technol. Syst. 3, 4, Article 21 (November 2010), 35 pages.
- [43] 2014. Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth. IEEE Trans. Parallel Distrib. Syst. 25, 3 (2014), 695–705.
- [44] 2018. High Performance Computing: Modern Systems and Practices. Morgan Kaufmann, 294–295.
- [45] 2017. OpenCL-based FPGA-platform for stencil computation and its optimization methodology. IEEE Trans. Parallel Distrib. Syst. 28, 5 (May 2017), 1390–1402.
- [46] 2019. Multi-FPGA accelerator architecture for stencil computation exploiting spacial and temporal scalability. IEEE Access 7 (April 2019), 53188–53201.
- [47] 2020. Shuhai: Benchmarking high bandwidth memory on FPGAs. In Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines. IEEE, 111–119.
- [48] 2012. A hexagonal shaped processor and interconnect topology for tightly-tiled many-core architecture. In Proceedings of the IEEE/IFIP International Conference on VLSI and System-on-Chip. IEEE, 153–158.
- [49] Xilinx 2018. UltraFast design methodology guide for the Vivado Design Suite. Retrieved June 7, 2021 from https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_3/ug949-vivado-design-methodology.pdf.
- [50] Xilinx 2021. Alveo U280 data center accelerator card. Retrieved June 7, 2021 from https://www.xilinx.com/products/boards-and-kits/alveo/u280.html#specifications.
- [51] Xilinx 2021. UltraScale architecture memory resources. Retrieved June 20, 2021 from https://www.xilinx.com/support/documentation/user_guides/ug573-ultrascale-memory-resources.pdf.
- [52] 2018. Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 153–162.
- [53] Zhipeng Gong, Tefang Chen, Fumin Zou, Li Li, and Yingxi Kang. 2014. Implementation of multi-channel FIFO in one BlockRAM with parallel access to one port. Journal of Computers 9, 5 (2014).