Flexible Systolic Array Platform on Virtual 2-D Multi-FPGA Plane

Systolic arrays are a promising approach to achieving high-performance processing based on highly parallelized designs in various fields, such as AI and bioinformatics. Many previous studies have devoted considerable effort to exploring efficient circuit designs for specific processing. However, workloads keep growing beyond the fixed size of such systolic arrays, forcing us to divide them into smaller pieces for processing. Therefore, we propose a systolic array platform based on a two-dimensional FPGA plane in which multiple FPGAs are connected by a virtual network. The systolic array realized by this platform can be freely customized in shape and size according to the target application. Distributed memory access through off-chip memory on each FPGA board and simple stream processing enable scalable performance. This paper presents a preliminary implementation based on the proposed systolic array platform and its performance evaluation. The evaluation results show that the proposed method improves the processing performance in proportion to the number of FPGAs. The results also show that the proposed platform is highly scalable due to the small circuit area required, and that the processing performance depends on the network bandwidth, which means that recent high-bandwidth FPGA boards can be expected to improve the performance significantly.


INTRODUCTION
The use of systolic array architectures to accelerate computation has grown in recent years with the rise of extremely compute-intensive applications such as Artificial Intelligence (AI) [38,40,43] and sequence matching [3,4,12]; for instance, recent large-scale AI applications such as Large Language Models (LLMs), which have an enormous number of parameters, require systolic arrays that can perform parallel operations quickly and efficiently. The systolic array architecture is suitable for accelerating and streamlining large-scale matrix operations, and dedicated chips or devices based on it are already in use in many production systems. In addition, various companies, universities, and research institutes have proposed dedicated systems to address this need. On the other hand, due to the recent large-scale growth of such requirements, it is common to process huge workloads by partitioning them into tiles of appropriate sizes for the fixed-size systolic arrays on such systems or chips. While this partitioning is a realistic approach, it requires complex designs and operations that have the potential to reduce computational efficiency and operability [41]. Currently available chips with systolic arrays, such as Google TPUs [19,20] and AWS Inferentia [1], provide extremely high performance for training or inference in deep learning, but are not designed to be efficient for a wide variety of applications. When employing systolic arrays for various kinds of applications, systems with fixed-size systolic array circuits may not be the best choice for the above reasons or for cost.
Field Programmable Gate Arrays (FPGAs) [7,9] are often used to implement systolic arrays in hardware. The circuit reconfigurability of FPGAs can provide a flexible systolic array platform in which the size and shape, as well as the processing itself, can be changed. However, two major factors prevent FPGA-based systolic arrays from achieving larger scale and higher performance: resource and bandwidth limitations. The size of the systolic array implemented on an FPGA is limited by the number of available resources [23], such as DSPs and on-chip memory. Also, increasing the size of a systolic array will not improve its performance if the available bandwidth is not sufficient [3]. Because the off-chip memory and other resources that supply data to an FPGA implementation have limited bandwidth, bandwidth often becomes a bottleneck for systolic array operations.
The most straightforward way to solve the resource limitation problem is to employ multiple FPGAs. However, simply using multiple FPGAs can be impractical and not scalable in performance. For example, in a multi-FPGA system connected by a direct network without a network switch, the FPGA network configuration is fixed, which heavily reduces the customizability of systolic arrays. On the other hand, a switched network using existing general-purpose protocols such as Ethernet, while allowing flexible communication, might be a user-unfriendly system due to complex operations such as generating frames according to the protocol. To build an easy-to-use, scalable, and flexible system, it is necessary to devise and select appropriate solutions not only for the network described above but also for mapping systolic arrays to multiple FPGAs and for supplying sufficient I/O bandwidth to the systolic arrays.
To achieve the above requirements, we propose a novel multi-FPGA systolic array platform that incorporates the functions and mechanisms described below. We use the Virtual Circuit-Switching Network (VCSN) [34,35] to build a virtual 2-D reconfigurable FPGA plane, which can be easily customized in shape and size, to implement large-scale systolic arrays. In addition, the unified design of all FPGAs and the high-bandwidth operation of simultaneously driving off-chip memories on multiple FPGA boards make the systolic array both flexible and efficient. In this paper, we introduce the design and operation of our proposed systolic array platform and present a performance evaluation of a preliminary benchmark design. Based on the evaluation results, we analyze the performance of the proposed system and show that our platform is not only flexible but also highly scalable and efficient.
The contributions of this paper are listed below:
• A customizable 2-D multi-FPGA platform for large-scale systolic arrays.
This paper is organized as follows. Section 2 describes previous studies about systolic arrays on FPGAs and other devices to show the novelty of this paper. Section 3 presents the FPGA-based systolic array architecture proposed in this study. Section 4 details the customizable systolic array platform on a virtual 2-D multi-FPGA plane. Section 5 shows the performance of our proposed systolic array platform and discusses and analyzes its performance. Finally, Section 6 presents the conclusion.

RELATED WORK
Systolic array architectures have a tightly coupled hardware structure of regularly arranged homogeneous Data Processing Units (DPUs) to achieve high processing throughput due to high parallelism. The structure of the systolic array enables highly parallel and efficient processing of large amounts of data without additional instructions, memory, or cache accesses [21,26]. In addition, this uniformity and regularity provides systolic arrays with high throughput and scalability. Thus, they are widely used in various fields requiring high performance, such as Deep Neural Networks (DNNs) [40], matrix operations [25,37], bioinformatics [4,12], and error syndrome correction for quantum computers [36].
Many prior studies on systolic arrays leveraged the advantages of FPGAs, such as energy efficiency and low economic cost, and proposed implementations of specific applications, such as dense linear algebra operations (e.g., matrix multiplication) [22], Convolutional Neural Networks (CNNs) [24,38,43], and bioinformatics [3,8,10]. Many of them achieved high performance with a single FPGA by making effective use of FPGA resources and optimizing the design for a specific problem. For CNN applications, Wei et al. [38] proposed an automatic design space exploration framework for CNNs that transforms codes from a C program. Liu et al. [24] proposed the Winograd Processing Element (WinoPE), with which they constructed a highly efficient systolic array accelerator. They showed that a systolic array design with WinoPE and a specially designed memory subsystem achieved high throughput and DSP efficiency on several FPGA products. For bioinformatics, Di Tucci et al. [8] implemented a systolic array design to accelerate the Smith-Waterman algorithm. Subsequently, Siracusa et al. [31,32] ported the same design to different FPGAs using a roofline model-based optimization methodology [39], further improving performance. Abdelhamid et al. [3] presented a block-based systolic array approach to maximize the usage of memory banks in HBM2 on an FPGA, achieving high performance and energy efficiency.
In these single-FPGA systolic arrays, the key to performance is optimizing data delivery and reuse through the effective use of on-chip memory. These innovations avoided bandwidth limitations caused by on-board memory (DDR in most cases) and achieved high performance through designs optimized for the target application. On the other hand, what we propose in this paper is not an efficient system optimized for a specific application, but a highly available systolic array platform that is flexible and easily scalable. Thus, our proposed system does not compete with these optimized FPGA-based systems, but rather coexists with them to achieve even higher performance.
Some studies have already proposed multi-FPGA systolic array architectures, for instance, targeting stencil computing. Sano et al. [28,29] proposed a scalable array architecture with multiple FPGAs that targets stencil computing. Similarly, Del Sozzo et al. [6] presented an automation framework for Iterative Stencil Loops (ISLs) targeting single- and multi-FPGA systems. Finally, Reggiani et al. [27] implemented a library of HDL components to effectively compute ISLs and CNN inference on a scalable multi-FPGA system. These previous studies connected FPGAs through a 1-D ring network using two external ports. On the other hand, our proposed multi-FPGA systolic array uses 2-D FPGA connections over a virtual network while using typical FPGA boards that have only two external ports. Another important multi-FPGA systolic array implementation already in use in industry is Microsoft Brainwave [5], a deep learning platform in the cloud that enables real-time AI inference with multiple FPGAs. It can provide a generic multi-FPGA systolic array similar to our proposal, but each FPGA is connected to the cloud network, and the actual processing follows the traditional approach of distributing the divided workload to each FPGA. Instead, our proposed systolic array platform deploys multiple FPGA resources in a virtual two-dimensional plane to achieve full stream processing without memory access during data transfer between FPGAs.
On the ASIC side, Jouppi et al. [20] proposed the Tensor Processing Unit (TPU), which accelerates the inference stage of neural networks. The TPU delivered a peak throughput of 92 Tera Operations per Second (TOPS), significantly exceeding the processing performance and energy efficiency of state-of-the-art GPUs at the time, indicating that ASICs are very effective in improving performance thanks to advantages such as higher operating frequency and an optimized circuit design. However, the cost required for ASICs is significantly higher than for FPGAs, and they cannot adapt to other applications. Yüzügüler et al. [42] proposed SOSA, which optimizes array granularity, interconnect, and tiling in multi-pod systolic arrays for ASIC implementation. Although their evaluation shows very high performance, and the high utilization and efficiency of their multi-pod approach are attractive, it takes a different direction from our concept of building flexible and cost-efficient systolic arrays with multiple FPGAs connected by virtual networks. Also, the high performance of 600 TeraOps/s in their evaluation results is due to optimizing the design for the benchmark application, with no mention of performance in other cases.
Compared to conventional single-FPGA implementations, our proposal offers novelty in achieving larger-scale systolic arrays by overcoming resource limitations thanks to the multi-FPGA implementation and by providing bandwidth proportional to its scale through distributed memory accesses. Also, compared to previous studies using multiple FPGAs, our advantage is a flexible systolic array constructed by a virtual 2-D mesh network between FPGAs that is independent of the physical configuration, e.g., the number of network ports. While it is difficult to match the extremely high performance of ASIC-based predecessors with designs optimized for specific operations, our advantage is that a systolic array with flexible configurations for different applications can be realized by a low-cost implementation using commercially available FPGA boards.

SYSTOLIC ARRAY ARCHITECTURE
This section describes the systolic array architecture employed in this study. The design of the systolic array must be highly scalable and easy to operate. In addition, we adopt a unified, simple, yet effective design to facilitate implementation and operation on multiple FPGAs. Our design is a common platform for realizing systolic arrays on multiple FPGAs. To this end, we first introduce a basic implementation not specific to any particular computation and then a version featuring Floating-Point (FP) Multiply-Accumulate (MAC) operations. We developed all the components described in this section using High-Level Synthesis (HLS) for easy and fast customization.

DPU Design
The main aspects to consider when developing a DPU of a systolic array are the input/output connection topology and the functionality to implement. Indeed, their combination guarantees a proper distribution of data processing among DPUs. Regarding the connection topology in 2-D systolic arrays, each DPU features four connections: North, West, South, and East. In particular, North and West are the inputs of the DPU, while South and East are the outputs. More technically, each connection implements an Avalon-Streaming (ST) interface featuring the valid, ready, startofpacket, and endofpacket signals [15]. In this way, the DPU design enforces a backpressure mechanism that enables a dynamic computation schedule. Besides, the startofpacket and endofpacket signals manage the beginning and end of the DPU processing. Finally, we employ FIFOs between neighboring DPUs to store incoming/outgoing data and relieve potential pressure issues.
3.1.1 Basic DPU Version. Moving to the functionality, the basic DPU implements a data-forwarding function, which transfers data from North to South and from West to East. Although it might look more straightforward than other systolic arrays [3, 4, 12, 22, 25, 36-38, 40, 43], this design choice allows decoupling the proposed architecture from a specific computation. In this way, we can evaluate the maximum bandwidth our architecture achieves when moving data through the systolic array and, potentially, multiple FPGAs without introducing additional latency due to DPU internal computations. Nonetheless, to maintain a realistic behavior, we devised a synchronization mechanism within each DPU to guarantee that data forwarding to the South and East connections happens only when both North and West data are available. Indeed, without synchronization, data would seamlessly move from input to output connections, preventing the typical wavefront/skew data flow of systolic architectures. Specifically, such a mechanism relies on blocking read operations on the two input FIFOs; consequently, the DPU stalls until both inputs are valid, enabling synchronization. Please note that the DPU design is fully pipelined with an Initiation Interval (II) = 1. Figure 1a gives an overview of the basic DPU version.
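As an illustration, the following listing is a minimal behavioral C++ sketch of the basic DPU, not the actual Intel HLS source: it models the two input FIFOs, the blocking-read synchronization, and the pure forwarding of North data to South and West data to East. The names Dpu and Fifo are introduced only for this sketch.

    #include <cstdint>
    #include <queue>

    // Behavioral stand-in for an Avalon-ST channel with backpressure:
    // data wait in the FIFO until the consumer is ready to read them.
    using Fifo = std::queue<uint32_t>;

    // Basic DPU: forwards North->South and West->East, but only advances
    // when BOTH inputs are available (blocking reads), which produces the
    // wavefront/skew data flow of a systolic array. In hardware this body
    // is fully pipelined with an initiation interval (II) of 1.
    struct Dpu {
        void step(Fifo& north_in, Fifo& west_in, Fifo& south_out, Fifo& east_out) {
            if (north_in.empty() || west_in.empty())
                return;                        // stall: synchronization point
            uint32_t n = north_in.front(); north_in.pop();
            uint32_t w = west_in.front();  west_in.pop();
            south_out.push(n);                 // pure data forwarding,
            east_out.push(w);                  // no computation in the basic DPU
        }
    };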

3.1.2 MAC-DPU Version. The basic DPU version described so far helps evaluate the overall bandwidth our solution can reach when synthesized on an FPGA. However, since no computation occurs, this design cannot give us a benchmark of the obtainable performance. To this end, we also developed a different version of the DPU implementing FP MAC operations. Specifically, in addition to the DPU data-forwarding function, we multiply the North and West input data and accumulate their product internally to the DPU. In particular, we employ a shift register for the accumulation, whose size is greater than (or equal to) the latency of the FP MAC operation. This way, we guarantee an II = 1 for this DPU version too. Finally, please note that this DPU version only aims to evaluate the performance of a systolic array featuring FP operations without tailoring it to a specific computation. Figure 1b shows the design of the MAC-DPU version.
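The listing below is a similarly hedged behavioral sketch of the MAC-DPU accumulator. The point it illustrates is the shift-register accumulation: because an FP add takes several cycles, accumulating into a single register would break II = 1, so partial sums rotate through a buffer at least as deep as the FP MAC latency and are reduced only once at the end. The constant ACC_DEPTH and the final reduction step are illustrative choices, not taken from the actual design.

    #include <array>
    #include <cstddef>

    // Depth of the partial-sum shift register; it must be >= the latency of
    // the FP multiply-accumulate so that each partial sum is touched only
    // once every ACC_DEPTH cycles, keeping the loop at II = 1.
    constexpr std::size_t ACC_DEPTH = 8;

    struct MacDpu {
        std::array<float, ACC_DEPTH> partial{};  // rotating partial sums
        std::size_t head = 0;

        // Called once per (north, west) pair; the inputs are also forwarded
        // unchanged to South/East exactly like the basic DPU (omitted here).
        void mac(float north, float west) {
            partial[head] += north * west;       // one FP MAC per cycle
            head = (head + 1) % ACC_DEPTH;
        }

        // Final reduction of the partial sums after the stream has ended.
        float result() const {
            float acc = 0.0f;
            for (float p : partial) acc += p;
            return acc;
        }
    };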

Systolic Array Design and System-Level Integration
After defining the design of a DPU, we now focus on the systolic array architecture and its single-FPGA integration at a system level.
The systolic array we devised in this study features the following main components: an R × C grid of DPUs, two Dispatchers, and two Collectors. Figure 2 shows a 2×2 systolic array design example. First, the grid implements a 2-D mesh topology with FIFOs placed between neighboring DPUs. Next, Dispatchers X and Y provide input data to the West connection of the leftmost DPUs and the North connection of the topmost DPUs, respectively. In particular, the Dispatchers receive packets containing coalesced data from other components, unpack their content, and distribute the single elements to the target DPUs. For instance, in the case of a 16×16 systolic array, each Dispatcher reads 512-bit packets and sends 32-bit data to the DPUs. Please note that the Dispatchers do not have to implement any skew/wavefront mechanism since additional FIFOs buffer data transfers to the DPUs. This way, data reside in the FIFOs until the North and West inputs are available to a DPU. On the other hand, Collectors X and Y behave in a specular fashion; indeed, they aggregate data from the East/South connections of the rightmost/bottommost DPUs (stored in FIFOs) and send the coalesced packets to the following components. Finally, both Dispatchers and Collectors implement the Avalon-ST interface for input and output connections. Note that this platform does not restrict the connections between DPUs to be unidirectional.
We now focus on the system-level integration of the systolic array. To this end, we utilize our AFU Shell, a kind of FPGA Shell based on the Accelerator Functional Unit (AFU) [14] embedded in the Intel Open FPGA Stack (OFS) [18] (Figure 3). OFS is Intel's open-source hardware and software infrastructure for FPGAs, which comprehensively provides both hardware that implements basic functions on FPGAs and software such as drivers and API libraries for FPGA applications. The AFU Shell is a System-on-Chip (SoC) that incorporates DMA controllers (DMACs), crossbars, ports to network interfaces, etc. in a partially reconfigurable AFU region within an Intel FPGA. The DMACs interface with the off-chip memory banks available on the target board and send/receive data to/from the systolic array through the crossbars. Specifically, we hierarchically organized the crossbars, where the top crossbar directly communicates with the DMACs and the bottom ones (X and Y) with the systolic array. In addition, crossbars X and Y enable data transmission to other FPGAs via the network subsystem (Section 4.2). In this way, according to the configuration of the crossbars, the systolic array can interact with the off-chip memory or with other FPGAs. The system can also send/receive data streams to/from the network without memory access or host intervention, enabling efficient inter-FPGA communication.
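As a concrete illustration of the Dispatcher behavior described at the beginning of this subsection, the sketch below unpacks one 512-bit packet into sixteen 32-bit elements and pushes each element into the FIFO of the corresponding boundary DPU, as in the 16×16 example. The packet layout and the names dispatch and Packet512 are assumptions made only for this sketch.

    #include <array>
    #include <cstdint>
    #include <queue>

    constexpr int SLICE_WIDTH = 16;                        // 16 x 16 DPUs per slice
    using Packet512 = std::array<uint32_t, SLICE_WIDTH>;   // one 512-bit word

    // Dispatcher: splits a coalesced 512-bit packet into 32-bit elements and
    // buffers each element in the FIFO of its target column (or row). The
    // FIFOs absorb the skew, so the Dispatcher itself needs no wavefront logic.
    void dispatch(const Packet512& packet,
                  std::array<std::queue<uint32_t>, SLICE_WIDTH>& dpu_fifos) {
        for (int lane = 0; lane < SLICE_WIDTH; ++lane)
            dpu_fifos[lane].push(packet[lane]);  // element 'lane' -> boundary DPU 'lane'
    }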

SYSTOLIC ARRAY ON MULTI-FPGA
This section presents the implementation and operation of our proposed systolic array on multiple FPGAs connected by a virtual 2-D mesh topology. The novelty of this paper is to treat the virtually 2-D mesh-connected FPGAs as a large reconfigurable plane on which we build a large systolic array. The Virtual Circuit-Switching Network (VCSN) [35] used for this purpose provides a virtual topology that does not depend on the physical connections of the inter-FPGA network, so that a virtual 2-D topology can be constructed using FPGAs with only two external ports, as described below. In addition, the system achieves high I/O bandwidth by simultaneously accessing the off-chip memory (DDR) on multiple FPGA boards.

Multi-FPGA System
Figure 4 shows the network of our multi-FPGA system developed to implement the proposed systolic array platform. This system employs Intel PAC D5005 FPGA boards [13], each of which consists of an Intel Stratix 10 GX 2800 FPGA, two external network ports, and four banks of DDR4 off-chip memory. The two ports are connected to different network switches to form a dual-plane network, in which each connection is bi-directional at 100 Gbps. Two FPGA boards are connected to each CPU server via PCI Express (PCIe) Gen3 x16 slots.
The CPU servers are also connected by their own InfiniBand switch network for intensive control of multiple FPGAs simultaneously.
In our multi-FPGA system, as shown in Figure 3, the DMACs on the FPGA are driven from the CPU server, allowing DMA transfer of data between the memory on the CPU server and the off-chip memory on the FPGA board [30]. In addition, through access to the DMACs and crossbars, the CPU server can control arbitrary streaming data movement along any path among the components connected by Avalon-ST interfaces in the FPGA, such as the off-chip memory, the systolic array, and the network subsystem. For control of multiple FPGAs, parallel operation of multiple servers using the Message Passing Interface (MPI) library [33] enables simultaneous control of many FPGAs connected to those servers with low latency via the server network.

Inter-FPGA Network
Since our FPGA boards have only two external network ports, direct links between FPGAs cannot provide a 2-D network topology. Therefore, we use VCSN to build a virtualized 2-D FPGA plane by configuring virtual links between FPGAs. Each FPGA has a network subsystem region for VCSN, which provides virtual network ports for FPGA applications, as shown in Figure 3. We can configure arbitrary virtual topologies by connecting any pair of these virtual ports on any FPGAs connected to the same network switch. The connections between virtual ports can be easily changed by simply rewriting the corresponding registers in each FPGA. Thanks to VCSN's virtual topology, our multi-FPGA system provides a flexible, customizable 2-D FPGA plane for different shapes and sizes of systolic arrays.
For efficient communication among the virtually 2-D connected FPGAs, we assign communication in the x- and y-directions to each of the two network switches, as shown in Figure 4. The x-direction corresponds to the west-to-east direction, and the y-direction corresponds to the north-to-south direction in Figure 2. This x/y division is also applied to the FPGA's internal circuitry, and data transfers in the x and y directions are controlled independently. Figures 3 and 4 show the x and y directions in red and blue, respectively. With this structure, a systolic array slice on each FPGA can have network and memory connections via I/O streams in the x- and y-directions through the three 4×4 crossbars. The combination of such directional partitioning and VCSN allows users to efficiently perform virtual 2-D communication on the multi-FPGA system. Please note that we refer to the divided systolic array implemented in each FPGA as a slice, as shown in Figure 5.
In multi-FPGA systolic array implementations, the bandwidth and latency of the inter-FPGA network are critical factors that affect overall performance. The measured bandwidth and latency of inter-FPGA communication over VCSN are reported in [35]; the reference bandwidth is approximately 99.4 Gbps (the theoretical value is 100 Gbps), and the latency is approximately 851 ns. Compared to a direct optical cable connection between the FPGA board ports, VCSN's bandwidth is slightly higher, while its latency is about 1.73 times higher. As for bandwidth, since it is almost the theoretical value, we can expect high performance if there are no bottlenecks in the systolic array and the other parts. On the other hand, latency is an overhead in systolic array operation and can be an obstacle to achieving a scalable architecture. In the performance evaluation in the next section, we show the actual performance of our systolic array, including communication latency, as measured on the practical implementation.

Systolic Array Operations
Next, we show how to operate systolic arrays on the multi-FPGA system. Figure 5 shows the systolic array architecture mapped to the FPGA plane (2 × 2 FPGAs). The entire systolic array is divided in two dimensions, and the divided slices are integrated into the Shells of each FPGA. Since each slice has the same design (number of DPUs, aspect ratio, etc.), it is more accurate to say that multiple slices are combined and arranged in two-dimensional space to form a larger systolic array. Thanks to the simple structure shown in Section 3, we can build systolic arrays on FPGA planes formed by an M × N VCSN virtual topology. From the Shell structure in Figure 3, if the interfaces in the x- and y-directions of each FPGA's systolic array slice are properly connected, the entire systolic array operation can be realized by streaming. The control required in each FPGA depends on the location of its systolic array slice. If each slice is represented by its position (x, y), with 0 ≤ x < M and 0 ≤ y < N, as shown in Figure 5, the I/O control for the M × N systolic array slices can be divided into the following conditions: (1) if x = 0 or y = 0: input from local off-chip memories; (2) if x = M − 1 or y = N − 1: output to local off-chip memories; (3) otherwise: input/output from/to the network.
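The small sketch below encodes conditions (1)-(3): given a slice position (x, y) on an M × N plane, it decides, for each direction independently, whether the input comes from local off-chip memory or from the network and whether the output goes to local memory or to the network. The per-direction reading of the conditions and all names are ours, introduced only for illustration.

    enum class Endpoint { Memory, Network };

    struct SliceIo {
        Endpoint x_in, y_in;    // West and North inputs of the slice
        Endpoint x_out, y_out;  // East and South outputs of the slice
    };

    // Conditions (1)-(3) from the text: slices on the left/top edges read from
    // local off-chip memory, slices on the right/bottom edges write results to
    // local off-chip memory, and everything else streams over the network.
    SliceIo slice_io(int x, int y, int M, int N) {
        SliceIo io;
        io.x_in  = (x == 0)     ? Endpoint::Memory : Endpoint::Network;
        io.y_in  = (y == 0)     ? Endpoint::Memory : Endpoint::Network;
        io.x_out = (x == M - 1) ? Endpoint::Memory : Endpoint::Network;
        io.y_out = (y == N - 1) ? Endpoint::Memory : Endpoint::Network;
        return io;
    }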
The following is the procedure for a systolic array operation on our proposed multi-FPGA system. First, we configure the developed circuits that integrate systolic array slices within the Shells on all involved FPGAs. Then, we transfer the necessary data from each CPU server's memory to the local off-chip memory on each FPGA. Next, we configure the virtual network topology by updating the appropriate registers to form the FPGA plane. In the example in Figure 5, the virtual links corresponding to the red and blue arrows from (1) to (4) are configured. The actual data paths on the network are shown in the lower left of Figure 5. Next, we set up the paths of the data streams inside each FPGA by configuring the crossbars. The configuration of the inputs and outputs to/from the systolic array circuitry follows the itemized conditions above. Now that the data stream paths are established, we start DMA read processing in the FPGAs that satisfy condition (1) and DMA write processing in the FPGAs that satisfy condition (2). Data queuing occurs at all DPUs so that data movement is automatically synchronized at each location. At the end of the DMA write operation, the final processing results are written to the right and bottom off-chip memories in Figure 5. To control the involved FPGAs simultaneously, we introduced MPICH 3.0.4 [11], one of the MPI implementations, on all the CPU servers.
A notable feature of this approach is that the DMACs on multiple FPGAs operate simultaneously to exploit the available bandwidth of all the off-chip memories. Furthermore, because of the uniform design of each FPGA, we can realize systolic arrays of various sizes and shapes only by configuring virtual topologies. Our design policy is to fix the input/output data width of the systolic array slices regardless of the data width of the DPUs to perform efficient stream processing. For example, for 32-bit data, if the I/O data width in each direction of the slice is 512 bits, there are 16 × 16 DPUs in each slice. In the same situation, targeting 16-bit data, each slice would have 32 × 32 DPUs.
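The sizing rule just stated can be captured in a few lines; the following sketch (with helper names of our own choosing) derives the number of DPUs per slice from the fixed 512-bit slice I/O width for several element widths.

    #include <cstdio>

    // Fixed I/O stream width of a systolic array slice, per direction.
    constexpr int SLICE_IO_BITS = 512;

    // With W-bit DPUs, each 512-bit stream carries 512/W elements per cycle,
    // so a slice holds (512/W) x (512/W) DPUs.
    constexpr int dpus_per_row(int element_bits) { return SLICE_IO_BITS / element_bits; }
    constexpr int dpus_per_slice(int element_bits) {
        return dpus_per_row(element_bits) * dpus_per_row(element_bits);
    }

    int main() {
        const int widths[] = {16, 32, 64};
        for (int w : widths)
            std::printf("%2d-bit data: %2d x %2d = %4d DPUs per slice\n",
                        w, dpus_per_row(w), dpus_per_row(w), dpus_per_slice(w));
        // 16-bit: 32 x 32 = 1024, 32-bit: 16 x 16 = 256, 64-bit: 8 x 8 = 64
    }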

Data Movement in the Systolic Array
Figure 6 shows the details of data movement on the systolic array slice and the FPGA plane during operation. First, let us focus on the data flow in the systolic array in each FPGA (left side of Figure 6). As explained in Section 3, each DPU can process only when it has received data from both the North and West ports; only then does a DPU process the input data and output the results to the South and East. Therefore, within the systolic array slice, data move from the northwest DPU toward the southeast one level at a time in a wavefront fashion. As a result, this design performs pipeline processing in which data flow in the order of the red-circled numbers in each DPU in Figure 6. On the other hand, the input/output to/from the slice are channels carrying a single data stream in each direction. Therefore, the Dispatcher divides the single input data stream from the outside and feeds it to each DPU. However, since the start time of processing differs for each DPU as described above, there is a time lag in the inputs to the DPUs. To compensate for these gaps, a FIFO is installed in each input channel of the systolic array, as shown in Figure 6. The depth of these FIFOs is equal to the number of DPUs in the vertical (or horizontal) direction. As with the input side, since data must be available in all channels on the output side of the slice, a FIFO of the same size as on the input side is required for each output channel.
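As a worked example of why this FIFO depth is sufficient, the sketch below applies the idealized one-hop-per-cycle wavefront model described above: the Dispatcher delivers one element to every channel each cycle, while channel j is first consumed only after j cycles of skew, so channel j buffers at most j + 1 elements and a depth equal to the slice width covers the worst case. The model and names are ours, for illustration only.

    #include <algorithm>
    #include <cstdio>

    // Wavefront timing in a c x c slice: the boundary DPU of channel j consumes
    // its first input only after the wavefront has reached it (j cycles after
    // streaming starts), while the Dispatcher delivers one element per channel
    // per cycle; afterwards delivery and consumption proceed at the same rate.
    int main() {
        const int c = 16;                  // 16 x 16 DPUs per slice
        int worst = 0;
        for (int j = 0; j < c; ++j) {
            int skew_cycles = j;           // cycles before channel j is first read
            int max_occupancy = skew_cycles + 1;
            worst = std::max(worst, max_occupancy);
        }
        std::printf("worst-case input FIFO occupancy: %d (FIFO depth = %d)\n",
                    worst, c);
    }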
The data flow on the FPGA plane is similar to this wavefront data movement. For example, the FPGAs marked ② on the right side of Figure 6 receive data from memory at the start of processing. However, their processing cannot start until the FPGA marked ① outputs data to ②, due to the data synchronization condition mentioned earlier. As a result, the FPGAs process in the same order as the DPUs, from northwest to southeast. At first glance, such processing would seem to incur a critical overhead. However, once data arrive at all DPUs in the multi-FPGA system, it achieves continuous processing with maximum efficiency, just as in general pipeline processing.
Let us estimate the processing time of hardware with such a structure under ideal circumstances. We assume that a systolic array slice consists of c × c DPUs and that processing runs on M × N FPGAs. We also assume that the number of data elements processed by a single DPU is N_d, the network latency (FPGA-to-FPGA) is a fixed value L_net, the latencies for memory read and write are L_r and L_w, the operational frequency of all FPGAs is f, and timing deviations between FPGAs are ignored. Please note that N_d counts the data from either north or west (the numbers of data from north and west are the same). First, assuming there is no stall during processing, the processing time T_FPGA in one FPGA can be expressed as follows:

T_FPGA = (N_d + 2c) / f + L_r + L_w,    (1)

where the read and write latencies of the off-chip memory affect the processing time in terms of memory-related latencies. The numerator in Equation (1) is the number of clock cycles required for processing. Extending this to multiple FPGAs, the processing time T_total can be expressed as follows:

T_total = (N_d + (M + N) c) / f + (M + N − 2) L_net + L_r + L_w.    (2)
Note that the above equation ignores timing deviations in operation among FPGAs, including memory accesses. From Equation (2), we can see that unless the data size is sufficiently large compared to c, M, and N, the network latency L_net will affect whether scalable performance can be achieved. There may be other overheads in the actual processing, such as the control software for the FPGA on the CPU server. Although the network latency is small, less than 1 µs as mentioned earlier, we will verify the impact of these overheads on the performance of large systolic arrays through an evaluation with real devices in the next section.

EVALUATION
This section presents the performance evaluation of the proposed multi-FPGA systolic array platform on real devices.We evaluate the performance of multiple systolic arrays on virtual 2-D FPGA planes of different shapes and sizes with the DPUs shown in Section 3.
For the evaluation, we used a cluster of 5 CPU servers, each powered by an Intel Xeon Gold 5122 CPU running at 3.60 GHz. Each server has two Intel PAC D5005 FPGA boards connected to the host via PCIe, and each FPGA board contains four DDR4 off-chip memory banks. In this evaluation, we use from one up to a maximum of nine FPGA boards to configure multi-FPGA planes connected by VCSN. Although the theoretical memory bandwidth in our system is 19.2 GB/s per DMAC, the available memory bandwidth in this experiment is 12.8 GB/s because we set the FPGA operating frequency to 200 MHz. In addition, we used two Mellanox SN2100 100 Gbps Ethernet switches for FPGA interconnection, as shown in Figure 4. In terms of hardware design, we developed the systolic array architecture using the Intel HLS Compiler 21.1 and used Intel Quartus Pro 19.2 to generate bitstream files. Note that the experimental values presented in this section are averages of 100 experimental trials.

Theoretical Performance
We now define our system's theoretical throughput (in terms of bandwidth) and performance according to the target multi-FPGA configuration. We first characterize the theoretical I/O throughput TH_I/O, which indicates whether the bottleneck is the DDR throughput TH_DDR at a given frequency f or the network throughput TH_net:

TH_I/O = min(TH_DDR(f), TH_net).    (3)

Given Equation (3) and an M × N grid of FPGAs, the theoretical system throughput TH_sys(M, N) is:

TH_sys(M, N) = (M + N) × TH_I/O.    (4)

For a single-FPGA design, where no network link is involved and the off-chip memory is therefore the only bottleneck, note that Equation (4) turns into:

TH_sys(1, 1) = 2 × TH_DDR(f).    (5)

On the other hand, assuming TH_I/O in GB/s, the theoretical performance of our multi-FPGA systolic array is:

Perf(M, N) = (TH_I/O / B) × M × N × N_DPU × N_op,    (6)

where B is the I/O stream width of a slice in bytes (64 B in our design), N_DPU is the number of DPUs on a single FPGA, and N_op is the number of operations each DPU performs on its input.
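To make these relations concrete, the sketch below evaluates Equations (3), (4), and (6) for the parameters used in our experiments (200 MHz operation, 12.8 GB/s of DDR bandwidth per DMAC, 12.5 GB/s per 100 Gbps link, 16×16 32-bit MAC-DPUs per slice, and two FP operations per DPU). The helper names are ours; the printed values correspond to the theoretical figures used later when discussing throughput and operational performance.

    #include <algorithm>
    #include <cstdio>

    // Parameters of the evaluated system: 512-bit slice I/O at 200 MHz,
    // DDR4 bandwidth per DMAC, and one 100 Gbps link per direction.
    constexpr double TH_DDR  = 12.8;    // GB/s per DMAC at 200 MHz
    constexpr double TH_NET  = 12.5;    // GB/s per virtual link (100 Gbps)
    constexpr double SLICE_B = 64.0;    // slice I/O width: 512 bits = 64 bytes
    constexpr int    N_DPU   = 16 * 16; // DPUs per FPGA (32-bit MAC-DPU slice)
    constexpr int    N_OP    = 2;       // multiply + accumulate per DPU

    // Equation (3): the per-stream I/O throughput is whichever is smaller,
    // the off-chip memory bandwidth or the network bandwidth.
    double th_io() { return std::min(TH_DDR, TH_NET); }

    // Equation (4): M + N DMACs/links feed an M x N FPGA plane in parallel.
    double th_sys(int M, int N) { return (M + N) * th_io(); }

    // Equation (6): sustained performance in GFLOPS; the I/O rate fixes how
    // many 512-bit waves per second enter each slice, and every wave keeps
    // all M * N * N_DPU DPUs busy with N_OP operations.
    double perf_sys(int M, int N) {
        return (th_io() / SLICE_B) * M * N * N_DPU * N_OP;
    }

    int main() {
        std::printf("3x3 FPGAs: %.1f GB/s, %.0f GFLOPS (theoretical)\n",
                    th_sys(3, 3), perf_sys(3, 3));   // ~75 GB/s, ~900 GFLOPS
    }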

Measurement Method
Here, we describe how we operate the system and measure its performance following the procedure in Section 4.3. In practical multi-FPGA operation, processing proceeds in the order of data distribution to the off-chip memory on each FPGA board, processing on the FPGAs, and data collection from these memories followed by verification.
At first, all data are placed in one master CPU server. We use the MPI_Scatter operation to distribute the data to the memories of the other servers. Next, we drive the DMACs in all the FPGAs simultaneously via MPI to move data from the host memories in the servers to the off-chip memory on the FPGA boards. Then, we configure the network topology and crossbars to establish the data paths and run the DMA transfers for the operation. After the operation is completed, we drive the DMAs again to move data from the off-chip memories back to the memories in the servers. Finally, we use the MPI_Gather operation to gather the data into the master server's memory from those in the other servers.
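The host-side control flow just described can be summarized in the following MPI sketch. The MPI calls are the standard API; the fpga_* functions are hypothetical placeholders standing in for the AFU Shell driver operations (PCIe DMA transfers and register configuration), which are not shown here.

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    // Hypothetical placeholders for the AFU Shell driver calls (DMA transfers
    // and register writes that configure VCSN virtual links and crossbars);
    // these are NOT a real API and only mark where those steps would go.
    static void fpga_dma_host_to_ddr(const float*, int)    { /* DMA via PCIe */ }
    static void fpga_configure_topology_and_crossbars(int) { /* register writes */ }
    static void fpga_run_systolic_dma(int)                 { /* per-role DMA read/write */ }
    static void fpga_dma_ddr_to_host(float*, int)          { /* DMA via PCIe */ }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int chunk = 1 << 20;                      // elements per FPGA board
        std::vector<float> all(rank == 0 ? chunk * nprocs : 0);
        std::vector<float> local(chunk);

        // 1) Distribute input data from the master server to every server.
        MPI_Scatter(all.data(), chunk, MPI_FLOAT,
                    local.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

        // 2) Move data into each board's off-chip memory, then set up the
        //    virtual 2-D topology and the in-FPGA crossbar paths.
        fpga_dma_host_to_ddr(local.data(), chunk);
        fpga_configure_topology_and_crossbars(rank);

        // 3) Run the systolic array operation; the measured span corresponds
        //    to Equation (2) plus software/MPI overheads.
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        fpga_run_systolic_dma(rank);                    // DMA read/write per role
        MPI_Barrier(MPI_COMM_WORLD);                    // all DMA writes finished
        double t1 = MPI_Wtime();
        if (rank == 0) std::printf("processing time: %f s\n", t1 - t0);

        // 4) Collect the results back to the master server's memory.
        fpga_dma_ddr_to_host(local.data(), chunk);
        MPI_Gather(local.data(), chunk, MPI_FLOAT,
                   all.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
    }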
In this evaluation, we measure the time required for multi-FPGA systolic array processing, i.e., from the start of the DMA transfer from the FPGA's off-chip memory until all stream processing on the FPGAs is complete and all data are written to the off-chip memory, which corresponds to the time in Equation (2). Please refer to [35] for the details of the data transfer between the host memory and the off-chip memory on the FPGA board via PCIe and of the data transfer by MPI among the CPU servers.
To guarantee the completion of the process on all FPGA boards, we employ MPI_Barrier to synchronize the DMA write completion among all the FPGAs. We use MPI_Wtime to measure the time from the start of systolic array processing to its completion (including the synchronization by MPI_Barrier). Therefore, the measured time includes the gap in start time among FPGA boards, the overhead of the MPI_Barrier operation, and some software overheads. The time obtained here (including software and MPI overheads as well as memory and network latency, as shown in Equation (2)) is defined as the processing time. We calculate throughput and computational performance based on the obtained processing time.

Throughput
First, we show the throughput of the proposed systolic arrays on M × N FPGAs. In this evaluation, each systolic array slice has 16×16 basic DPUs with 32-bit data inputs and no operation. Figure 7 shows the results for different data sizes. Obviously, throughput increases as the data size increases because the ratio of fixed overhead to total processing time decreases. It is noteworthy that the I/O bandwidth of the system depends on the number of operating DMACs, represented by M + N in this design, as shown in Figure 5. For example, in the cases of the 2 × 4 and 3 × 3 FPGA implementations, both drive 6 DMACs and achieve almost the same throughput. Also, when comparing 1 × 1, 2 × 2, and 3 × 3, the throughput ratio is roughly 1:2:3, the same as the ratio of the numbers of DMACs.
We then compare the results with the theoretical values obtained by Equation (4). In the case of 3×3 FPGAs, our system achieves approximately 74 GB/s throughput when the single DMA size is 1 GB, reaching 98.8% efficiency compared to the theoretical bandwidth of 75 GB/s, which is calculated as 12.5 GB/s (the theoretical bandwidth of a link) × 6 (= M + N). Although this is partly because the DPUs perform no operation, this high efficiency indicates that the entire system design can almost fully exploit the available network bandwidth.
We also estimate the performance of DPU designs with 16-bit, 32-bit, and 64-bit input widths when the single DMA size is 1 GB. In our design, the number of DPUs decreases by a factor of 4 when the input width of the DPUs doubles. Note that this estimation is based on the assumption that the number of operations per DPU is 1. The results shown in Figure 8 indicate that the performance is proportional to the number of DPUs, N_DPU. From this relationship, for example, the performance for 8-bit data, which is important in CNNs, is theoretically 4 times higher than that for 16-bit data, or 6.4 TOPS for 3 × 3 FPGAs. This performance is expected to increase in proportion to the number of operations per DPU.

Operational Performance
Then we evaluate the actual computational performance of the multi-FPGA systolic array with MAC-DPUs.Each DPU has two operations, multiplication and accumulation, for input FP data streams.We evaluate the throughput and performance with both 32-bit single-precision and 64-bit double-precision FP data.
Figure 9 shows the actual throughput and computational performance of the MAC-DPU systolic arrays when the data size of each DMA is 1 GB. Since the obtained processing times are almost the same as in the cases of the basic version, the increase in overhead due to the additional operations has little effect on the overall performance. The DSP block in Intel Stratix 10 FPGAs [17] combines multiplication and addition for single-precision FP data with a latency of five clock cycles. This results in less than 100 clock cycles (on the order of several hundred nanoseconds) of additional latency per FPGA compared to the basic version. Therefore, the efficient DSP block architecture in recent FPGAs prevents a significant increase in latency for FP operations even in the proposed large-scale systolic array.
The left graph of Figure 9 shows that there is no difference between the throughput of systolic arrays with 32-bit and 64-bit DPUs, since the input width to the systolic array is the same. In addition, the throughput in each case is proportional to the value of M + N, the same as in the basic version. Furthermore, in all cases except 1 × 1, the results are close to the theoretical throughput calculated from the network bandwidth.
The right graph in Figure 9 shows the computational performance in each case in GFLOPS. Also, Table 1 shows how M × N FPGAs improve the performance compared to the 1 × 1 case. These results prove that the processing performance of our multi-FPGA systolic array is roughly proportional to the number of FPGAs. For instance, in the case of 2 × 2, the 32-bit and 64-bit designs reach about 397 and 99 GFLOPS, respectively. In the case of 3 × 3, the processing performance improves by about 2.25 (= 9/4) times in both cases (892 and 223 GFLOPS), proving that our solution provides scalable performance for systolic array operations on multiple FPGAs.

Resource and Frequency
Table 2 shows the required FPGA resources and maximum operating frequencies for systolic array slices with 32-bit and 64-bit MAC-DPUs. The Adaptive Logic Module (ALM) is Intel's FPGA logic element with an 8-input LUT, two adders, and four registers. Values in parentheses show the percentage of the total resources of the Intel Stratix 10 GX 2800 used in this experiment. Overall, it is clear that the proposed systolic array consumes a considerably small amount of resources in this implementation. The 32-bit design uses more ALMs than the 64-bit one because the former has more DPUs and more logic required to control them. The 32-bit design also consumes more on-chip memory because of the larger number of FIFOs in the systolic array. Please note the low utilization of the DSP blocks (4.4%), which indicates plenty of room for more FP operations per DPU to improve performance. As for the maximum operating frequency, the 64-bit DPU case is 15% higher than the 32-bit one. There is a trend that the larger the DPU input/output width, the higher the operating frequency, which also appeared in the comparison in Figure 8. We believe the reason is that the input/output width of the whole systolic array is constant, so reducing the DPU input/output width increases the number of DPUs, which increases the complexity of routing among modules.

Discussions
The observations about our proposed platform from the results are listed below.
• The performance of this multi-FPGA systolic array is determined by bandwidth.
• Latency (circuit and network) has a small impact on processing performance.
• Efficient use of FPGA computational resources can be expected to significantly improve performance.
Regarding the first point, in the evaluation results, the memory bandwidth in the case of a single FPGA and the network bandwidth in the case of multiple FPGAs determine the overall performance. Of course, if the network bandwidth is higher than the memory bandwidth, the memory bandwidth will be the rate-limiting factor even for multiple FPGAs. For example, assuming that our design is extended to the latest FPGA boards capable of 400 Gbps communication [2,16], the 4x increase in input/output bandwidth is expected to result in a 4 × 4 = 16x increase in operational performance. The fact that the performance of our proposed system is bandwidth-dependent is expected to be advantageous given the recent trend toward FPGA boards with higher-bandwidth memory and networks.
Regarding the second point, since this implementation is limited to 9 FPGAs, there is a possibility that latency will affect the performance when a large number of FPGAs are used. However, compared to the 1-D network implementations shown in the related work, the proposed 2-D system is expected to show a slower increase in latency as more FPGAs are used.
The third is the most important point: this system leaves plenty of room for optimization for specific applications through efficient use of on-chip memory and DSPs. Since this study only proposes a multi-FPGA systolic array platform, it is important that sufficient resource headroom remains for implementing more optimized designs for specific applications. Another issue is that this design limits data transfer between FPGAs to one direction, but in practice, bi-directional communication is also possible by increasing the number of virtual links. In this case, the available bandwidth is reduced because multiple virtual links share a single physical link, but the system has the flexibility to support a variety of applications.

CONCLUSIONS
This paper proposed a novel flexible systolic array platform on a virtual 2-D multi-FPGA plane. We developed a multi-FPGA systolic array platform with a flexible virtual 2-D mesh network, which allows the systolic array to be customized to the shapes and sizes required by the target applications. We also demonstrated simultaneous memory access on each FPGA board to provide scalable I/O bandwidth for large systolic arrays. With these features, large, highly efficient systolic arrays of arbitrary shapes can be easily realized on multi-FPGA systems. In the experimental evaluation, our proposed systolic array showed scalable throughput on a multi-FPGA system connected by 2-D meshes of different shapes. The systolic array with MAC-DPUs, implemented as a practical application, also showed scalable processing performance proportional to the number of FPGAs. In addition, our implementation used less than 25% of the FPGA logic elements, less than 10% of the on-chip memory, and less than 5% of the DSP blocks, indicating that there is much room for optimization and performance improvement for specific applications. Furthermore, since the performance of the proposed system is always limited by bandwidth (memory or network), significant performance improvements can be expected from the latest FPGA boards with high-bandwidth memory and networks.
Our future plan is to build a practical systolic array for specific applications on this platform and compare its performance with that of other systems.We will also evaluate the performance of the platform with a larger number of FPGA boards and discuss its scalability.

Figure 2 :
Figure 2: A 2×2 systolic array design example.

Figure 1 :
Figure 1: Overview of the basic and MAC-DPU versions.

Figure 3 :
Figure 3: Single FPGA integration of the systolic array platform.

Figure 4 :
Figure 4: Physical network connection of our multi-FPGA system.

Figure 6 :
Figure 6: Data movement on the systolic array and the FPGA plane in operation. Since the channels outside the Dispatchers and Collectors are combined into a single data stream, they require synchronization of the data to/from the DPUs.

Figure 7 :
Figure 7: Single DMA data size vs. throughput in M×N FPGA systolic arrays. The single DMA size corresponds to the data size per DMA transfer shown on the right side of Figure 5.

Figure 8 :
Figure 8: Estimated performance with various DPU input widths.

Table 2 :
Resources and frequency of a MAC-DPU systolic array slice.