Abstract
High-performance Computing (HPC) applications are becoming increasingly demanding in both performance and power, pushing HPC systems to their limits. Existing HPC systems have not yet reached exascale performance, mainly due to power limitations. Extrapolating from today’s top HPC systems, about 100–200 MW would be required to sustain exaflop-level performance. A promising solution for tackling power limitations is the deployment of energy-efficient reconfigurable resources, in the form of Field-programmable Gate Arrays (FPGAs), tightly integrated with conventional CPUs. However, current FPGA tools and programming environments are optimized for accelerating a single application, or even a single task, on a single FPGA device. In this work, we present UNILOGIC (Unified Logic), a novel HPC-tailored parallel architecture that efficiently incorporates FPGAs. UNILOGIC adopts the Partitioned Global Address Space (PGAS) model and extends it to include hardware accelerators, i.e., tasks implemented on the reconfigurable resources. The main advantages of UNILOGIC are that (i) the hardware accelerators can be accessed directly by any processor in the system, and (ii) the hardware accelerators can access any memory location in the system. In this way, the proposed architecture offers a unified environment in which all the reconfigurable resources can be used seamlessly by any processor or operating system. The UNILOGIC architecture also provides hardware virtualization of the reconfigurable logic, so that the hardware accelerators can be shared among multiple applications or tasks.
The FPGA layer of the architecture is implemented by splitting its reconfigurable resources into (i) a static partition, which provides the PGAS-related communication infrastructure, and (ii) fixed-size, dynamically reconfigurable slots that can be programmed and accessed independently or combined to support both fine- and coarse-grained reconfiguration. Finally, the UNILOGIC architecture has been evaluated on a custom prototype that consists of two 1U chassis, each of which includes eight interconnected daughter boards, called Quad-FPGA Daughter Boards (QFDBs). Each QFDB supports four tightly coupled Xilinx Zynq Ultrascale+ MPSoCs as well as 64 Gigabytes of DDR4 memory; the prototype thus features a total of 64 Zynq MPSoCs and 1 Terabyte of memory. We tuned and evaluated the UNILOGIC prototype using low-level (bare-metal) performance tests as well as two popular real-world HPC applications, one compute-intensive and one data-intensive. Our evaluation shows that UNILOGIC is 2.5 to 400 times faster and 46 to 300 times more energy efficient than conventional parallel systems that use only high-end CPUs, and that it outperforms GPUs by a factor of 3 to 6 in time to solution and 10 to 20 in energy to solution.
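As a rough illustration of the partitioned global address space idea, and as a check of the prototype totals quoted above, the following Python sketch maps a flat global address to an owning node and a local offset. The chassis, QFDB, MPSoC, and memory counts come from the abstract. Everything else is a hypothetical assumption made only for this sketch and is not the UNILOGIC address map: the equal split of each QFDB's DDR4 among its four MPSoCs, the contiguous layout, and the names `Location`, `encode`, `decode`, and `total_memory_gib`.

```python
from dataclasses import dataclass

# Prototype figures stated in the abstract.
CHASSIS = 2
QFDBS_PER_CHASSIS = 8
MPSOCS_PER_QFDB = 4
DDR4_GIB_PER_QFDB = 64

# Assumption for illustration only: every MPSoC owns an equal, contiguous
# slice of one flat global address space. The real layout is not given here.
GIB = 1 << 30
NODES = CHASSIS * QFDBS_PER_CHASSIS * MPSOCS_PER_QFDB
BYTES_PER_NODE = DDR4_GIB_PER_QFDB * GIB // MPSOCS_PER_QFDB


@dataclass(frozen=True)
class Location:
    node: int    # index of the MPSoC that owns the address
    offset: int  # byte offset inside that node's partition


def encode(node: int, offset: int) -> int:
    """Build a global address from a node index and a local offset."""
    if not 0 <= node < NODES:
        raise ValueError("node index out of range")
    if not 0 <= offset < BYTES_PER_NODE:
        raise ValueError("offset outside the node's partition")
    return node * BYTES_PER_NODE + offset


def decode(global_addr: int) -> Location:
    """Split a global address into its owning node and local offset."""
    if not 0 <= global_addr < NODES * BYTES_PER_NODE:
        raise ValueError("address outside the global address space")
    return Location(global_addr // BYTES_PER_NODE, global_addr % BYTES_PER_NODE)


def total_memory_gib() -> int:
    """Total DDR4 across the prototype, in GiB."""
    return CHASSIS * QFDBS_PER_CHASSIS * DDR4_GIB_PER_QFDB
```

Under these assumptions, any processor or accelerator that holds a global address can work out which node owns the data without consulting a directory. That is the property a PGAS-style design relies on when an accelerator on one node reads memory attached to another.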
UNILOGIC: A Novel Architecture for Highly Parallel Reconfigurable Systems