skip to main content
10.1145/3295500.3356201acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

Streaming message interface: high-performance distributed memory programming on reconfigurable hardware

Published: 17 November 2019 Publication History

Abstract

Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.

References

[1]
A. M. Aji, L. S. Panwar, F. Ji, K. Murthy, M. Chabbi, P. Balaji, K. R. Bisset, J. Dinan, W. Feng, J. Mellor-Crummey, X. Ma, and R. Thakur. 2016. MPI-ACC: Accelerator-Aware MPI for Scientific Applications. IEEE Transactions on Parallel and Distributed Systems 27, 5 (May 2016), 1401--1414.
[2]
Susan Blackford, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, Michael Heroux, Linda Kaufman, Andrew Lumsdaine, Antoine Petitet, Roldan Pozo, Karin Remington, and Clint Whaley. 2002. An Updated Set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Softw. 28, 2 (June 2002), 135--151.
[3]
E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, M. Abeydeera, L. Adams, H. Angepat, C. Boehn, D. Chiou, O. Firestein, A. Forin, K. S. Gatlin, M. Ghandi, S. Heil, K. Holohan, A. El Husseini, T. Juhasz, K. Kagi, R. Kovvuri, S. Lanka, F. van Megen, D. Mukhortov, P. Patel, B. Perez, A. Rapsang, S. Reinhardt, B. Rouhani, A. Sapek, R. Seera, S. Shekar, B. Sridharan, G. Weisz, L. Woods, P. Yi Xiao, D. Zhang, R. Zhao, and D. Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (Mar 2018), 8--20.
[4]
J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. 2011. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30, 4 (April 2011), 473--491.
[5]
Tomasz S Czajkowski, Utku Aydonat, Dmitry Denisenko, John Freeman, Michael Kinsner, David Neto, Jason Wong, Peter Yiannacouras, and Deshanand P Singh. 2012. From OpenCL to high-performance hardware on FPGAs. In 22nd international conference on field programmable logic and applications (FPL). IEEE, 531--534.
[6]
Johannes de Fine Licht, Simon Meierhans, and Torsten Hoefler. 2018. Transformations of High-Level Synthesis Codes for High-Performance Computing. CoRR abs/1805.08288 (2018). arXiv:1805.08288 http://arxiv.org/abs/1805.08288
[7]
Rob Dimond, Sébastien Racaniere, and Oliver Pell. 2011. Accelerating large-scale HPC Applications using FPGAs. In 2011 IEEE 20th Symposium on Computer Arithmetic. IEEE, 191--192.
[8]
J. Domke, T. Hoefler, and W. E. Nagel. 2011. Deadlock-Free Oblivious Routing for Arbitrary Topologies. In 2011 IEEE International Parallel Distributed Processing Symposium. 616--627.
[9]
Nariman Eskandari, Naif Tarafdar, Daniel Ly-Ma, and Paul Chow. 2019. A Modular Heterogeneous Stack for Deploying FPGAs and CPUs in the Data Center. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '19). ACM, New York, NY, USA, 262--271.
[10]
T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Xu, R. Patel, and M. Herbordt. 2018. FPDeep: Acceleration and Load Balancing of CNN Training on FPGA Clusters. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 81--84.
[11]
A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng, and C. Yang. 2016. Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects. In 2016 IEEE High Performance Extreme Computing Conference (HPEC). 1--7.
[12]
T. Gysi, J. BÃd'r, and T. Hoefler. 2016. dCUDA: Hardware Supported Overlap of Computation and Communication. In SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 609--620.
[13]
Amazon EC2 F1 instances. [n. d.]. https://aws.amazon.com/ec2/instance-types/f1/.
[14]
Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, and Ce Zhang. 2017. FPGA-accelerated dense linear machine learning: A precision-convergence trade-off. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 160--167.
[15]
Jungwon Kim, Seyong Lee, and Jeffrey S. Vetter. 2016. IMPACC: A Tightly Integrated MPI+OpenACC Framework Exploiting Shared Memory Parallelism. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '16). ACM, New York, NY, USA, 189--201.
[16]
Ryohei Kobayashi, Yuma Oobata, Norihisa Fujita, Yoshiki Yamaguchi, and Taisuke Boku. 2018. OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2018). ACM, New York, NY, USA, 192--201.
[17]
A. Lawande, A. D. George, and H. Lam. 2016. An OpenCL Framework for Distributed Apps on a Multidimensional Network of FPGAs. In 2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3). 42--49.
[18]
Tiziano De Matteis, Johannes de Fine Licht, and Torsten Hoefler. 2019. FBLAS: Streaming Linear Algebra on FPGA. CoRR (Aug. 2019).
[19]
Message Passing Interface Forum. 2015. MPI: A Message-Passing Interface Standard, Version 3.1. Specification. https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
[20]
M. Owaida and G. Alonso. 2018. Application Partitioning on FPGA Clusters: Inference over Decision Tree Ensembles. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL). 295--2955.
[21]
Manuel Saldaña, Arun Patel, Christopher Madill, Daniel Nunes, Danyao Wang, Paul Chow, Ralph Wittig, Henry Styles, and Andrew Putnam. 2010. MPI As a Programming Model for High-Performance Reconfigurable Computers. ACM Trans. Reconfigurable Technol. Syst. 3, 4, Article 22 (Nov. 2010), 29 pages.
[22]
Kentaro Sano, Yoshiaki Hatsuda, and Satoru Yamamoto. 2014. Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth. IEEE Transactions on Parallel and Distributed Systems 25, 3 (2014), 695--705.
[23]
Ran Shu, Peng Cheng, Guo Chen, Zhiyuan Guo, Lei Qu, Yongqiang Xiong, Derek Chiou, and Thomas Moscibroda. 2019. Direct Universal Access: Making Data Center Resources Available to FPGA. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). USENIX Association, Boston, MA, 127--140. https://www.usenix.org/conference/nsdi19/presentation/shu
[24]
Stratix 10 GX/SX Product Table. [n. d.]. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/pt/stratix-10-product-table.pdf.
[25]
Versal ACAP AI Core Series Product Table. [n. d.]. https://www.xilinx.com/support/documentation/selection-guides/versal-ai-core-product-selection-guide.pdf.
[26]
Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 65--74.
[27]
Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason Cong. 2016. Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design (ISLPED '16). ACM, New York, NY, USA, 326--331.
[28]
Hamid Reza Zohouri, Artur Podobas, and Satoshi Matsuoka. 2018. Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 153--162.

Cited By

View all
  • (2024)SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC ApplicationsProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656616(413-425)Online publication date: 30-May-2024
  • (2024)Efficient and Scalable Architecture for Multiple-Chip Implementation of Simulated Bifurcation MachinesIEEE Access10.1109/ACCESS.2024.337408912(36606-36621)Online publication date: 2024
  • (2023)Enabling Communication with FPGA-based Network-attached Accelerators for HPC WorkloadsProceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624540(530-538)Online publication date: 12-Nov-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 November 2019

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. distributed memory programming
  2. high-level synthesis tools
  3. reconfigurable computing

Qualifiers

  • Research-article

Funding Sources

Conference

SC '19
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)71
  • Downloads (Last 6 weeks)14
Reflects downloads up to 23 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC ApplicationsProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656616(413-425)Online publication date: 30-May-2024
  • (2024)Efficient and Scalable Architecture for Multiple-Chip Implementation of Simulated Bifurcation MachinesIEEE Access10.1109/ACCESS.2024.337408912(36606-36621)Online publication date: 2024
  • (2023)Enabling Communication with FPGA-based Network-attached Accelerators for HPC WorkloadsProceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624540(530-538)Online publication date: 12-Nov-2023
  • (2023)Scalable Multi-FPGA Design of a Discontinuous Galerkin Shallow-Water Model on Unstructured MeshesProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3592979.3593407(1-12)Online publication date: 26-Jun-2023
  • (2023)Implementation and Performance Evaluation of Collective Communications Using CIRCUS on Multiple FPGAsProceedings of the HPC Asia 2023 Workshops10.1145/3581576.3581602(15-23)Online publication date: 27-Feb-2023
  • (2023)VCSN: Virtual Circuit-Switching Network for Flexible and Simple-to-Operate Communication in HPC FPGA ClusterACM Transactions on Reconfigurable Technology and Systems10.1145/357984816:2(1-32)Online publication date: 13-Jan-2023
  • (2023)GPU–FPGA-accelerated Radiative Transfer Simulation with Inter-FPGA CommunicationProceedings of the International Conference on High Performance Computing in Asia-Pacific Region10.1145/3578178.3578231(117-125)Online publication date: 27-Feb-2023
  • (2023)Multi-FPGA Designs and Scaling of HPC Challenge Benchmarks via MPI and Circuit-switched Inter-FPGA NetworksACM Transactions on Reconfigurable Technology and Systems10.1145/357620016:2(1-27)Online publication date: 9-Jan-2023
  • (2023)Implementation and Performance Evaluation of Memory System Using Addressable Cache for HPC Applications on HBM2 Equipped FPGAsEuro-Par 2022: Parallel Processing Workshops10.1007/978-3-031-31209-0_9(121-132)Online publication date: 2-May-2023
  • (2022)Cygnus - World First Multihybrid Accelerated Cluster with GPU and FPGA CouplingWorkshop Proceedings of the 51st International Conference on Parallel Processing10.1145/3547276.3548629(1-8)Online publication date: 29-Aug-2022
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media