
SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs

Published: 17 April 2023

Abstract

Stencil computation is one of the fundamental computing patterns in many application domains, such as scientific computing and image processing. While there are promising studies that accelerate stencils on FPGAs, there is still no automated acceleration framework that systematically explores both spatial and temporal parallelism for iterative stencils, which can be either computation-bound or memory-bound. In this article, we present SASA, a scalable and automatic stencil acceleration framework for modern HBM-based FPGAs. SASA takes a high-level stencil DSL description and an FPGA platform as inputs, automatically selects the best spatial and temporal parallelism configuration based on our accurate analytical model, and generates the optimized FPGA design with that configuration in TAPA high-level synthesis C++, together with its corresponding host code. Compared to SODA, the state-of-the-art automatic stencil acceleration framework, which exploits only temporal parallelism, SASA achieves an average speedup of 3.41× and a speedup of up to 15.73× on the HBM-based Xilinx Alveo U280 FPGA board for a wide range of stencil kernels.
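As an illustration of the terms used in the abstract, the following is a minimal software sketch of an iterative stencil, not SASA's generated design and not its DSL. It assumes a 5-point Jacobi kernel on a 2D grid with fixed boundary cells; the function names (`jacobi_step`, `jacobi_step_tiled`, `run_stencil`) are hypothetical. Splitting one iteration into independent row bands with a one-row halo corresponds loosely to spatial parallelism, while chaining several iterations back to back corresponds loosely to the temporal dimension.

```python
def jacobi_step(grid):
    """Apply one 5-point Jacobi iteration; boundary cells are copied unchanged."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return out


def jacobi_step_tiled(grid, num_tiles):
    """Same iteration, computed band by band (illustrative spatial split).

    Each band of interior rows is processed together with one halo row
    above and below, so the bands do not depend on each other within
    a single iteration.
    """
    rows = len(grid)
    out = [row[:] for row in grid]
    interior = rows - 2
    band = -(-interior // num_tiles)  # ceiling division
    for t in range(num_tiles):
        start = 1 + t * band
        stop = min(start + band, rows - 1)
        if start >= stop:
            continue  # more tiles than interior rows
        tile = [row[:] for row in grid[start - 1:stop + 1]]  # band plus halo
        result = jacobi_step(tile)
        out[start:stop] = result[1:-1]  # drop the halo rows
    return out


def run_stencil(grid, iterations):
    """Chain several iterations (the temporal dimension of the stencil)."""
    for _ in range(iterations):
        grid = jacobi_step(grid)
    return grid
```

In this sketch the tiled and untiled versions compute identical values for a single iteration; a hardware design would additionally have to exchange halo data between bands across iterations, which the sketch does not model.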


          • Published in

            ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 2
            June 2023
            451 pages
            ISSN: 1936-7406
            EISSN: 1936-7414
            DOI: 10.1145/3587031
            • Editor: Deming Chen

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 17 April 2023
            • Online AM: 31 January 2023
            • Accepted: 7 November 2022
            • Revised: 1 September 2022
            • Received: 25 January 2022
