Abstract
Stencil computation is one of the fundamental computing patterns in many application domains, such as scientific computing and image processing. While there are promising studies that accelerate stencils on FPGAs, an automated acceleration framework that systematically explores both spatial and temporal parallelism for iterative stencils, which can be either computation-bound or memory-bound, is still lacking. In this article, we present SASA, a scalable and automatic stencil acceleration framework for modern HBM-based FPGAs. SASA takes a high-level stencil DSL and the target FPGA platform as inputs, automatically identifies the best spatial and temporal parallelism configuration based on our accurate analytical model, and generates the optimized FPGA design for that configuration in TAPA high-level synthesis C++, along with its corresponding host code. Compared to SODA, the state-of-the-art automatic stencil acceleration framework, which exploits only temporal parallelism, SASA achieves an average speedup of 3.41× and a speedup of up to 15.73× on the HBM-based Xilinx Alveo U280 FPGA board across a wide range of stencil kernels.
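For readers unfamiliar with the pattern, the following is a minimal software sketch of an iterative stencil. The 5-point Jacobi kernel, the copied boundary cells, and the function names `jacobi2d_step` and `iterate` are illustrative assumptions; this is neither the SASA DSL nor the generated TAPA code. It only shows where the two kinds of parallelism named in the abstract come from: the cell updates within one iteration are independent of each other (spatial), and successive iterations form a chain of stages (temporal).

```python
def jacobi2d_step(grid):
    """Apply one 5-point Jacobi iteration; boundary cells are copied unchanged."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    # Every interior cell reads only the previous grid, so all (i, j) updates
    # in this step are independent: this is the source of spatial parallelism.
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (
                grid[i][j]
                + grid[i - 1][j]
                + grid[i + 1][j]
                + grid[i][j - 1]
                + grid[i][j + 1]
            ) / 5.0
    return out


def iterate(grid, iterations):
    """Run the stencil for a fixed number of iterations."""
    # Each iteration consumes the output of the previous one. Hardware designs
    # can chain several such stages back to back: temporal parallelism.
    for _ in range(iterations):
        grid = jacobi2d_step(grid)
    return grid
```

Calling `iterate(grid, 2)` on a small list-of-lists grid applies the kernel twice. Roughly speaking, a kernel that performs little arithmetic per cell loaded tends to be limited by memory bandwidth, whereas chaining many iterations on chip raises the compute done per loaded cell.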
SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs