research-article

Hardware Acceleration of High-Performance Computational Flow Dynamics Using High-Bandwidth Memory-Enabled Field-Programmable Gate Arrays

Published: 06 December 2021

Abstract

Scientific computing is at the core of many High-Performance Computing (HPC) applications, including computational flow dynamics. Because ever larger computational models must be simulated, hardware acceleration is receiving increasing attention for its potential to maximize the performance of scientific computing. Field-Programmable Gate Arrays (FPGAs) could accelerate scientific computing because they allow the memory hierarchy to be fully customized, which is important in irregular applications such as iterative linear solvers. In this article, we study the potential of FPGAs in HPC, motivated by rapid advances in reconfigurable hardware such as larger on-chip memories, a growing number of logic cells, and the integration of High-Bandwidth Memory (HBM) on board. To perform this study, we propose a novel Sparse Matrix-Vector multiplication (SpMV) unit and an ILU0 preconditioner tightly integrated with a BiCGStab solver kernel. We integrate the resulting preconditioned iterative solver into Flow, the state-of-the-art open-source reservoir simulator from the Open Porous Media (OPM) project. Finally, we perform a thorough evaluation of the FPGA solver kernel both in stand-alone mode and integrated into the reservoir simulator, using the NORNE field, a real-world reservoir model whose grid has more than 10⁵ cells with three unknowns per cell.
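The abstract names three building blocks: sparse matrix-vector multiplication (SpMV), an ILU0 preconditioner, and the BiCGStab solver that ties them together. To make the data flow concrete, the following is a minimal CPU reference sketch in pure Python of what one ILU0-preconditioned BiCGStab solve computes. It is not the authors' FPGA design: it assumes a scalar CSR matrix with sorted column indices and a stored diagonal in every row, whereas a reservoir matrix with three unknowns per cell could instead be kept in a block-sparse format with 3×3 blocks. The function names (`spmv`, `ilu0`, `ilu0_apply`, `bicgstab_ilu0`) and the tolerance defaults are hypothetical.

```python
import math


def spmv(vals, cols, row_ptr, x):
    """Compute y = A x for A in CSR format (values, column indices, row pointers)."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[cols[k]]
    return y


def ilu0(vals, cols, row_ptr):
    """ILU0: incomplete LU restricted to the sparsity pattern of A.

    Assumes sorted column indices per row and a stored diagonal in every row.
    Returns one array holding unit-lower L (below the diagonal) and U
    (diagonal and above), plus the index of each diagonal entry.
    """
    n = len(row_ptr) - 1
    lu = list(vals)
    diag = [next(k for k in range(row_ptr[i], row_ptr[i + 1]) if cols[k] == i)
            for i in range(n)]
    for i in range(1, n):
        pos = {cols[k]: k for k in range(row_ptr[i], row_ptr[i + 1])}
        for k in range(row_ptr[i], diag[i]):  # entries a_ij with j < i
            j = cols[k]
            lu[k] /= lu[diag[j]]  # l_ij = a_ij / u_jj
            for m in range(diag[j] + 1, row_ptr[j + 1]):  # u_jc with c > j
                c = cols[m]
                if c in pos:  # update existing nonzeros only; fill-in is dropped
                    lu[pos[c]] -= lu[k] * lu[m]
    return lu, diag


def ilu0_apply(lu, diag, cols, row_ptr, r):
    """Solve (L U) z = r: forward substitution with L, then back substitution with U."""
    n = len(row_ptr) - 1
    y = list(r)
    for i in range(n):
        for k in range(row_ptr[i], diag[i]):
            y[i] -= lu[k] * y[cols[k]]
    z = [0.0] * n
    for i in reversed(range(n)):
        acc = y[i]
        for k in range(diag[i] + 1, row_ptr[i + 1]):
            acc -= lu[k] * z[cols[k]]
        z[i] = acc / lu[diag[i]]
    return z


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def bicgstab_ilu0(vals, cols, row_ptr, b, tol=1e-10, max_iter=200):
    """Preconditioned BiCGStab with an ILU0 preconditioner; returns (x, iterations)."""
    n = len(b)
    lu, diag = ilu0(vals, cols, row_ptr)

    def precond(vec):
        return ilu0_apply(lu, diag, cols, row_ptr, vec)

    x = [0.0] * n
    b_norm = math.sqrt(dot(b, b))
    if b_norm == 0.0:
        return x, 0
    r = list(b)  # r = b - A x with initial guess x = 0
    r_hat = list(r)  # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [rk + beta * (pk - omega * vk) for rk, pk, vk in zip(r, p, v)]
        p_hat = precond(p)
        v = spmv(vals, cols, row_ptr, p_hat)  # first SpMV of the iteration
        alpha = rho_new / dot(r_hat, v)
        s = [rk - alpha * vk for rk, vk in zip(r, v)]
        if math.sqrt(dot(s, s)) / b_norm < tol:
            return [xk + alpha * ph for xk, ph in zip(x, p_hat)], it
        s_hat = precond(s)
        t = spmv(vals, cols, row_ptr, s_hat)  # second SpMV of the iteration
        omega = dot(t, s) / dot(t, t)
        x = [xk + alpha * ph + omega * sh for xk, ph, sh in zip(x, p_hat, s_hat)]
        r = [sk - omega * tk for sk, tk in zip(s, t)]
        if math.sqrt(dot(r, r)) / b_norm < tol:
            return x, it
        rho = rho_new
    return x, max_iter


# Usage: A = tridiag(-1, 4, -1) of size 4, with b chosen so that x = [1, 1, 1, 1].
vals = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
cols = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
row_ptr = [0, 2, 5, 8, 10]
x, iters = bicgstab_ilu0(vals, cols, row_ptr, [3.0, 2.0, 2.0, 3.0])
```

Each iteration performs two SpMVs, two preconditioner applications (each a forward and a backward triangular solve), and several dot products and vector updates. For the tridiagonal usage example, ILU0 discards no fill-in, so it coincides with the exact LU factorization and the solver should converge almost immediately; on 2D or 3D grids ILU0 drops fill-in and more iterations are needed.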



          Published in

          ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 2, June 2022
          310 pages
          ISSN: 1936-7406
          EISSN: 1936-7414
          DOI: 10.1145/3501287
          Editor: Deming Chen

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 December 2021
          • Accepted: 1 July 2021
          • Revised: 1 May 2021
          • Received: 1 January 2021


          Qualifiers

          • research-article
          • Refereed