research-article

Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking

Published: 06 June 2022

Abstract

Both modern datacenter and embedded Field Programmable Gate Arrays (FPGAs) provide great opportunities for high-performance and high-energy-efficiency computing. With the growing public availability of FPGAs from major cloud service providers such as AWS, Alibaba, and Nimbix, as well as uniform hardware accelerator development tools (such as Xilinx Vitis and Intel oneAPI) for software programmers, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS).

The major goal of this article is to determine how to efficiently access the memory system of modern datacenter and embedded FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including (1) the clock frequency of the accelerator design, (2) the number of concurrent memory access ports, (3) the data width of each port, (4) the maximum burst access length for each port, and (5) the size of consecutive data accesses. Then, we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the memory systems of datacenter FPGAs (Xilinx Alveo U200 and U280) and an embedded FPGA (Xilinx ZCU104) as these factors vary, and we provide insights into efficient memory access in HLS-based accelerator designs. Comparing the soft and hardened memory systems typically found on datacenter and embedded FPGAs, respectively, we further summarize their unique features and discuss effective approaches to leveraging these systems. To demonstrate the usefulness of our insights, we also conduct two case studies that accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms on datacenter FPGAs with a soft (and thus more flexible) memory system. Compared to the baseline designs, optimized designs leveraging our insights achieve about \( 3.5\times \) and \( 8.5\times \) speedups for the KNN and SpMV accelerators, respectively. Our final optimized KNN and SpMV designs on a Xilinx Alveo U200 FPGA fully utilize its off-chip memory bandwidth and achieve about \( 5.6\times \) and \( 3.4\times \) speedups, respectively, over the 24-core CPU implementations.
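
To make the five factors concrete, the sketch below shows roughly what a sequential-read microbenchmark kernel might look like in HLS-style C++. It is an illustration only, not the authors' benchmark code: the names `Word512`, `read_bench`, and `kernel_side_peak_bytes_per_sec` are hypothetical, the 512-bit port width and burst length of 64 are assumed example settings, and a plain struct stands in for the `ap_uint<512>` type that Vitis HLS code would typically use so that the sketch also compiles with an ordinary C++ compiler (which ignores the `#pragma HLS` lines).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 512-bit port word (8 x 64-bit lanes). Factor (3), the port
// data width, is set by the element type of the memory-mapped pointer.
struct Word512 {
    std::uint64_t lane[8];
};

// Kernel-side upper bound on bandwidth in bytes per second, from
// factors (1) clock frequency, (2) number of ports, and (3) port width.
// Achieved bandwidth is further capped by what the memory can deliver.
inline double kernel_side_peak_bytes_per_sec(double clock_hz,
                                             unsigned num_ports,
                                             unsigned port_width_bits) {
    assert(port_width_bits % 8 == 0);
    return clock_hz * num_ports * (port_width_bits / 8.0);
}

// Sequential-read kernel sketch. One m_axi bundle corresponds to one
// memory access port (factor 2); max_read_burst_length is factor (4);
// num_words controls the size of consecutive accesses (factor 5).
extern "C" void read_bench(const Word512* in, std::uint64_t* out,
                           std::size_t num_words) {
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem0 max_read_burst_length=64
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < num_words; ++i) {
#pragma HLS PIPELINE II=1
        const Word512 w = in[i];  // consecutive addresses allow burst reads
        for (int k = 0; k < 8; ++k) {
#pragma HLS UNROLL
            acc ^= w.lane[k];     // consume every lane so no read is dropped
        }
    }
    *out = acc;  // single write-back keeps the reads observable
}
```

Under these assumptions, a single 512-bit port clocked at an assumed 300 MHz is bounded on the kernel side at 300e6 × 64 B = 19.2 GB/s, whatever the memory could otherwise sustain. Sweeping the port count, port width, burst length, and `num_words` one at a time is the kind of experiment the abstract describes.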

    • Published in

      ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 4
      December 2022
      476 pages
      ISSN: 1936-7406
      EISSN: 1936-7414
      DOI: 10.1145/3540252
      Editor: Deming Chen

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 June 2022
      • Online AM: 9 February 2022
      • Accepted: 1 February 2022
      • Revised: 1 December 2021
      • Received: 1 September 2021

      Qualifiers

      • research-article
      • Refereed