Abstract
Modern datacenter and embedded Field Programmable Gate Arrays (FPGAs) both provide great opportunities for high-performance, energy-efficient computing. With the growing public availability of FPGAs from major cloud service providers such as AWS, Alibaba, and Nimbix, as well as unified hardware accelerator development tools for software programmers (such as Xilinx Vitis and Intel oneAPI), hardware and software developers can now easily access FPGA platforms. However, developing efficient FPGA accelerators is nontrivial, especially for software programmers who use high-level synthesis (HLS).
The major goal of this article is to determine how to efficiently access the memory system of modern datacenter and embedded FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including (1) the clock frequency of the accelerator design, (2) the number of concurrent memory access ports, (3) the data width of each port, (4) the maximum burst access length for each port, and (5) the size of consecutive data accesses. Then, we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate how the memory systems of datacenter FPGAs (Xilinx Alveo U200 and U280) and an embedded FPGA (Xilinx ZCU104) perform as these factors change, and we provide insights into efficient memory access in HLS-based accelerator designs. Comparing the soft and hardened memory systems typically found on datacenter and embedded FPGAs, respectively, we further summarize their unique features and discuss effective approaches to leveraging them. To demonstrate the usefulness of our insights, we also conduct two case studies that accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms on datacenter FPGAs with a soft (and thus more flexible) memory system. Compared to the baseline designs, optimized designs leveraging our insights achieve about \( 3.5\times \) and \( 8.5\times \) speedups for the KNN and SpMV accelerators, respectively. Our final optimized KNN and SpMV designs on a Xilinx Alveo U200 FPGA fully utilize its off-chip memory bandwidth and achieve about \( 5.6\times \) and \( 3.4\times \) speedups, respectively, over the 24-core CPU implementations.
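To make the first three factors concrete, the following is a minimal first-order sketch, not the article's model or microbenchmark code. It assumes each port transfers one full-width word per cycle, so the bandwidth a design can request is at most clock frequency times number of ports times port width. The function names and all numbers are hypothetical and chosen only for illustration; factors (4) and (5), burst length and consecutive access size, determine how close a real design gets to this bound and are not modeled here.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not from the article): upper bound, in GB/s, on the
// bandwidth an accelerator can request, assuming every port moves one
// full-width word per clock cycle. Covers factors (1)-(3) only.
double requested_peak_gbps(double clock_mhz, int num_ports, int port_width_bits) {
    assert(clock_mhz > 0.0 && num_ports > 0 && port_width_bits > 0);
    const double bytes_per_cycle = num_ports * (port_width_bits / 8.0);
    return clock_mhz * 1e6 * bytes_per_cycle / 1e9;
}

// Hypothetical helper: the achievable bound is also capped by the peak
// bandwidth of the memory itself (an input the caller must supply).
double bounded_peak_gbps(double requested_gbps, double memory_peak_gbps) {
    return std::min(requested_gbps, memory_peak_gbps);
}
```

For example, with an assumed 300 MHz clock and a single port, `requested_peak_gbps(300.0, 1, 32)` gives 1.2 GB/s while `requested_peak_gbps(300.0, 1, 512)` gives 19.2 GB/s, which illustrates why a narrow port alone can leave most of the off-chip bandwidth unused.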
Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking