Rethinking Embedded Blocks for Machine Learning Applications

Abstract
The underlying goal of FPGA architecture research is to devise flexible substrates that implement a wide variety of circuits efficiently. Contemporary FPGA architectures have been optimized to support networking, signal processing, and image processing applications through high-precision digital signal processing (DSP) blocks. The recent emergence of machine learning has created a new set of demands characterized by (1) higher computational density and (2) low-precision arithmetic. To explore this new design space methodically, we first propose a problem formulation involving nested loops over multiply-accumulate (MAC) operations, which covers many basic linear algebra primitives and standard deep neural network (DNN) kernels. We then propose a quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks, together with a family of new embedded blocks called MLBlocks. An MLBlock instance comprises several multiply-accumulate units connected via flexible routing, where each configuration performs a few parallel dot-products in a systolic-array fashion. The architecture is parameterized to support different data movements, reuse patterns, and precisions, and uses a columnar arrangement that is compatible with existing FPGA architectures. On synthetic benchmarks, we demonstrate that for 8-bit arithmetic, MLBlocks offer 6× higher performance than the commercial Xilinx DSP48E2 architecture with smaller area and delay; for time-multiplexed 16-bit arithmetic, MLBlocks achieve 2× higher performance per area at the same area and frequency. All source code and data, along with documentation to reproduce the results in this article, are available at http://github.com/raminrasoulinezhad/MLBlocks.
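As a rough illustration of the "nested loops over MAC operations" formulation (a sketch for intuition, not code from the article), consider how a 2D convolution, like matrix multiply and other linear algebra primitives, reduces to nested loops whose innermost statement is a single multiply-accumulate:

```python
def conv2d_mac(inputs, weights):
    """2D convolution (valid padding, stride 1) expressed as nested MAC loops.

    inputs:  H x W list of lists
    weights: K x K list of lists
    Illustrative only; function name and structure are not from the article.
    """
    H, W = len(inputs), len(inputs[0])
    K = len(weights)
    out_h, out_w = H - K + 1, W - K + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):              # loop over output rows
        for x in range(out_w):          # loop over output columns
            acc = 0
            for ky in range(K):         # loop over kernel rows
                for kx in range(K):     # loop over kernel columns
                    # the innermost body is one MAC operation
                    acc += inputs[y + ky][x + kx] * weights[ky][kx]
            out[y][x] = acc
    return out
```

Tiling and reordering these loops, and mapping the innermost MACs onto parallel hardware units, is exactly the design space that a coarse-grained block with configurable data movement and reuse aims to cover.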