
Rethinking Embedded Blocks for Machine Learning Applications

Published: 30 November 2021

Abstract

The underlying goal of FPGA architecture research is to devise flexible substrates that implement a wide variety of circuits efficiently. Contemporary FPGA architectures have been optimized to support networking, signal processing, and image processing applications through high-precision digital signal processing (DSP) blocks. The recent emergence of machine learning has created a new set of demands characterized by: (1) higher computational density and (2) low-precision arithmetic requirements. With the goal of exploring this new design space in a methodical manner, we first propose a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations, which covers many basic linear algebra primitives and standard deep neural network (DNN) kernels. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then proposed, together with a family of new embedded blocks called MLBlocks. An MLBlock instance includes several multiply-accumulate units connected via flexible routing, where each configuration performs a few parallel dot-products in a systolic-array fashion. This architecture is parameterized with support for different data movements, reuse, and precisions, utilizing a columnar arrangement that is compatible with existing FPGA architectures. On synthetic benchmarks, we demonstrate that for 8-bit arithmetic, MLBlocks offer 6× improved performance over the commercial Xilinx DSP48E2 architecture with smaller area and delay; and for time-multiplexed 16-bit arithmetic, they achieve 2× higher performance per area at the same area and frequency. All source code and data, along with documents to reproduce all the results in this article, are available at http://github.com/raminrasoulinezhad/MLBlocks.
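The nested-loop MAC formulation mentioned above can be illustrated with a small sketch (not taken from the paper; the function name and loop structure here are only an assumption for illustration). It instantiates the loop nest for a matrix-matrix multiply, one of the linear algebra primitives the formulation covers; DNN kernels such as convolutions correspond to deeper variants of the same nest.

```python
def mac_loop_nest(A, B):
    """Compute C = A x B as a nest of explicit MAC operations.

    Hypothetical illustration of the paper's problem formulation:
    the outer loops select an output element, and the innermost
    (reduction) loop performs one multiply-accumulate per iteration.
    """
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0] * N for _ in range(M)]
    for m in range(M):              # output row
        for n in range(N):          # output column
            acc = 0
            for k in range(K):      # reduction: one MAC per step
                acc += A[m][k] * B[k][n]
            C[m][n] = acc
    return C
```

A systolic implementation, as in an MLBlock configuration, would unroll part of this reduction loop across parallel MAC units and stream operands between them rather than recomputing the indexing each cycle.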



              • Published in

                ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 1 (March 2022), 262 pages
                ISSN: 1936-7406
                EISSN: 1936-7414
                DOI: 10.1145/3494949
                Editor: Deming Chen

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery, New York, NY, United States

              Publication History

              • Published: 30 November 2021
              • Accepted: 1 October 2021
              • Revised: 1 August 2021
              • Received: 1 June 2021
