FPGA Architecture Exploration for DNN Acceleration

Published: 10 May 2022
Abstract

Recent years have seen an explosion of machine learning applications implemented on Field-Programmable Gate Arrays (FPGAs). FPGA vendors and researchers have responded by updating their fabrics to more efficiently implement machine learning accelerators, including innovations such as enhanced Digital Signal Processing (DSP) blocks and hardened systolic arrays. Evaluating architectural proposals is difficult, however, due to the lack of publicly available benchmark circuits.

This paper addresses this problem by presenting an open-source benchmark circuit generator that creates realistic DNN-oriented circuits for use in FPGA architecture studies. Unlike previous generators, which create circuits that are agnostic of the underlying FPGA, our circuits explicitly instantiate embedded blocks, allowing for meaningful comparison of recent architectural proposals without the need for a complete inference computer-aided design (CAD) flow. Our circuits are compatible with the VTR CAD suite, allowing for architecture studies that investigate routing congestion and other low-level architectural implications.
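The paper does not include code here, but the key idea of "explicitly instantiating embedded blocks" can be illustrated with a small, purely hypothetical sketch: a generator emits a Verilog netlist that directly instantiates a made-up `dsp_mac` hard block for each multiply-accumulate lane, rather than writing the arithmetic behaviorally and relying on a synthesis tool's DSP-inference pass. The block name and port list below are assumptions for illustration, not the paper's actual interface.

```python
def emit_mac_netlist(name: str, n_lanes: int) -> str:
    """Emit a Verilog netlist that directly instantiates a (hypothetical)
    `dsp_mac` embedded block per MAC lane, bypassing behavioral inference."""
    ports = ["input clk"]
    body = []
    for i in range(n_lanes):
        ports += [f"input [7:0] a{i}", f"input [7:0] b{i}",
                  f"output [15:0] p{i}"]
        # Explicit instantiation: the CAD flow sees the hard block itself,
        # so no inference pass is needed to map the multiply onto a DSP.
        body.append(f"  dsp_mac mac{i} (.clk(clk), .a(a{i}), "
                    f".b(b{i}), .p(p{i}));")
    header = f"module {name} ({', '.join(ports)});"
    return "\n".join([header, *body, "endmodule"])

netlist = emit_mac_netlist("pe_array", 2)
print(netlist)
```

Because the embedded block appears by name in the netlist, two architecture proposals with different hard-block definitions can be compared by regenerating the benchmark against each, with no end-to-end inference compiler in the loop.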

In addition to addressing the lack of machine learning benchmark circuits, the architecture exploration flow that we propose allows for a more comprehensive evaluation of FPGA architectures than traditional static benchmark suites. We demonstrate this through three case studies that illustrate how realistic benchmark circuits can be generated to target different heterogeneous FPGAs.
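The exploration flow described above pairs each candidate architecture with a benchmark tailored to its block mix, rather than reusing one static circuit for all of them. A minimal sketch of such an outer loop, with all names (`generate_benchmark`, `run_place_and_route`, the `Arch` fields) being placeholders rather than the paper's actual API, might look like:

```python
from dataclasses import dataclass

@dataclass
class Arch:
    name: str
    dsp_ratio: float   # fraction of tile columns that are DSP blocks
    bram_kb: int       # block-RAM capacity per tile, in kilobits

def generate_benchmark(arch: Arch) -> str:
    # Placeholder: a real generator would emit a VTR-compatible netlist
    # whose embedded-block usage matches this architecture's block mix.
    return f"{arch.name}_dnn.blif"

def run_place_and_route(netlist: str, arch: Arch) -> dict:
    # Placeholder for invoking a CAD flow (e.g., VPR) on the netlist and
    # parsing its timing and routing reports; the formula below is a toy
    # stand-in so the loop runs end to end.
    return {"critical_path_ns": 5.0 / (1.0 + arch.dsp_ratio),
            "channel_width": 100}

candidates = [Arch("baseline", 0.05, 18), Arch("dsp_rich", 0.15, 18)]
results = {a.name: run_place_and_route(generate_benchmark(a), a)
           for a in candidates}
best = min(results, key=lambda n: results[n]["critical_path_ns"])
print(best)
```

The point of the structure is that the benchmark is regenerated inside the loop: a static suite would penalize any architecture whose block mix differs from the one the suite was written for.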


Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 3 (September 2022), 353 pages.
ISSN: 1936-7406; EISSN: 1936-7414; DOI: 10.1145/3508070
Editor: Deming Chen

Publisher

Association for Computing Machinery, New York, NY, United States
Publication History

• Received: 1 June 2021
• Revised: 1 October 2021
• Accepted: 1 November 2021
• Published: 10 May 2022

Qualifiers: research-article, Refereed
