
FPGA Logic Block Architectures for Efficient Deep Learning Inference

Published: 03 June 2020

Abstract

Reducing the precision of deep neural network (DNN) inference accelerators can yield large efficiency gains, with little or no accuracy degradation compared to half- or single-precision floating-point, by enabling more multiplication operations per unit area. A wide range of precisions fall on the Pareto-optimal curve of hardware efficiency vs. accuracy, with no single precision dominating, which makes the variable-precision capability of FPGAs very valuable. We propose three types of logic block architectural enhancements and fully evaluate a total of six architectures that improve the area efficiency of multiplications and additions implemented in the soft fabric. Increasing the LUT fracturability and adding two adders to the ALM (the 4-bit Adder Double Chain architecture) yields a 1.5× area reduction for arithmetic-heavy machine learning (ML) kernels while increasing their speed. This architecture also reduces the logic area of general applications by 6%, while increasing the critical path delay by only 1%. Our highest-impact option, which adds a 9-bit shadow multiplier to the logic clusters, reduces the area and critical path delay of ML kernels by 2.4× and 1.2×, respectively. These large gains come at the cost of a 15% logic area increase for general applications.
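As an illustrative aside (not code from the paper), the efficiency argument rests on low-precision integer multiply-accumulate: narrower operands need smaller multipliers, so more of them fit per unit of soft-fabric area, while quantization keeps results close to the full-precision answer. The sketch below emulates this in software under assumed symmetric uniform quantization; the helper names `quantize` and `int_dot` are hypothetical, chosen for this example.

```python
# Illustrative sketch: a reduced-precision integer dot product of the
# kind an FPGA soft-logic MAC array would compute. The bit width is a
# free parameter; narrower operands imply smaller, denser multipliers.

def quantize(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers.
    Returns the integer codes and the scale needed to dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def int_dot(xq, wq):
    """Integer multiply-accumulate over quantized activations/weights."""
    return sum(x * w for x, w in zip(xq, wq))

x = [0.12, -0.5, 0.33, 0.9]   # toy activations
w = [0.7, 0.1, -0.4, 0.25]    # toy weights

exact = sum(a * b for a, b in zip(x, w))
for bits in (8, 4):
    xq, sx = quantize(x, bits)
    wq, sw = quantize(w, bits)
    approx = int_dot(xq, wq) * sx * sw  # dequantize the accumulator
    print(bits, round(exact, 4), round(approx, 4))
```

Even at 4 bits the dequantized result tracks the floating-point dot product on this toy example, which is the software-level intuition behind trading precision for multiplier density.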



• Published in

  ACM Transactions on Reconfigurable Technology and Systems, Volume 13, Issue 3
  September 2020, 182 pages
  ISSN: 1936-7406
  EISSN: 1936-7414
  DOI: 10.1145/3404107
  • Editor: Deming Chen

        Copyright © 2020 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 3 June 2020
        • Online AM: 7 May 2020
        • Accepted: 1 April 2020
        • Revised: 1 March 2020
        • Received: 1 October 2019

        Qualifiers

        • research-article
        • Research
        • Refereed
