Abstract
Reducing the precision of deep neural network (DNN) inference accelerators below half- or single-precision floating point can yield large efficiency gains with little or no accuracy degradation, by enabling more multiplication operations per unit area. A wide range of precisions fall on the Pareto-optimal curve of hardware efficiency vs. accuracy, with no single precision dominating, making the variable-precision capabilities of FPGAs very valuable. We propose three types of logic block architectural enhancements and fully evaluate a total of six architectures that improve the area efficiency of multiplications and additions implemented in the soft fabric. Increasing the LUT fracturability and adding two adders to the ALM (the 4-bit Adder Double Chain architecture) leads to a 1.5× area reduction for arithmetic-heavy machine learning (ML) kernels, while increasing their speed. In addition, this architecture also reduces the logic area of general applications by 6%, while increasing the critical path delay by only 1%. Our highest-impact option, which adds a 9-bit shadow multiplier to the logic clusters, reduces the area and critical path delay of ML kernels by 2.4× and 1.2×, respectively. These large gains come at the cost of a 15% logic area increase for general applications.
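The workload these architectures target is the integer multiply-accumulate (MAC) at the heart of quantized DNN inference: operands are rounded to a few bits, products are summed in a wider accumulator, and the result is rescaled once at the end. A minimal sketch of that scheme (the values, scales, and function names below are illustrative, not from the paper):

```python
def quantize(x, bits, scale):
    """Round x to a signed fixed-point integer of the given bit width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, round(x / scale)))

def quantized_mac(activations, weights, bits=4, a_scale=0.1, w_scale=0.05):
    """Multiply-accumulate with low-precision operands.

    Products of narrow integers are summed in a wide accumulator (as the
    FPGA carry chains would do), then rescaled once at the end.
    """
    acc = 0  # accumulator kept wide to avoid overflow
    for a, w in zip(activations, weights):
        acc += quantize(a, bits, a_scale) * quantize(w, bits, w_scale)
    return acc * a_scale * w_scale

acts = [0.3, -0.2, 0.5, 0.1]
wts = [0.15, 0.05, -0.1, 0.2]
ref = sum(a * w for a, w in zip(acts, wts))   # full-precision reference
approx = quantized_mac(acts, wts)             # 4-bit operand version
print(ref, approx)
```

For well-chosen scales the 4-bit result tracks the floating-point dot product closely, which is why hardware that packs many narrow multipliers per unit area is attractive for inference.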
FPGA Logic Block Architectures for Efficient Deep Learning Inference