Abstract
Recently, deep learning (DL) has achieved best-in-class results for numerous applications, but at a high computational cost that necessitates high-performance, energy-efficient acceleration. The reconfigurability of FPGAs is appealing given the rapid evolution of DL models, but it also leads to lower performance and area efficiency compared to ASICs. In this article, we implement three state-of-the-art computing architectures (CAs) for convolutional neural network (CNN) inference on both FPGAs and ASICs. By comparing the FPGA and ASIC implementations, we highlight the area and performance costs of programmability and pinpoint the inefficiencies in current FPGA architectures. We perform our experiments using three variations of these CAs, running AlexNet, VGG-16, and ResNet-50, to allow extensive comparisons. We find that the performance gap varies significantly, from 2.8× to 6.3×, while the area gap is consistent across CAs, with an average FPGA-to-ASIC area ratio of 8.7. Among the different blocks of the CAs, the convolution engine, which constitutes up to 60% of the total area, has a high area ratio ranging from 13 to 31. Motivated by these FPGA vs. ASIC comparisons, we suggest FPGA architectural changes such as increasing the DSP block count, enhancing low-precision support in DSP blocks, and rethinking the on-chip memories to reduce the programmability gap for DL applications.
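To make the headline gap figures concrete, the short Python sketch below shows how per-block and whole-design FPGA-to-ASIC area ratios of this kind can be tabulated. The block names and area values are illustrative placeholders only, not measurements from the article; the computation simply divides FPGA area by ASIC area per block and for the full design.

# A minimal sketch of tabulating FPGA-to-ASIC area ratios per block and overall.
# The block names and areas below are hypothetical placeholders for illustration,
# NOT measurements reported in the article.
fpga_area = {"convolution_engine": 60.0, "on_chip_buffers": 25.0, "control_logic": 8.0}
asic_area = {"convolution_engine": 3.0, "on_chip_buffers": 4.0, "control_logic": 1.0}

def area_ratios(fpga, asic):
    """Return per-block FPGA-to-ASIC area ratios and the whole-design ratio."""
    per_block = {name: fpga[name] / asic[name] for name in fpga}
    overall = sum(fpga.values()) / sum(asic.values())
    return per_block, overall

per_block, overall = area_ratios(fpga_area, asic_area)
for name, ratio in per_block.items():
    share = fpga_area[name] / sum(fpga_area.values())
    print(f"{name}: {ratio:.1f}x area ratio, {share:.0%} of FPGA area")
print(f"overall FPGA-to-ASIC area ratio: {overall:.1f}x")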