You Cannot Improve What You Do not Measure: FPGA vs. ASIC Efficiency Gaps for Convolutional Neural Network Inference

Published: 12 December 2018

Abstract

Recently, deep learning (DL) has become best-in-class for numerous applications, but at a high computational cost that necessitates high-performance, energy-efficient acceleration. The reconfigurability of FPGAs is appealing given the rapid pace of change in DL models, but it also incurs lower performance and area efficiency compared to ASICs. In this article, we implement three state-of-the-art computing architectures (CAs) for convolutional neural network (CNN) inference on both FPGAs and ASICs. By comparing the FPGA and ASIC implementations, we quantify the area and performance costs of programmability and pinpoint the inefficiencies in current FPGA architectures. We run our experiments using three variations of these CAs for AlexNet, VGG-16, and ResNet-50 to allow extensive comparisons. We find that the performance gap varies significantly, from 2.8× to 6.3×, while the area gap is consistent across CAs, with an average FPGA-to-ASIC area ratio of 8.7×. Among the blocks of the CAs, the convolution engine, which constitutes up to 60% of the total area, has a high area ratio ranging from 13× to 31×. Motivated by these FPGA vs. ASIC comparisons, we suggest FPGA architectural changes such as increasing DSP block count, enhancing low-precision support in DSP blocks, and rethinking the on-chip memories, to reduce the programmability gap for DL applications.
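A chip-level FPGA-to-ASIC area ratio is an area-weighted aggregate of the per-block ratios: a dominant, high-ratio block (like the convolution engine here) pulls the overall figure up. The sketch below illustrates this arithmetic only; the block fractions and per-block ratios are hypothetical placeholders, not the paper's measured data.

```python
def overall_area_ratio(blocks):
    """Chip-level FPGA-to-ASIC area ratio from per-block data.

    blocks: list of (fpga_area_fraction, fpga_to_asic_ratio) pairs,
    where fractions describe each block's share of total FPGA area.
    """
    fpga_total = sum(frac for frac, _ in blocks)               # normalized FPGA area
    asic_total = sum(frac / ratio for frac, ratio in blocks)   # implied ASIC area
    return fpga_total / asic_total

# Hypothetical breakdown: convolution engine at 60% of FPGA area with a
# 22x block ratio, buffers at 30% with 5x, control logic at 10% with 3x.
blocks = [(0.60, 22.0), (0.30, 5.0), (0.10, 3.0)]
print(round(overall_area_ratio(blocks), 1))  # → 8.3
```

Note that the aggregate is a weighted harmonic-style mean: even though the convolution engine's own ratio is 22×, the lower-ratio blocks pull the chip-level figure down toward the single digits, consistent with the kind of gap the abstract reports.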


Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 11, Issue 3
Special Issue on Deep Learning on FPGAs
September 2018, 187 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3299999
Editor: Steve Wilton
Copyright © 2018 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
Publication History

• Received: 1 December 2017
• Revised: 1 April 2018
• Accepted: 1 July 2018
• Published: 12 December 2018

        Qualifiers

        • research-article
        • Research
        • Refereed
