
FFConv: An FPGA-based Accelerator for Fast Convolution Layers in Convolutional Neural Networks

Published: 11 March 2020

Abstract

Image classification is one of the most challenging problems in computer vision, and significant research effort targets systems and algorithms that improve accuracy, performance, area, and power consumption for related problems. Convolutional Neural Networks (CNNs) have been shown to achieve outstanding accuracy on problems such as image classification, object detection, and semantic segmentation. While CNNs are pioneering the development of high-accuracy systems, their excessive computational complexity presents a barrier to more pervasive deployment. Graphics Processing Units (GPUs), owing to their massively parallel architecture, deliver performance orders of magnitude better than general-purpose processors, but they are limited by their high power consumption and general-purpose design. Consequently, Field Programmable Gate Arrays (FPGAs) are being explored for implementing CNN architectures, as they also provide massively parallel logic resources but with relatively lower power consumption than GPUs. In this article, we present FFConv, an efficient FPGA-based fast-convolution-layer accelerator for CNNs. We design a pipelined, high-throughput convolution engine based on the Winograd minimal filtering (also called fast convolution) algorithms to compute the convolutional layers of three popular CNN architectures: VGG16, AlexNet, and ShuffleNet. We implement our accelerator on a Virtex-7 FPGA platform, exploiting computational parallelism to the maximum while exploring optimizations aimed at improving performance. The resulting design loses only 0.43%, 0.47%, and 0.61% Top-1 classification accuracy for VGG16, AlexNet, and ShuffleNet-v1, respectively, while significantly improving throughput, resource efficiency, and power efficiency compared to previous state-of-the-art designs.
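The Winograd minimal filtering idea underlying the accelerator can be illustrated with the smallest 1-D case, F(2,3), which computes two outputs of a 3-tap filter using 4 multiplications instead of the 6 required by direct convolution; the 2-D tile variants used in CNN accelerators nest this same transform. The sketch below is illustrative only (the function name and pure-Python style are ours, not the paper's hardware design):

```python
def winograd_f23(d, g):
    """Winograd minimal filtering F(2,3).

    Computes two consecutive outputs of a 3-tap filter g over the
    4-element input window d, using 4 multiplications (m1..m4)
    instead of the 6 needed by direct convolution.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # The four multiplications of the minimal algorithm.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # Output transform: recombine products into the two filter outputs.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_conv3(d, g):
    """Reference: direct 3-tap convolution over a 4-element window."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

In hardware, the filter-side factors (g0 + g1 + g2)/2 and (g0 - g1 + g2)/2 are precomputed once per kernel, so only the data transform, the element-wise products, and the output transform remain in the datapath.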

