Abstract
Image classification is one of the most challenging problems in computer vision, and significant research targets systems and algorithms that improve accuracy, performance, area, and power consumption for related problems. Convolutional Neural Networks (CNNs) have been shown to deliver outstanding accuracy on problems such as image classification, object detection, and semantic segmentation. While CNNs are driving the development of high-accuracy systems, their excessive computational complexity is a barrier to wider deployment. Graphics Processing Units (GPUs), owing to their massively parallel architecture, have been shown to deliver performance orders of magnitude better than general-purpose processors, but they are limited by their high power consumption and generality. Consequently, Field Programmable Gate Arrays (FPGAs) are being explored for implementing CNN architectures, as they also provide massively parallel logic resources but at relatively lower power consumption than GPUs. In this article, we present FFConv, an efficient FPGA-based fast convolutional layer accelerator for CNNs. We design a pipelined, high-throughput convolution engine based on the Winograd minimal filtering (also called Fast Convolution) algorithms for computing the convolutional layers of three popular CNN architectures: VGG16, AlexNet, and ShuffleNet. We implement our accelerator on a Virtex-7 FPGA platform, where we exploit computational parallelism to the maximum while exploring optimizations aimed at improving performance. The resulting design loses only 0.43%, 0.47%, and 0.61% Top-1 classification accuracy for VGG16, AlexNet, and ShuffleNet-v1, respectively, while significantly improving throughput, resource, and power efficiency compared to previous state-of-the-art designs.
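To make the idea behind Winograd minimal filtering concrete, the following is a minimal sketch of the smallest one-dimensional case, F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. The abstract does not state which tile sizes the accelerator uses, so the choice of F(2,3) and the function names `direct_fir` and `winograd_f23` are assumptions made purely for illustration; this is not the FFConv hardware design.

```python
def direct_fir(d, g):
    """Direct computation: 2 outputs from 4 inputs and a 3-tap filter (6 multiplies)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs using only 4 element-wise multiplies.

    Illustrative sketch only; the tile size is an assumption, not taken from the paper.
    """
    # Filter transform (depends only on the weights, so it can be precomputed).
    u = [
        g[0],
        (g[0] + g[1] + g[2]) / 2,
        (g[0] - g[1] + g[2]) / 2,
        g[2],
    ]
    # Input transform (additions and subtractions only).
    v = [
        d[0] - d[2],
        d[1] + d[2],
        d[2] - d[1],
        d[1] - d[3],
    ]
    # Element-wise products: the only multiplications involving input data.
    m = [u[i] * v[i] for i in range(4)]
    # Output transform (additions and subtractions only).
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    taps = [1, 0, -1]
    print(direct_fir(data, taps))
    print(winograd_f23(data, taps))
```

Because the filter transform can be computed once per set of weights, the per-tile cost is dominated by the four products, which is the kind of multiplier saving that makes the approach attractive on hardware with a limited number of multipliers.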
FFConv: An FPGA-based Accelerator for Fast Convolution Layers in Convolutional Neural Networks