Abstract
In recent years, Convolutional Neural Networks (CNNs) have been widely applied in computer vision and have achieved significant improvements in object detection tasks. Although many optimization methods exist to speed up CNN-based detection algorithms, it is still difficult to deploy these algorithms on real-time, low-power systems. Field-Programmable Gate Arrays (FPGAs) have been widely explored as a platform for accelerating CNNs because of their promising performance, high energy efficiency, and flexibility. Previous work shows that the energy consumption of CNN accelerators is dominated by memory access. Fusing multiple CNN layers reduces the intermediate data transfer between layers. However, previous accelerators with cross-layer scheduling are designed for a particular CNN model. In addition to memory access optimization, the Winograd algorithm can greatly improve the computational performance of convolution.
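As a small illustration of where the Winograd saving comes from, the minimal case F(2,3) produces two outputs of a 3-tap filter with 4 multiplications between data and filter terms instead of 6. The sketch below is a 1D example only; the function names are hypothetical, and it does not reflect the tile size, data layout, or number format used by the accelerator.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter, computed directly (6 multiplications)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs using the Winograd F(2,3) minimal filtering form."""
    # Filter-side terms depend only on the weights, so they can be
    # computed once per filter and reused for every input tile.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Only these 4 products involve the input data.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    return [m1 + m2 + m3, m2 - m3 - m4]


data = [1, 2, 3, 4]
weights = [1, 2, 3]
print(direct_f23(data, weights))
print(winograd_f23(data, weights))
```

The two functions return the same values; the Winograd form trades multiplications for extra additions, which is the trade-off that makes it attractive when multipliers are the scarce resource.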
In this article, to improve hardware flexibility, we design an instruction-driven CNN accelerator for object detection that supports both the Winograd algorithm and cross-layer scheduling. We modify the loop unrolling order of the CNN so that a network can be scheduled across different layers with instructions, eliminating the intermediate data transfer. We propose a hardware architecture that supports these instructions with Winograd computation units and reaches state-of-the-art energy efficiency. To deploy image detection algorithms onto the proposed accelerator with fixed-point computation units, we adopt a fixed-point fine-tuning method, which preserves the accuracy of the detection algorithms.
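The general idea behind cross-layer scheduling can be sketched in software: instead of computing a whole intermediate feature map and then the next layer, each output tile pulls just the input patch it needs and keeps the intermediate values in a local buffer. The following is a minimal 1D sketch with two 3-tap layers; the function names and the tile size are assumptions for illustration and are not the instruction set or loop order of the accelerator.

```python
def conv1d(x, w):
    """Valid 1D convolution (correlation form) over a Python list."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def layer_by_layer(x, w1, w2):
    """Baseline: the full intermediate map is materialised between layers."""
    intermediate = conv1d(x, w1)
    return conv1d(intermediate, w2)


def fused(x, w1, w2, tile=4):
    """Fused schedule: each output tile computes its own slice of layer 1."""
    halo = (len(w1) - 1) + (len(w2) - 1)  # extra inputs needed per tile
    n_out = len(x) - halo
    out = []
    for start in range(0, n_out, tile):
        stop = min(start + tile, n_out)
        patch = x[start:stop + halo]      # input region for this tile
        local = conv1d(patch, w1)         # intermediate stays in a local buffer
        out.extend(conv1d(local, w2))
    return out


x = list(range(10))
w1 = [1, 0, -1]
w2 = [1, 2, 1]
print(layer_by_layer(x, w1, w2))
print(fused(x, w1, w2))
```

In this sketch neighbouring tiles recompute the overlapping halo values of the first layer, so the saving in intermediate storage is paid for with some redundant computation.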
We evaluate our accelerator and scheduling policy on the Xilinx KU115 FPGA platform. With the cross-layer strategy, the intermediate data transfer is reduced by more than 90% on the VGG-D CNN model. As a result, our hardware accelerator reaches 1700 GOP/s on the classification model VGG-D. We also implement a framework for object detection algorithms, which achieves 2.3× and 50× higher energy efficiency than a GPU and a CPU, respectively. Compared with the floating-point algorithms, the accuracy of the fixed-point detection algorithms drops by less than 1%.
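For readers unfamiliar with fixed-point deployment, the rounding step that introduces the accuracy gap can be sketched as follows. This shows only the quantization of a value to a signed fixed-point grid; the 8-bit width, the number of fractional bits, and the function name are assumptions for illustration, and the fine-tuning procedure itself is not shown.

```python
def to_fixed_point(value, frac_bits, total_bits=8):
    """Round a real value to a signed fixed-point grid and return it as a float.

    total_bits and frac_bits are illustrative assumptions, not the
    bit widths used by the accelerator.
    """
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = round(value * scale)            # nearest representable integer code
    q = max(q_min, min(q_max, q))       # saturate instead of wrapping around
    return q / scale


for v in (0.3, -1.27, 100.0):
    print(v, "->", to_fixed_point(v, frac_bits=4))
```

Values inside the representable range pick up a small rounding error, while values outside it saturate; fine-tuning a network with such quantization applied is one way to let the weights adapt to both effects.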
Index Terms
Instruction Driven Cross-layer CNN Accelerator for Fast Detection on FPGA