
Instruction Driven Cross-layer CNN Accelerator for Fast Detection on FPGA

Published: 15 December 2018

Abstract

In recent years, Convolutional Neural Networks (CNNs) have been widely applied in computer vision and have achieved significant improvements in object detection tasks. Although many optimization methods exist to speed up CNN-based detection algorithms, it is still difficult to deploy detection algorithms on real-time, low-power systems. The Field-Programmable Gate Array (FPGA) has been widely explored as a platform for accelerating CNNs due to its promising performance, high energy efficiency, and flexibility. Previous work shows that the energy consumption of CNN accelerators is dominated by memory access. By fusing multiple layers of a CNN, the intermediate data transfer between layers can be reduced. However, previous accelerators with cross-layer scheduling are designed for a particular CNN model. In addition to memory-access optimization, the Winograd algorithm can greatly improve the computational performance of convolution.
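
The computational saving from Winograd convolution can be seen in the smallest case, F(2,3): two outputs of a 3-tap filter cost 4 multiplications instead of the 6 a direct method needs. The sketch below is illustrative only (function names `direct_conv` and `winograd_f23` are not from the article); in 2-D, F(2×2, 3×3) cuts multiplications from 36 to 16 the same way.

```python
# Minimal 1-D Winograd F(2,3) sketch (illustrative, not the article's code):
# 2 outputs of a 3-tap convolution with 4 multiplies instead of 6.

def direct_conv(d, g):
    """Direct 3-tap valid convolution over a 4-element input: 6 multiplies."""
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs using only 4 multiplies."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]   # input tile
g = [0.5, 1.0, -1.0]       # filter taps
print(winograd_f23(d, g))   # matches direct_conv(d, g)
```

The filter-side transforms (the terms involving only `g`) can be precomputed once per filter, so the per-tile cost on hardware is dominated by the 4 element-wise multiplies.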

In this article, to improve hardware flexibility, we design an instruction-driven CNN accelerator for object detection that supports both the Winograd algorithm and cross-layer scheduling. We modify the loop-unrolling order of the CNN so that it can be scheduled across different layers with instructions, eliminating the intermediate data transfer. We propose a hardware architecture that supports these instructions with Winograd computation units and reaches state-of-the-art energy efficiency. To deploy image detection algorithms onto the proposed accelerator with fixed-point computation units, we adopt a fixed-point fine-tuning method, which preserves the accuracy of the detection algorithms.
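
The idea behind cross-layer scheduling can be sketched on a 1-D toy: two successive 3-tap convolution layers are evaluated tile by tile, so each layer-1 result stays in an on-chip buffer and only layer-2 outputs are written back. All names below (`conv3`, `fused_two_layers`, the tile size) are illustrative assumptions, not the article's instruction set.

```python
# Hedged sketch of cross-layer (fused) scheduling on a 1-D toy example.
# The layer-1 feature map never exists in full: each tile of it lives only
# in the local variable `mid`, standing in for an on-chip buffer.

def conv3(x, w):
    """1-D valid convolution with a 3-tap filter."""
    return [sum(x[i + k] * w[k] for k in range(3)) for i in range(len(x) - 2)]

def fused_two_layers(x, w1, w2, tile=4):
    out = []
    i = 0
    # To produce `tile` layer-2 outputs we need tile+2 layer-1 outputs,
    # hence tile+4 inputs: consecutive input tiles overlap (a "pyramid"
    # of dependence), trading a little recomputation-free refetch for
    # never storing the intermediate feature map off chip.
    while i + tile + 4 <= len(x):
        region = x[i:i + tile + 4]    # fetch one input tile from "DRAM"
        mid = conv3(region, w1)       # layer-1 tile stays on chip
        out.extend(conv3(mid, w2))    # only layer-2 results written back
        i += tile
    return out
```

Evaluated over a whole input whose length is a multiple of the tile size, the fused schedule produces exactly the layer-by-layer result while cutting the off-chip traffic for the intermediate map to zero.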

We evaluate our accelerator and scheduling policy on the Xilinx KU115 FPGA platform. With the cross-layer strategy, the intermediate data transfer is reduced by more than 90% on the VGG-D CNN model, and the performance of our hardware accelerator reaches 1,700 GOP/s on the VGG-D classification model. We also implement a framework for object detection algorithms, which achieves 2.3× and 50× higher energy efficiency than GPU and CPU, respectively. Compared with floating-point algorithms, the accuracy of the fixed-point detection algorithms drops by less than 1%.
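
For intuition on why fixed-point inference loses so little accuracy, the sketch below shows a common symmetric fixed-point format for FPGA arithmetic: an 8-bit signed integer with a power-of-two scale (a "fraction length"). The specific format and names (`to_fixed`, `from_fixed`, 6 fraction bits) are assumptions for illustration, not the article's quantization scheme.

```python
# Hedged illustration of an 8-bit fixed-point representation: values are
# scaled by 2**frac_bits, rounded, and saturated to the signed 8-bit range.
# The rounding error is bounded by half of one quantization step.

def to_fixed(x, frac_bits=6, width=8):
    """Quantize a real value to a signed `width`-bit fixed-point integer."""
    q = round(x * (1 << frac_bits))
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, q))       # saturate to the representable range

def from_fixed(q, frac_bits=6):
    """Recover the real value a fixed-point integer represents."""
    return q / (1 << frac_bits)

w = 0.7391
q = to_fixed(w)                      # 0.7391 * 64 = 47.3 -> 47
print(from_fixed(q))                 # 0.734375, error below 2**-7
```

A fine-tuning pass, as adopted in the article, lets the network adapt its weights to exactly this kind of rounding and saturation, which is why the end-to-end accuracy loss stays small.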



Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 11, Issue 3
Special Issue on Deep Learning on FPGAs
September 2018, 187 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3299999
Editor: Steve Wilton

Copyright © 2018 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 November 2017
• Revised: 1 June 2018
• Accepted: 1 October 2018
• Published: 15 December 2018

        Qualifiers

        • research-article
        • Research
        • Refereed
