
Synergy: An HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC

Published: 18 March 2019

Abstract

Convolutional Neural Networks (CNNs) have been widely deployed in diverse application domains. There has been significant progress in accelerating both their training and inference using high-performance GPUs, FPGAs, and custom ASICs in datacenter-scale environments. The recent proliferation of mobile and Internet of Things (IoT) devices has necessitated real-time, energy-efficient deep neural network inference on embedded-class, resource-constrained platforms. In this context, we present Synergy, an automated, hardware-software co-designed, pipelined, high-throughput CNN inference framework for embedded heterogeneous system-on-chip (SoC) architectures (Xilinx Zynq). Synergy leverages, through multi-threading, all the available on-chip resources, which include the dual-core ARM processor along with the FPGA and the NEON Single-Instruction Multiple-Data (SIMD) engines as accelerators. Moreover, Synergy provides a unified abstraction of the heterogeneous accelerators (FPGA and NEON) and can adapt to different network configurations at runtime without changing the underlying hardware accelerator architecture, balancing the workload across accelerators through work-stealing. Synergy achieves a 7.3X speedup, averaged across seven CNN models, over a well-optimized software-only solution, and demonstrates substantially better throughput and energy efficiency than contemporary CNN implementations on the same SoC architecture.
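The work-stealing idea the abstract mentions can be illustrated with a minimal sketch: each accelerator-backed worker drains its own job queue first and, when idle, steals pending jobs from the fullest peer's queue. This is a hypothetical Python simulation under assumed names (`Worker`, `run_pool`, the `fpga`/`neon`/`cpu` labels); it is not Synergy's actual ARM/FPGA scheduler.

```python
# Minimal sketch of per-accelerator work queues with work-stealing,
# loosely modeling how a runtime could balance CNN work items across
# heterogeneous workers (FPGA, NEON SIMD, CPU). Illustrative only.
import threading
from collections import deque

class Worker:
    def __init__(self, name):
        self.name = name
        self.queue = deque()   # this worker's private job queue
        self.done = []         # jobs this worker completed

def run_worker(me, workers):
    while True:
        try:
            job = me.queue.popleft()          # take from own queue first
        except IndexError:
            # Own queue empty: steal from the tail of the fullest peer.
            victim = max(workers, key=lambda w: len(w.queue))
            if not victim.queue:
                return                        # nothing left anywhere
            try:
                job = victim.queue.pop()      # steal one pending job
            except IndexError:
                continue                      # lost the race; retry
        me.done.append(job)                   # "process" the job

def run_pool(num_jobs, names=("fpga", "neon", "cpu")):
    workers = [Worker(n) for n in names]
    # Deliberately imbalanced initial split: every job lands on one
    # worker, so the others can only contribute by stealing.
    workers[0].queue.extend(range(num_jobs))
    threads = [threading.Thread(target=run_worker, args=(w, workers))
               for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return workers
```

The key property, as in the paper's description, is that the scheduler needs no model-specific tuning: however unevenly work arrives, idle accelerators pull jobs until every queue is drained.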

