Abstract
Convolutional Neural Networks (CNN) have been widely deployed in diverse application domains. There has been significant progress in accelerating both their training and inference using high-performance GPUs, FPGAs, and custom ASICs for datacenter-scale environments. The recent proliferation of mobile and Internet of Things (IoT) devices have necessitated real-time, energy-efficient deep neural network inference on embedded-class, resource-constrained platforms. In this context, we present Synergy, an automated, hardware-software co-designed, pipelined, high-throughput CNN inference framework on embedded heterogeneous system-on-chip (SoC) architectures (Xilinx Zynq). Synergy leverages, through multi-threading, all the available on-chip resources, which includes the dual-core ARM processor along with the FPGA and the NEON Single-Instruction Multiple-Data (SIMD) engines as accelerators. Moreover, Synergy provides a unified abstraction of the heterogeneous accelerators (FPGA and NEON) and can adapt to different network configurations at runtime without changing the underlying hardware accelerator architecture by balancing workload across accelerators through work-stealing. Synergy achieves 7.3X speedup, averaged across seven CNN models, over a well-optimized software-only solution. Synergy demonstrates substantially better throughput and energy-efficiency compared to the contemporary CNN implementations on the same SoC architecture.
- ARM Infocenter. 2018. Retrieved from http://infocenter.arm.com.Google Scholar
- A. Dundar, J. Jin, B. Martini, and E. Culurciello. 2017. Embedded streaming deep neural networks accelerator with applications. IEEE Trans. Neural Networks Learn. Syst. 28, 7 (July 2017), 1572--1583.Google Scholar
Cross Ref
- Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong. 2017. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 152--159.Google Scholar
- K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang. 2018. Angel-eye: A complete design flow for mapping CNN onto embedded FPGA. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37, 1 (Jan. 2018), 35--47.Google Scholar
Cross Ref
- S. Hashemi, N. Anthony, H. Tann, R. I. Bahar, and S. Reda. 2017. Understanding the impact of precision quantization on the accuracy and energy of neural networks. In Design, Automation Test in Europe Conference Exhibition (DATE), 2017. 1474--1479. Google Scholar
Digital Library
- Gopalakrishna Hegde, Siddhartha, Nachiappan Ramasamy, and Nachiket Kapre. 2016. CaffePresso: An optimized library for deep learning on embedded accelerator-based platforms. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES’16). ACM, New York, NY, Article 14, 10 pages. Google Scholar
Digital Library
- Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM’14). ACM, New York, NY, 675--678. Google Scholar
Digital Library
- J. H. Kim, B. Grady, R. Lian, J. Brothers, and J. H. Anderson. 2017. FPGA-based CNN inference accelerator synthesized from multi-threaded C software. In 2017 30th IEEE International System-on-Chip Conference (SOCC). 268--273.Google Scholar
- Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (Nov. 1998), 2278--2324.Google Scholar
Cross Ref
- Enno Lübbers and Marco Platzner. 2009. ReconOS: Multithreaded programming for reconfigurable computers. ACM Trans. Embed. Comput. Syst. 9, 1, Article 8 (Oct. 2009), 33 pages. Google Scholar
Digital Library
- J. Nagi, F. Ducatelle, G. A. Di Caro, D. Cireşan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella. 2011. Max-pooling convolutional neural networks for vision-based hand gesture recognition. In 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA). 342--347.Google Scholar
- Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.Google Scholar
- Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’16). ACM, New York, NY, 26--35. Google Scholar
Digital Library
- Joseph Redmon. 2018. Darknet: Open Source Neural Networks in C. Retrieved February 8, 2018 from https://pjreddie.com/darknet.Google Scholar
- Christoph Rüthing et al. 2016. Self-adaptation in programmable automation controllers based on hybrid multi-cores. Master Thesis, University of Paderborn (2016).Google Scholar
- Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Maximizing CNN accelerator efficiency through resource partitioning. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA’17). ACM, New York, NY, 535--547. Google Scholar
Digital Library
- Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’16). ACM, New York, NY, 16--25. Google Scholar
Digital Library
- Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’17). ACM, New York, NY, 65--74. Google Scholar
Digital Library
- S. I. Venieris and C. S. Bouganis. 2016. fpgaConvNet: A framework for mapping convolutional neural networks on FPGAs. In 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 40--47.Google Scholar
- Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. fpgaConvNet: Automated mapping of convolutional neural networks on FPGAs (abstract only). In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’17). ACM, New York, NY, 291--292. Google Scholar
Digital Library
- Ying Wang, Jie Xu, Yinhe Han, Huawei Li, and Xiaowei Li. 2016. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In Proceedings of the 53rd Annual Design Automation Conference (DAC’16). ACM, New York, NY, Article 110, 6 pages. Google Scholar
Digital Library
- Xilinx Inc. 2018. Retrieved from http://www.xilinx.com.Google Scholar
- Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2016. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In Proceedings of the 35th International Conference on Computer-Aided Design (ICCAD’16). ACM, New York, NY, Article 12, 8 pages. Google Scholar
Digital Library
- Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, New York, NY, 161--170. Google Scholar
Digital Library
- Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason Cong. 2016. Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design (ISLPED’16). ACM, New York, NY, 326--331. Google Scholar
Digital Library
- G. Zhong, A. Prakash, S. Wang, Y. Liang, T. Mitra, and S. Niar. 2017. Design space exploration of FPGA-based accelerators with multi-level parallelism. In Design, Automation Test in Europe Conference Exhibition (DATE). 1141--1146. Google Scholar
Digital Library
Index Terms
Synergy: An HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC
Recommendations
Preliminary Experiments with XKaapi on Intel Xeon Phi Coprocessor
SBAC-PAD '13: Proceedings of the 2013 25th International Symposium on Computer Architecture and High Performance ComputingThis paper presents preliminary performance comparisons of parallel applications developed natively for the Intel Xeon Phi accelerator using three different parallel programming environments and their associated runtime systems. We compare Intel OpenMP, ...
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance ComputingThe graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
An integrated high-level hardware/software partitioning methodology
Embedded systems are widely used in many sophisticated applications. To speed the time-to-market cycle, the hardware and software co-design has become one of the main methodologies in modern embedded systems. The most important challenge in the embedded ...






Comments