Abstract
Real-time Deep Neural Network (DNN) inference with low-latency requirements has become increasingly important for numerous applications in both cloud computing (e.g., Apple’s Siri) and edge computing (e.g., Google/Waymo’s driverless car). FPGA-based DNN accelerators have demonstrated both superior flexibility and performance; in addition, for real-time inference with low batch size, FPGAs are expected to achieve further performance improvement. However, the performance gain of a single-FPGA design is limited by scarce on-chip resources. In this paper, we employ multiple FPGAs to cooperatively run DNNs, with the objective of achieving super-linear speedup over a single-FPGA design. In implementing such systems, we found two barriers to this goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) insufficient bandwidth between the off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, “Super-LIP”, which can support different kinds of DNNs. We take Convolutional Neural Networks (CNNs) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of the best partition schemes. Then, we develop a novel design methodology that alleviates the heavy load on memory bandwidth by moving traffic from the memory bus to inter-FPGA links. We implement Super-LIP on ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs achieves a 3.48× speedup over the state-of-the-art single-FPGA design. Moreover, as the number of FPGAs scales up, the system latency can be further reduced while maintaining high energy efficiency.
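To make the term "super-linear" concrete, the short Python sketch below computes the per-device parallel efficiency implied by the reported figures (3.48× speedup with 2 FPGAs). The function names are hypothetical and chosen only for illustration; this is plain arithmetic on the numbers quoted in the abstract, not part of the Super-LIP framework or its system-level model.

```python
def parallel_efficiency(speedup: float, num_devices: int) -> float:
    """Return speedup divided by device count.

    A value above 1.0 means the scaling is super-linear.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return speedup / num_devices


def is_super_linear(speedup: float, num_devices: int) -> bool:
    """True when the speedup exceeds the number of devices used."""
    return speedup > num_devices


# Figures quoted in the abstract: 2 FPGAs, 3.48x over a single-FPGA design.
reported_speedup = 3.48
reported_devices = 2

efficiency = parallel_efficiency(reported_speedup, reported_devices)
print(f"parallel efficiency: {efficiency:.2f}")
print(f"super-linear: {is_super_linear(reported_speedup, reported_devices)}")
```

Under this definition, any efficiency above 1.0 indicates that adding devices yielded more than a proportional gain, which is the sense in which the abstract uses "super-linear".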
Index Terms
Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference