research-article
Public Access

Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference

Published: 08 October 2019

Abstract

Real-time Deep Neural Network (DNN) inference with low-latency requirements has become increasingly important for numerous applications in both cloud computing (e.g., Apple's Siri) and edge computing (e.g., Google/Waymo's driverless car). FPGA-based DNN accelerators have demonstrated both superior flexibility and performance; in addition, for real-time inference with low batch size, FPGA is expected to achieve further performance improvement. However, the performance gain from a single-FPGA design is constrained by limited on-chip resources. In this paper, we employ multiple FPGAs to cooperatively run DNNs with the objective of achieving super-linear speedup over the single-FPGA design. In implementing such systems, we found two barriers that hinder us from achieving the design goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) insufficient bandwidth between the off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, "Super-LIP", which can support different kinds of DNNs. In this paper, we take the Convolutional Neural Network (CNN) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of the best partition schemes. Then, we develop a novel design methodology to effectively alleviate the heavy load on memory bandwidth by moving traffic from the memory bus to inter-FPGA links. We implement Super-LIP on ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs can achieve 3.48× speedup compared to the state-of-the-art single-FPGA design. Moreover, as the number of FPGAs scales up, the system latency can be further reduced while maintaining high energy efficiency.
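The partition idea at the heart of the abstract — splitting each DNN layer's work across multiple FPGAs so each device holds a smaller slice of weights and compute — can be illustrated with a minimal sketch. This is a hypothetical toy model, not the paper's actual Super-LIP formulation: it splits one convolutional layer's output channels across k FPGAs and compares the total multiply-accumulate (MAC) count with the worst-case per-device count. All function names and the stride-1/'same'-padding assumption are ours for illustration.

```python
# Hypothetical sketch (not the paper's system-level model): partition one CNN
# layer's output channels across k FPGAs and estimate per-device compute.

def partition_output_channels(num_channels, num_fpgas):
    """Split output channels as evenly as possible across the FPGAs."""
    base, extra = divmod(num_channels, num_fpgas)
    return [base + (1 if i < extra else 0) for i in range(num_fpgas)]

def layer_macs(h, w, cin, cout, k):
    """MAC count of a k x k conv layer (stride 1, 'same' padding assumed)."""
    return h * w * cin * cout * k * k

def per_fpga_macs(h, w, cin, cout, k, num_fpgas):
    """Worst-case per-FPGA MACs after output-channel partitioning."""
    split = partition_output_channels(cout, num_fpgas)
    return layer_macs(h, w, cin, max(split), k)

if __name__ == "__main__":
    total = layer_macs(56, 56, 64, 256, 3)      # whole layer on one FPGA
    per2 = per_fpga_macs(56, 56, 64, 256, 3, 2)  # worst slice on 2 FPGAs
    print(total // per2)
```

In this toy model the compute split is at best linear; the super-linear speedup the paper reports comes from effects the sketch deliberately omits, such as each FPGA's slice fitting in on-chip memory and traffic shifting from the memory bus to inter-FPGA links.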

