Specializing FGPU for Persistent Deep Learning

Published: 15 July 2021

Abstract

Overlay architectures enable fast development and debugging on FPGAs, at the expense of potentially lower performance than fully customized FPGA designs. Used in concert with hand-tuned FPGA solutions, performant overlay architectures can improve time-to-solution and thus the overall productivity of FPGA development. This work tunes and specializes FGPU, an open-source OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our persistent deep learning (PDL)-FGPU architecture maintains the ease of programming and generality of GPU programming while achieving high performance through specialization for the persistent deep learning domain. We also propose a straightforward method for specializing the architecture for other domains. PDL-FGPU includes new instructions along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU in simulation on a modern high-end Intel Stratix 10 2800 FPGA, running persistent DL applications (RNN, GRU, LSTM) as well as non-DL applications to demonstrate generality. PDL-FGPU requires 1.4–3× more ALMs, 4.4–6.4× more M20Ks, and 1–9.5× more DSPs than the baseline, but improves performance by 56–693× on PDL applications, with an average 23.1% degradation on non-PDL applications. We also integrated the PDL-FGPU overlay into Intel OPAE to measure real-world performance and power, and demonstrate that PDL-FGPU is only 4.0–10.4× slower than the Nvidia V100.
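The "persistent" approach the abstract targets keeps the recurrent weight matrices resident in fast on-chip memory across all time steps, so each step avoids re-fetching weights from off-chip DRAM. As a hedged illustration of the computation involved (a minimal Python sketch of a vanilla RNN cell, h_t = tanh(W·x_t + U·h_{t−1} + b); this is not the paper's OpenCL kernels or hardware, and all names here are illustrative), the per-step dependence on the previous hidden state is what makes weight reuse across time steps so valuable:

```python
import math

def rnn_step(W, U, b, x, h):
    # One vanilla RNN time step: h_t = tanh(W·x_t + U·h_{t-1} + b).
    # In a persistent design, W and U stay on-chip across every step.
    n = len(h)
    return [math.tanh(sum(W[i][j] * x[j] for j in range(len(x)))
                      + sum(U[i][k] * h[k] for k in range(n))
                      + b[i])
            for i in range(n)]

# Tiny example: hidden size 2, input size 2, toy weights.
W = [[0.5, 0.0], [0.0, 0.5]]   # input weights (reused every step)
U = [[0.1, 0.0], [0.0, 0.1]]   # recurrent weights (reused every step)
b = [0.0, 0.0]
h = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0]):   # two time steps
    h = rnn_step(W, U, b, x, h)
print([round(v, 4) for v in h])
```

Because W and U are identical at every step while x and h change, stashing them on-chip (as in persistent-RNN designs) turns the inner loops into purely local-memory reads; the GRU and LSTM variants evaluated in the paper follow the same pattern with gated updates.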



Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 14, Issue 2
June 2021, 107 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3468069
Editor: Deming Chen

© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 December 2019
• Revised: 1 November 2020
• Accepted: 1 March 2021
• Published: 15 July 2021

              Qualifiers

              • research-article
              • Refereed
