
Remarn: A Reconfigurable Multi-threaded Multi-core Accelerator for Recurrent Neural Networks

Published: 22 December 2022

Abstract

This work introduces Remarn, a reconfigurable multi-threaded multi-core accelerator supporting both spatial and temporal co-execution of Recurrent Neural Network (RNN) inferences. It increases the processing capability and quality of service of cloud-based neural processing units (NPUs) by improving their hardware utilization and reducing design latency, through two innovations. First, a custom coarse-grained multi-threaded RNN/Long Short-Term Memory (LSTM) hardware architecture that switches tasks among threads when RNN compute engines encounter data hazards. Second, the partitioning of this hardware architecture into multiple full-fledged sub-accelerator cores, enabling spatial co-execution of multiple RNN/LSTM inferences. These innovations improve the exploitation of the available parallelism, increasing runtime hardware utilization and boosting design throughput. Evaluation results show that a dual-threaded quad-core Remarn NPU achieves 2.91 times higher performance while occupying only 5.0% more area than a single-threaded one on a Stratix 10 FPGA. Compared with a Tesla V100 GPU implementation, our design achieves 6.5 times better performance and 15.6 times higher power efficiency, showing that our approach contributes to high-performance and energy-efficient FPGA-based multi-RNN inference designs for datacenters.
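The coarse-grained multi-threading described above can be illustrated with a small scheduling model: when one inference thread stalls on the recurrent dependence between time steps (a data hazard), the shared compute engine switches to another ready thread instead of idling. The sketch below is a hypothetical cycle-accounting toy model, not the paper's implementation; the `simulate` function and the `compute`/`hazard` cycle counts are illustrative assumptions.

```python
# Hypothetical sketch of coarse-grained multi-threading on a shared RNN
# engine: when the active thread stalls on the recurrent dependence
# between time steps, the engine switches to another ready thread
# instead of idling. Parameter values are illustrative, not Remarn's.

def simulate(num_threads, steps, compute=4, hazard=3):
    """Return engine utilization (busy cycles / total cycles) for a
    switch-on-hazard schedule over `num_threads` RNN inferences,
    each running `steps` time steps."""
    stall = [0] * num_threads      # remaining hazard cycles per thread
    done = [0] * num_threads       # completed time steps per thread
    cycles = busy = 0
    while min(done) < steps:
        for t in range(num_threads):
            if stall[t] == 0 and done[t] < steps:
                cycles += compute          # engine computes one time step
                busy += compute
                done[t] += 1
                stall[t] = hazard          # thread now waits on its own result
                for u in range(num_threads):
                    if u != t:             # other threads' hazards drain meanwhile
                        stall[u] = max(0, stall[u] - compute)
                break
        else:
            cycles += 1                    # every thread stalled: engine idles
            for u in range(num_threads):
                stall[u] = max(0, stall[u] - 1)
    return busy / cycles

# In this toy model a second thread hides the hazard latency entirely:
# utilization rises from about 0.60 (single-threaded) to 1.0 (dual-threaded).
print(simulate(1, 10), simulate(2, 10))
```

In this toy model, a dual-threaded engine reaches full utilization because each thread's hazard is hidden behind the other thread's compute burst; this is the kind of utilization gain the abstract reports for the multi-threaded design.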


• Published in

  ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 1, March 2023, 403 pages
  ISSN: 1936-7406
  EISSN: 1936-7414
  DOI: 10.1145/35733111
  • Editor: Deming Chen


              Publisher

              Association for Computing Machinery

              New York, NY, United States

Publication History

• Received: 2 September 2021
• Revised: 18 February 2022
• Accepted: 1 May 2022
• Online AM: 17 May 2022
• Published: 22 December 2022

Published in TRETS Volume 16, Issue 1
