ELSA: A Throughput-Optimized Design of an LSTM Accelerator for Energy-Constrained Devices

Abstract
The next significant step in the evolution and proliferation of artificial intelligence technology will be the integration of neural network (NN) models within embedded and mobile systems. This calls for the design of compact, energy-efficient NN models in silicon. In this article, we present a scalable application-specific integrated circuit (ASIC) design of an energy-efficient Long Short-Term Memory (LSTM) accelerator, named ELSA, that is suitable for energy-constrained devices. It includes several architectural innovations to achieve small area and high energy efficiency. To reduce the area and power consumption of the overall design, the compute-intensive units of ELSA employ approximate multiplications while still achieving high performance and accuracy. Performance is further improved through efficient synchronization of the elastic pipeline stages to maximize their utilization. The article also includes a performance model of ELSA, expressed as a function of the number of hidden nodes and timesteps, permitting its use for the evaluation of any LSTM application. ELSA was implemented at the register-transfer level (RTL) and was synthesized, placed, and routed in 65nm technology. Its functionality is demonstrated on language modeling, a common application of LSTMs. ELSA is compared against a baseline implementation of an LSTM accelerator with standard functional units and none of ELSA's architectural innovations. The article demonstrates that ELSA achieves significant improvements in power, area, and energy efficiency over both the baseline design and several ASIC implementations reported in the literature, making it suitable for embedded systems and real-time applications.
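For readers unfamiliar with the workload, the accelerator targets the standard LSTM cell recurrence. The formulation below is the widely used one from the literature, not an equation reproduced from the article itself: for input $x_t$ and previous hidden state $h_{t-1}$ at timestep $t$,

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The four gate matrix-vector products dominate the computation, which is why the compute-intensive units are the natural target for approximate multiplication.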
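The abstract does not state which approximate-multiplication scheme ELSA uses. Purely as an illustrative sketch of the general technique, the following shows a Mitchell-style logarithmic approximate multiplier, a classic area- and power-saving substitute for an exact array multiplier; the function name and fixed-point width are our own choices, not taken from the article.

```python
def mitchell_mul(a: int, b: int, frac: int = 16) -> int:
    """Approximate a*b (unsigned ints) via Mitchell's logarithmic method.

    Writes each operand as 2^k * (1 + m) with 0 <= m < 1, approximates
    log2(a*b) as (ka + kb) + (ma + mb), and converts back without a
    full-width multiply. 'frac' is the fixed-point mantissa width.
    """
    if a == 0 or b == 0:
        return 0
    ka, kb = a.bit_length() - 1, b.bit_length() - 1     # integer parts of log2
    ma = ((a << frac) >> ka) - (1 << frac)              # mantissa of a, Q(frac)
    mb = ((b << frac) >> kb) - (1 << frac)              # mantissa of b, Q(frac)
    s = ma + mb
    if s < (1 << frac):
        p = ((1 << frac) + s) << (ka + kb)              # 2^(ka+kb) * (1 + ma + mb)
    else:
        p = s << (ka + kb + 1)                          # 2^(ka+kb+1) * (ma + mb)
    return p >> frac

assert mitchell_mul(4, 4) == 16        # exact for powers of two
assert mitchell_mul(3, 3) == 8         # true product 9; Mitchell error <= ~11%
```

Mitchell's method needs only a leading-one detector, shifts, and an addition, so in hardware it trades a bounded relative error (at most about 11%) for a large reduction in multiplier area and power, an error budget that NN inference workloads often tolerate.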
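The article's actual performance model is not reproduced in the abstract. As a hedged first-order stand-in, the compute cost of one LSTM layer can be counted directly from the recurrence above, giving a latency estimate parameterized by the number of hidden nodes and timesteps, the same two parameters the abstract names. All identifiers below are hypothetical.

```python
def lstm_layer_macs(timesteps: int, input_size: int, hidden_size: int) -> int:
    # Each timestep computes 4 gate pre-activations; each is a matrix-vector
    # product against the concatenated [x_t; h_{t-1}] vector of length
    # (input_size + hidden_size), producing hidden_size outputs.
    return timesteps * 4 * hidden_size * (input_size + hidden_size)

def estimated_cycles(timesteps: int, input_size: int,
                     hidden_size: int, macs_per_cycle: int) -> int:
    # First-order latency bound: assumes the MAC array is the bottleneck and
    # is perfectly utilized (the pipeline-synchronization claim in the
    # abstract is what pushes a real design toward this bound).
    macs = lstm_layer_macs(timesteps, input_size, hidden_size)
    return -(-macs // macs_per_cycle)  # ceiling division

# Example: 32 timesteps, 256-wide input and hidden state, 64 MACs/cycle
# -> 16,777,216 MACs, or 262,144 cycles at full utilization.
print(estimated_cycles(32, 256, 256, 64))
```

A model of this shape scales quadratically in the hidden dimension and linearly in timesteps, which is why both parameters must appear in any latency estimate for an LSTM application.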