Abstract
The Multidimensional Long Short-Term Memory (MD-LSTM) neural network extends the one-dimensional LSTM to data with more than one dimension. MD-LSTM achieves state-of-the-art results in various applications, including handwritten text recognition and medical imaging. However, its execution is inherently sequential, which tremendously slows down both training and inference compared to other neural networks.
The main goal of this work is to accelerate MD-LSTM inference. We argue that the Field-Programmable Gate Array (FPGA) is an alternative platform for deep learning that can offer a solution when the massive parallelism of GPUs does not deliver the performance an application requires.
In this article, we present the first hardware architecture for MD-LSTM. We conduct a systematic exploration of the tradeoff between precision and accuracy on a challenging semantic segmentation task, historical document image binarization from the DIBCO 2017 contest, and on the well-known MNIST dataset for handwritten digit recognition. Based on our new architecture, we implement FPGA-based accelerators that outperform an Nvidia GeForce RTX 2080 Ti in throughput by up to 9.9× and an Nvidia Jetson AGX Xavier in energy efficiency by up to 48×. Our accelerators achieve higher throughput, energy efficiency, and resource efficiency than FPGA-based implementations of convolutional neural networks (CNNs) for semantic segmentation tasks. For handwritten digit recognition, our FPGA implementations provide higher accuracy and can be considered a solution when accuracy is a priority. Furthermore, they outperform earlier FPGA implementations of one-dimensional LSTMs in throughput, energy efficiency, and resource efficiency.
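The sequential bottleneck described above can be seen in a minimal sketch of a 2D-LSTM forward pass. This is an illustrative NumPy implementation, not the paper's architecture; the parameter names, shapes, and gate layout are assumptions following the standard MD-LSTM formulation (one forget gate per dimension). Because every cell (i, j) depends on its top neighbor (i-1, j) and left neighbor (i, j-1), cells can only be processed along diagonal wavefronts, which limits GPU utilization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def md_lstm_2d(x, params):
    """Illustrative 2D-LSTM forward pass over an H x W x d input grid.

    Each cell (i, j) reads the hidden/cell states of its top and left
    neighbours, so the loop order below cannot be fully parallelized:
    only cells on the same anti-diagonal are independent.
    """
    H, W, _ = x.shape
    n = params["Wi"].shape[0]          # hidden state size
    h = np.zeros((H + 1, W + 1, n))    # zero-padded hidden states
    c = np.zeros((H + 1, W + 1, n))    # zero-padded cell states
    for i in range(1, H + 1):
        for j in range(1, W + 1):
            # Concatenate input with the two neighbouring hidden states.
            z = np.concatenate([x[i - 1, j - 1], h[i - 1, j], h[i, j - 1]])
            i_g = sigmoid(params["Wi"] @ z + params["bi"])    # input gate
            f_u = sigmoid(params["Wfu"] @ z + params["bfu"])  # forget gate (top)
            f_l = sigmoid(params["Wfl"] @ z + params["bfl"])  # forget gate (left)
            o_g = sigmoid(params["Wo"] @ z + params["bo"])    # output gate
            g = np.tanh(params["Wg"] @ z + params["bg"])      # cell candidate
            c[i, j] = i_g * g + f_u * c[i - 1, j] + f_l * c[i, j - 1]
            h[i, j] = o_g * np.tanh(c[i, j])
    return h[1:, 1:]
```

Each weight matrix has shape (n, d + 2n), since the gates see the input plus two hidden states. In a full model, four such passes (one per scanning direction) are typically combined before the output layer.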