Abstract
This work introduces Remarn, a reconfigurable multi-threaded multi-core accelerator that supports both spatial and temporal co-execution of Recurrent Neural Network (RNN) inferences. It increases the processing capability and quality of service of cloud-based neural processing units (NPUs) by improving their hardware utilization and reducing latency, through two innovations. First, a custom coarse-grained multi-threaded RNN/Long Short-Term Memory (LSTM) hardware architecture that switches tasks among threads when the RNN compute engines encounter data hazards. Second, the partitioning of this hardware architecture into multiple full-fledged sub-accelerator cores, enabling spatial co-execution of multiple RNN/LSTM inferences. These innovations better exploit the available parallelism to increase runtime hardware utilization and boost design throughput. Evaluation results show that a dual-threaded quad-core Remarn NPU achieves 2.91 times higher performance while occupying only 5.0% more area than a single-threaded one on a Stratix 10 FPGA. Compared with a Tesla V100 GPU implementation, our design achieves 6.5 times better performance and 15.6 times higher power efficiency, showing that our approach contributes to high-performance and energy-efficient FPGA-based multi-RNN inference designs for datacenters.
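The coarse-grained multithreading described above can be illustrated with a minimal sketch (illustrative only — the `Thread`/`run` names and the cycle model are assumptions for exposition, not the Remarn microarchitecture): a single compute engine switches to another inference thread whenever the active thread stalls on the recurrent data hazard, i.e., while waiting for the previous time step's hidden state.

```python
# Illustrative sketch only -- not the actual Remarn microarchitecture.
# One compute engine serves several RNN inference "threads"; each issued
# time step creates a data hazard (the next step needs the new hidden
# state), so the engine switches to another ready thread instead of idling.

class Thread:
    """One RNN inference: `steps` time steps, each stalling the thread for
    `hazard` cycles while its hidden state is produced (hypothetical model)."""
    def __init__(self, steps, hazard):
        self.steps = steps
        self.hazard = hazard
        self.stall = 0

    def ready(self):
        return self.stall == 0 and self.steps > 0

def run(threads, max_cycles=1000):
    """Return (busy_cycles, total_cycles) for one engine that switches
    threads on a data hazard (coarse-grained multithreading)."""
    busy = 0
    for cycle in range(max_cycles):
        for t in threads:
            t.stall = max(0, t.stall - 1)
        runnable = [t for t in threads if t.ready()]
        if runnable:
            t = runnable[0]          # issue one time step for a ready thread
            t.steps -= 1
            t.stall = t.hazard       # recurrent dependence: wait for h_t
            busy += 1
        if all(t.steps == 0 for t in threads):
            return busy, cycle + 1
    return busy, max_cycles

# Single-threaded engine: 4 time steps, 2-cycle hazard -> engine idles.
print(run([Thread(4, 2)]))                # (4, 7): ~57% utilization
# Dual-threaded engine, same total work -> hazards hidden by switching.
print(run([Thread(2, 2), Thread(2, 2)]))  # (4, 4): 100% utilization
```

Under this toy model, interleaving two inference threads hides the recurrent-dependence stalls and raises engine utilization to 100%, which is the effect the multi-threaded architecture exploits in hardware.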
Remarn: A Reconfigurable Multi-threaded Multi-core Accelerator for Recurrent Neural Networks