Abstract
Large Vocabulary Continuous Speech Recognition systems require a Viterbi search through a large state space to find the most probable sequence of phonemes that led to a given sound sample. This requires storing and updating a large Active State List (ASL) in on-chip memory (OCM) at regular intervals (called frames), which poses a major performance bottleneck for speech decoding. Most prior works use hash tables for OCM storage and beam-width pruning to restrict the ASL size. Achieving acceptable accuracy and performance this way requires a large OCM and incurs numerous acoustic probability computations and DRAM accesses.
We propose using a binary search tree for ASL storage and a max-heap data structure to track the worst-cost state and efficiently replace it when a better state is found. With this approach, the ASL size can be reduced from over 32K to 512 with minimal impact on recognition accuracy for a 7,000-word vocabulary model. This, combined with a caching technique for acoustic scores, reduces the DRAM data accessed by 31\( \times \) and the acoustic probability computations by 26\( \times \).
The approach has also been implemented in hardware on a Xilinx Zynq FPGA at 200 MHz using the Vivado SDS compiler. We study the tradeoffs among the amount of OCM used, word error rate, and decoding speed to show the effectiveness of the approach. The resulting implementation is capable of running faster than real time with 91% fewer block RAMs.
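The bounded ASL described above can be illustrated with a minimal software sketch. It is not the authors' hardware design: the class name, method names, and capacity are hypothetical, a Python dict stands in for the binary search tree used for state lookup, and the max heap is emulated with `heapq` on negated costs. Lower cost is assumed to mean a more probable path.

```python
import heapq


class BoundedActiveStateList:
    """Fixed-capacity active state list that keeps the lowest-cost states.

    Hypothetical illustration: a dict replaces the binary search tree,
    and heapq (a min heap) holds negated costs to act as a max heap.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.cost = {}   # state_id -> best cost seen so far
        self.heap = []   # entries are (-cost, state_id)

    def _worst(self):
        """Return (state_id, cost) of the worst live state."""
        while self.heap:
            neg_cost, state_id = self.heap[0]
            if self.cost.get(state_id) == -neg_cost:
                return state_id, -neg_cost
            # Stale entry left behind by an earlier cost improvement.
            heapq.heappop(self.heap)
        return None

    def offer(self, state_id, cost):
        """Insert or improve a state; return True if the list changed."""
        current = self.cost.get(state_id)
        if current is not None:
            # Viterbi keeps only the cheapest path into each state.
            if cost >= current:
                return False
        elif len(self.cost) >= self.capacity:
            worst_id, worst_cost = self._worst()
            if cost >= worst_cost:
                return False          # no better than the worst state: drop
            del self.cost[worst_id]   # evict the worst state
            heapq.heappop(self.heap)
        self.cost[state_id] = cost
        heapq.heappush(self.heap, (-cost, state_id))
        return True


asl = BoundedActiveStateList(capacity=2)
asl.offer(10, 5.0)
asl.offer(11, 7.0)
rejected = asl.offer(12, 9.0)   # worse than the current worst state
replaced = asl.offer(12, 6.0)   # better than state 11, which is evicted
```

In this sketch a candidate that is no better than the current worst state is rejected with a single comparison against the heap top, which is the property that lets the list stay at a fixed small size instead of growing with the beam width.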
Reduced Memory Viterbi Decoding for Hardware-accelerated Speech Recognition