Abstract
In recent years, Keyword Spotting (KWS) has become a crucial human–machine interface for mobile devices, allowing users to interact more naturally with their gadgets using their own voice. Due to privacy, latency, and energy requirements, executing KWS tasks on the embedded device itself rather than in the cloud has attracted significant attention from the research community. However, the constraints of embedded systems, including limited energy, memory, and computational capacity, pose a real challenge for the embedded deployment of such interfaces. In this article, we explore and guide the reader through the design of KWS systems. To support this overview, we extensively survey the approaches taken by recent state-of-the-art (SotA) work at the algorithmic, architectural, and circuit levels to enable KWS tasks on edge devices. We carry out a quantitative and qualitative comparison of relevant SotA hardware platforms, highlighting current design trends and pointing out future research directions in the development of this technology.
Hardware Acceleration for Embedded Keyword Spotting: Tutorial and Survey