Abstract
In this paper, we propose a novel Low-Power Feature-Attention Chinese Keyword Spotting Framework based on a depthwise separable convolutional neural network (DSCNN) with distillation learning to recognize speech signals of Chinese wake-up words. The framework consists of a low-power feature-attention acoustic model and its learning methods. Unlike existing models, the proposed acoustic model based on connectionist temporal classification (CTC) focuses on reducing power consumption by cutting model network parameters and multiply-accumulate (MAC) operations through our designed feature-attention network and DSCNN. In particular, the feature-attention network is specially designed to extract effective syllable features from a large number of MFCC features. It refines the MFCC features by selectively focusing on important speech signal features and discarding invalid ones, which reduces the number of speech signal features and thus significantly reduces the parameters and MAC operations of the whole acoustic model. Moreover, DSCNN, which has fewer parameters and MAC operations than a traditional convolutional neural network, is adopted to extract effective high-dimensional features from the syllable features. Furthermore, we apply a distillation learning algorithm to efficiently train the proposed low-power acoustic model by utilizing the knowledge of a trained large acoustic model. Experimental results verify the effectiveness of our model and show that the proposed acoustic model achieves better accuracy than other acoustic models while having the lowest power consumption and smallest latency, as measured on an NVIDIA Jetson TX2. It has only 14.524KB of parameters, consumes only 0.141J of energy per query, and has a latency of 17.9ms on the platform, which is hardware-friendly.
1 INTRODUCTION
In recent years, with great breakthroughs in speech recognition technology based on deep learning (DL) algorithms [3, 7, 15, 37], intelligent speech interaction is increasingly becoming a natural way to interact with consumer electronic devices such as Amazon Echo, Google Home, and smartphones [36]. However, always-on speech recognition is not an energy-efficient approach, and it also brings other problems, e.g., privacy concerns, transmission latency, and audio stream transmission congestion. To alleviate these problems, consumer electronic devices first detect predefined keyword(s), such as “Hi Siri” and “Alexa,” a task commonly called keyword spotting (KWS). The device wakes up on a KWS detection and then activates full-scale speech recognition. Because KWS is always-on, it should have very low power consumption, high accuracy, and low latency to improve the consumer experience and prolong the battery life of electronic devices. These requirements keep KWS research challenging.
Existing methods for KWS mostly fall into three categories. The first is based on large-vocabulary continuous speech recognition [10]. This method can flexibly change keywords, but it requires large amounts of data and computing resources. The second is based on query by example [6, 21]. This is commonly applied to low-resource KWS, but its performance is often unsatisfactory. The third is based on acoustic models, and these methods can be further divided into classical and neural network-based approaches. Classical methods are based on keyword/filler hidden Markov models [23]. Although the above methods have achieved some progress in KWS, their performance is not satisfactory. With the rapid development of deep learning, neural network-based methods have become very popular and have greatly improved the performance of KWS [33]. Several deep neural networks have been used to build acoustic models for KWS, such as convolutional neural networks (CNN) [24] and recurrent neural networks (RNN) [8]. Moreover, since it removes the need to pre-align the training data, the connectionist temporal classification (CTC) criterion [11] is widely used to train acoustic models.
Recently, most work on KWS has focused on neural network-based acoustic models trained with the CTC criterion [2, 17], mainly designing novel deep neural network structures for the acoustic model. The key requirements for low-power acoustic models are few parameters and few multiply-accumulate (MAC) operations. This is because numerous network parameters cause excessive data transfer between the CPU and memory, and MAC operations consume power in the arithmetic units; both also increase KWS system latency. Meanwhile, acoustic models with huge numbers of parameters cannot be deployed on devices with limited hardware resources.
To compress the model, some acoustic models based on depthwise separable convolutional neural networks (DSCNN) have been proposed to detect keyword(s) [25, 32]. However, most of them simply use DSCNN to extract high-dimensional features from all speech features without considering the different importance weights of those features. For example, speech features contain useless components, such as mute or noise features. A natural idea is to retain effective keyword speech features and remove useless ones with an attention mechanism. Besides, DSCNN has mainly been applied to English KWS with good results, while comparable results are difficult to achieve in Chinese KWS. This is because Mandarin Chinese is a tonal language with five tones [4]. Each Chinese character is a basic language unit and can be phonetically represented by a syllable [31], and an English word often corresponds phonetically to one or more Chinese characters. Therefore, the choice of modeling unit is significant for KWS in different languages [33].
Motivated by the above discussions, we propose a novel Low-Power Feature-Attention Chinese Keyword Spotting Framework based on DSCNN using distillation learning, called LF-CKSF, to recognize speech signals of Chinese wake-up words. The framework consists of a low-power feature-attention acoustic model and its learning methods. The modeling unit of the proposed acoustic model is a syllable. Unlike the existing model, the proposed CTC-based acoustic model focuses on reducing power consumption by reducing model network parameters and MAC operations through a customized feature-attention network and DSCNN. In particular, we design the feature-attention network to extract useful syllable features from MFCC features by selectively focusing on important speech signal features to refine the MFCC features.
Moreover, DSCNN, with fewer parameters and MAC operations than CNN, extracts effective high-dimensional features from the syllable features. Besides, since the CTC criterion with redundant symbols does not use pairs of frame-level input features and output labels, the CTC loss is challenging to train stably. This drawback is amplified when training a low-power acoustic model with few neural network parameters. Considering this drawback of CTC, we design a distillation learning algorithm, motivated by [12], to efficiently and precisely train the proposed low-power acoustic model by utilizing the knowledge of a trained large acoustic model in a simpler and more hardware-friendly way. Different from some existing distillation learning algorithms, we utilize the trained knowledge of a large model to assist the training of a low-power model in syllable space. To the best of our knowledge, there is no prior work applying a distillation learning algorithm to Chinese acoustic feature-attention in end-to-end speech recognition models. For example, [27] trained multiple large teacher DNNs and formed an ensemble model by averaging their posteriors; the student DNN was then trained with the weighted criterion.
In summary, the main contributions in this paper are listed as follows:
A novel Low-Power Feature-Attention Chinese Keyword Spotting Framework based on DSCNN with distillation learning, named LF-CKSF, is proposed to recognize speech signals of Chinese wake-up words. The framework consists of a low-power feature-attention acoustic model and its learning methods. This acoustic model based on CTC still has low power consumption while maintaining high accuracy.
The proposed acoustic model mainly consists of a designed feature-attention network and DSCNN. The feature-attention network extracts an effective syllable feature from adjacent MFCC features using an attention mechanism to reduce the number of speech signal features. This model differs from attention-based sequence-to-sequence translation models with an encoder-decoder framework, which are hard to train [34]. Besides, DSCNN has fewer parameters and MAC operations for extracting effective high-dimensional features.
A distillation learning algorithm is designed to efficiently and correctly train the proposed low-power acoustic model by utilizing the knowledge of the trained large acoustic model in syllable space, different from sequence-level or frame-level knowledge distillation [26].
Experimental results show that our LF-CKSF, with only \(14.524KB\) of parameters, achieves state-of-the-art accuracy of \(99.0\%\) on the test set of the HI-MIA Chinese wake-up word dataset. Our model achieves higher accuracy with fewer network parameters and MAC operations. More importantly, our model has the smallest latency and lowest power among baseline methods on the NVIDIA Jetson TX2 embedded AI platform [1], which makes it hardware-friendly.
The rest of this paper is organized as follows. Section 2 introduces the background. Section 3 gives the design procedure of the proposed LF-CKSF, including the overall design of LF-CKSF, low-power feature-attention acoustic model based on DSCNN, and distillation learning algorithm. In Section 4, experiments are carried out to evaluate our proposed model LF-CKSF, including experiment setting, implementation specification, and results. In the results part, we present quantitative evaluation, qualitative evaluation, and experimental results on NVIDIA JETSON TX2. Finally, conclusions are summarized in Section 5.
2 BACKGROUND
2.1 Speech Signal Feature
Typically, Mel-frequency cepstral coefficients (MFCC) [19] and filter bank (FBank) [9] features are the two mainstream inputs to deep learning-based acoustic models. MFCC, which captures low-frequency information well, differs from FBank by one additional discrete cosine transform (DCT) step. We use MFCC for the following reasons: (1) most of the energy in the human voice signal is concentrated in the low-frequency part; (2) since adjacent sliding windows overlap, the DCT can perform dimensionality reduction and de-correlation on the data; and (3) the DCT has no imaginary part, which makes it more computationally economical than FBank. The extraction process consists of pre-emphasis, framing, windowing, fast Fourier transform (FFT), Mel filtering, a log2 operation, and the DCT. In this paper, we use MFCC features as the speech signal input of the acoustic models.
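The pipeline above can be sketched end-to-end in a few lines. This is a minimal illustrative NumPy implementation, not the paper's feature extractor: the triangular filterbank here is spread linearly rather than on the Mel scale, and the 32ms/16ms framing and 10-dimensional output follow the settings given later in Section 4.2.

```python
# Minimal sketch of the MFCC pipeline: pre-emphasis, framing, windowing,
# FFT, (toy) Mel filtering, log2, and DCT. NumPy only; the linearly
# spaced triangular filterbank is an illustrative simplification.
import numpy as np

def frame_signal(signal, frame_len, frame_shift):
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :] +
           frame_shift * np.arange(n_frames)[:, None])
    return signal[idx]

def dct_ii(x, n_coeffs):
    # Type-II DCT along the last axis, keeping only n_coeffs coefficients.
    N = x.shape[-1]
    n = np.arange(N)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * np.arange(n_coeffs)[:, None] / N)
    return x @ basis.T

def mfcc(signal, sr=16000, frame_ms=32, shift_ms=16, n_mels=26, n_mfcc=10):
    frame_len = int(sr * frame_ms / 1000)
    frame_shift = int(sr * shift_ms / 1000)
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frames = frame_signal(emphasized, frame_len, frame_shift)
    frames = frames * np.hamming(frame_len)              # windowing
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # FFT power spectrum
    # Toy triangular filterbank (real MFCC places filters on the Mel scale).
    n_bins = power.shape[1]
    centers = np.linspace(0, n_bins - 1, n_mels + 2)
    fbank = np.zeros((n_mels, n_bins))
    bins = np.arange(n_bins)
    for m in range(1, n_mels + 1):
        l, c, r = centers[m - 1], centers[m], centers[m + 1]
        fbank[m - 1] = np.clip(np.minimum((bins - l) / (c - l + 1e-9),
                                          (r - bins) / (r - c + 1e-9)), 0, None)
    log_mel = np.log2(power @ fbank.T + 1e-10)           # the paper's log2 step
    return dct_ii(log_mel, n_mfcc)

feats = mfcc(np.random.randn(16000))  # one second of noise at 16 kHz
```

With 512-sample frames and a 256-sample shift, one second of 16 kHz audio yields 61 frames of 10 coefficients each.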
2.2 Depthwise Separable Convolution Neural Network
Traditional approaches to KWS are based on hidden Markov models (HMM) [22]. With advances in deep learning (DL) and access to huge amounts of data, deep learning-based acoustic models have more potential than traditional HMM models [5]. However, HMMs and plain DNNs can only exploit a fixed number of frames of contextual information. In recent years, CNN [24] and RNN [28] have become two important approaches for exploiting variable-length contextual information in acoustic models.
A DSCNN [13] is simpler and more efficient than a typical CNN: it explicitly factorizes a convolution into a depthwise convolution layer and a pointwise convolution layer. It captures cross-channel and spatial correlations while maintaining high accuracy with fewer parameters. The network first convolves the input with two-dimensional per-channel filters to learn spatial correlations and then convolves the previous layer's output with pointwise filters to acquire channel correlations. In addition, the ratio of the number of MAC operations of a DSCNN to that of a typical CNN can be approximated as \(1/K^{2}\), typically with \(K=3\) [18].
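The \(1/K^{2}\) approximation can be checked with a back-of-envelope calculation: the exact MAC ratio is \(1/c_{out} + 1/K^{2}\), which approaches \(1/K^{2}\) as the number of output channels grows. The sketch below makes this concrete; the layer sizes are arbitrary assumptions.

```python
# MAC counts for a standard convolution versus its depthwise separable
# factorization, over an h x w feature map with c_in -> c_out channels
# and a k x k kernel (stride 1, same padding). Pure Python.
def conv_macs(h, w, c_in, c_out, k):
    # Standard conv: each output pixel needs k*k*c_in MACs per output channel.
    return h * w * c_out * k * k * c_in

def dsconv_macs(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 conv mixing channels
    return depthwise + pointwise

h, w, c_in, c_out, k = 32, 32, 64, 64, 3
ratio = dsconv_macs(h, w, c_in, c_out, k) / conv_macs(h, w, c_in, c_out, k)
# ratio = 1/c_out + 1/k^2 ≈ 0.127 here, close to 1/9 for large c_out
```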
2.3 Connectionist Temporal Classification
Connectionist temporal classification (CTC) [11] is a loss function for sequence labeling problems. Traditional sequence annotation algorithms require acoustic frames and output symbols to be perfectly aligned at every moment, which is prohibitively time- and labor-consuming. Besides, expertise is essential, as the labels must be defined consistently with lexicon and language rules. In contrast, CTC extends the label collection by adding a blank label. A label sequence can then be represented by the set of all CTC paths that map to it. The mapping function first removes adjacent duplicate labels and then removes blank labels. All predicted sequences that the mapping function converts into the target sequence count as correct predictions. In other words, the predicted sequence can be obtained without data alignment in one step.
The objective function maximizes the probability sum of all prediction sequences that map to the target label sequence. For an input sequence \(X = (x_{1}, \ldots ,x_{T})\), the objective is (1) \(\begin{equation} p(y|X) = \sum _{\hat{y}\in \Omega (y^{\prime })} P(\hat{y}|X) = \sum _{\hat{y}\in \Omega (y^{\prime })} \prod _{t=1}^{T} P(\hat{y}_{t}|x_{t}), \end{equation}\)
where \(y^{\prime }\) is the label collection extended with the blank label, \(\Omega (y^{\prime })\) denotes all possible label paths that map to \(y\), and \(P(\hat{y}_{t}|x_{t})\) is obtained from the neural network.
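The CTC mapping function described above (merge adjacent duplicates, then drop blanks) is easy to state in code; the sketch below uses '-' as a stand-in for the blank label.

```python
# Sketch of the CTC mapping function: collapse adjacent duplicate labels,
# then remove blanks. Any CTC path that collapses to the target sequence
# counts toward the probability sum in Eq. (1).
def ctc_collapse(path, blank='-'):
    collapsed = []
    prev = None
    for label in path:
        if label != prev:                 # remove adjacent duplicates
            collapsed.append(label)
        prev = label
    return [l for l in collapsed if l != blank]   # remove blanks

# Both paths below map to the same target ['ni3', 'hao3']:
ctc_collapse(['ni3', 'ni3', '-', 'hao3'])        # → ['ni3', 'hao3']
ctc_collapse(['-', 'ni3', '-', 'hao3', 'hao3'])  # → ['ni3', 'hao3']
```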
3 APPROACH
This section demonstrates a new low-power feature-attention acoustic model architecture based on DSCNN using distillation learning, LF-CKSF, to recognize speech signals of Chinese wake-up words. Firstly, the overall design of the proposed LF-CKSF is given. Next, the low-power feature-attention acoustic model is presented in detail. Finally, the training algorithm with distillation learning for the acoustic model is shown.
3.1 Overall Design of LF-CKSF
The overall design of LF-CKSF mainly consists of our proposed low-power feature-attention acoustic model and a distillation learning algorithm. The low-power feature-attention acoustic model establishes a mapping between MFCC features and Chinese syllables instead of phonemes. This model can effectively reduce the model's network parameters and MAC operations while maintaining good performance. In particular, we consider that a specific number of MFCC features correspond to one syllable, and that each MFCC feature has a different importance weight for the syllable; for example, mute or noisy speech signal features are useless. Therefore, we design a feature-attention network to model the importance weights of MFCC features for syllables. Through the feature-attention network, we obtain syllable features that are far fewer than the MFCC features. Subsequently, we adopt DSCNN to process the syllable features and extract high-dimensional features. In addition, due to the limited learning ability of a small neural network, it is difficult to train it stably to good performance with few parameters by directly minimizing the CTC loss, which entails tedious hyperparameter tuning. We therefore apply a distillation learning algorithm that updates our proposed acoustic model by utilizing the knowledge of a larger, already-trained model with more parameters and better performance. In the following, we present the network structure of our proposed LF-CKSF and its distillation learning algorithm in detail.
3.2 Low-Power Feature-Attention Acoustic Model based on DSCNN
The model network structure of the proposed LF-CKSF based on DSCNN is shown in Figure 1. The whole acoustic model maps MFCC features to syllables: its inputs are MFCC features, and its output is a syllable posterior probability matrix. Specifically, the feature-attention network is designed to extract one syllable feature from multiple MFCC features with an attention mechanism, and DSCNN is adopted to extract effective high-dimensional features from the syllable features. The LF-CKSF network works as follows. First, the feature-attention network is applied. As shown on the left side of Figure 1, a large number of MFCC features (n MFCC features) are transformed into a few syllable features (\(n/k\) syllable features) through the feature-attention network, where n is divisible by k. In this paper, we assume that k MFCC features correspond to one syllable feature. We let \(x_{i}\) denote the i-th MFCC feature and \(h_{j}\) the j-th syllable feature (\(i = 1, \ldots ,n\); \(j = 1, \ldots ,n/k\)). In the feature-attention network, we first calculate the attention weights \(\alpha = \lbrace \alpha _{1}, \ldots ,\alpha _{n}\rbrace\) of the MFCC features as follows (\(\alpha _{i}\) denotes the attention weight of the i-th MFCC feature): (2) \(\begin{equation} \alpha = Softmax(FC(Relu(FC(Relu(FC(x)))))), \end{equation}\) where \(x=\lbrace x_{1}, \ldots ,x_{n}\rbrace\) represents the MFCC features, FC is a fully connected layer, Relu is the rectified linear unit activation function, and the Softmax function normalizes the scores within every group of k MFCC features. These attention weights indicate the importance of each MFCC feature. Next, the MFCC features are multiplied by the attention weights, yielding MFCC attention features \(\hat{x} = multiply(x,\alpha)\), where \(\hat{x} = \lbrace \hat{x}_{1}, \ldots ,\hat{x}_{n}\rbrace\) and \(\hat{x}_{i} = multiply(x_{i},\alpha _{i})\).
Finally, every group of k MFCC attention features is summed to obtain one syllable feature, i.e., \(h_{j} = \sum _{i=(j-1)k+1}^{jk}\hat{x}_{i}\), so each syllable feature is extracted from k MFCC features by selectively focusing on the important ones.
Fig. 1. Low-power feature-attention acoustic model network structure.
We can obtain effective syllable features through the feature-attention network. The syllable features discard useless MFCC features, and their number is \(1/k\) of the number of MFCC features in the time dimension, which effectively reduces the parameters of the acoustic model. The choice of k is discussed in Section 4.3 and Figure 5. After obtaining the syllable features, DSCNN extracts high-dimensional features from them through convolution operations. Finally, two FC layers map the DSCNN feature space to the syllable posterior probability matrix space. Given the syllable posterior probability matrix generated by the acoustic model, a transcription can be produced through decoding strategies for CTC acoustic models [35].
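As a rough sketch, the feature-attention computation above can be written in NumPy as follows. The three FC layer widths and the random weights are illustrative assumptions; in the real model the weights are trained, and DSCNN layers would follow the output.

```python
# NumPy sketch of the feature-attention network of Eq. (2): three FC
# layers with ReLU produce a score per MFCC frame, scores are softmax-
# normalized within each group of k frames, multiplied into the frames,
# and each group is summed to one syllable feature.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 56, 10, 8                  # n MFCC frames, d dims, k frames/syllable

def relu(z):
    return np.maximum(z, 0.0)

def feature_attention(x, k):
    # x: (n, d) MFCC features. FC weights are random for this sketch.
    W1 = rng.normal(size=(x.shape[1], 16))
    W2 = rng.normal(size=(16, 16))
    W3 = rng.normal(size=(16, 1))
    scores = relu(relu(x @ W1) @ W2) @ W3        # (n, 1): one score per frame
    scores = scores.reshape(-1, k)               # group every k frames
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax within each group
    alpha = alpha.reshape(-1, 1)                 # back to (n, 1)
    x_att = x * alpha                            # weight each MFCC frame
    return x_att.reshape(-1, k, x.shape[1]).sum(axis=1)  # (n/k, d)

x = rng.normal(size=(n, d))
h = feature_attention(x, k)          # 56 MFCC frames -> 7 syllable features
```

Note that the grouped softmax makes each group's weights sum to one, so summing within a group is a convex combination of its k frames.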
3.3 Distillation Learning Algorithm
For the training of the proposed LF-CKSF, we design a distillation learning algorithm that utilizes the knowledge of a trained large model to assist the training of the small model. This method speeds up the convergence of model training and enables the small model to achieve performance comparable to the large model. The designed distillation learning algorithm is illustrated in Figure 2, where the black arrows indicate the direction of data flow and the red arrows the direction of gradient flow. First, a large acoustic model with abundant neural network parameters is trained to good performance by minimizing its CTC loss. Then, a small acoustic model with few network parameters is updated by minimizing the total loss L, (3) \(\begin{equation} L = \lambda L_{ce} + (1-\lambda) L_{ctc}, \end{equation}\) where \(\lambda \in [0,1]\) is a hyperparameter, \(L_{ctc}\) is the CTC loss of the small acoustic model, and \(L_{ce} = CrossEntropyLoss(y_{l},y_{s})\) is the cross-entropy loss between the syllable posterior probability matrix \(y_{l}\) output by the large acoustic model and the syllable posterior probability matrix \(y_{s}\) output by the small acoustic model. Herein, \(y_{l}\) serves as a soft ground truth for \(y_{s}\), which is how the knowledge of the large model guides the training of the low-power acoustic model. This is the core of the distillation learning algorithm.
Fig. 2. Illustration of the distillation learning algorithm.
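The total loss in Eq. (3) can be sketched as follows. This is an illustrative NumPy version under the assumption that the teacher's posteriors and the student's CTC loss value are already available; a real implementation would compute \(L_{ctc}\) with a CTC forward pass (e.g. TensorFlow's `tf.nn.ctc_loss`) and backpropagate through the student only.

```python
# NumPy sketch of the distillation objective L = λ·L_ce + (1-λ)·L_ctc.
# The teacher's syllable posteriors y_l act as soft targets for the
# student's posteriors y_s; the CTC term is passed in as a scalar here.
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(logits_student, probs_teacher, ctc_loss, lam=0.1):
    # Cross entropy between teacher posteriors (soft labels) and student.
    y_s = softmax(logits_student)
    l_ce = -np.mean(np.sum(probs_teacher * np.log(y_s + 1e-12), axis=-1))
    return lam * l_ce + (1.0 - lam) * ctc_loss

rng = np.random.default_rng(0)
T, C = 7, 5                              # 7 syllable frames, 5 output symbols
student_logits = rng.normal(size=(T, C))
teacher_probs = softmax(rng.normal(size=(T, C)))
loss = distillation_loss(student_logits, teacher_probs, ctc_loss=1.3)
```

The paper sets λ = 0.1 (Section 4.2), so the CTC term dominates and the teacher's soft targets act as a regularizer.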
4 EXPERIMENTS
4.1 Experiment Setting
We use the HI-MIA Chinese wake-up word dataset [20] for the low-power acoustic model experiments. HI-MIA is a wake-up word database for the smart home scene, collected in a real home environment using microphone arrays and Hi-Fi microphones. The dataset is divided into train (254 people), dev (42 people), and test (44 people) subsets, and the wake-up word is “ni3 hao3 mi3 ya3”; we call this Chinese HI-MIA (CHM). The low-power acoustic model is trained to recognize the CHM speech signal and output the correct syllable sequence. The acoustic models are built and trained in the TensorFlow framework. In addition, we build other small acoustic models with fewer than 100KB of parameters as baselines, including Bi-directional Long Short-Term Memory (BiLSTM) [14], a Convolutional Neural Network (CNN), and a Depthwise Separable Convolutional Neural Network (DSCNN), as well as an Attention-DSCNN with large parameters. The Attention-DSCNN with small parameters has the same model structure as our LF-CKSF but without distillation learning, for a fair comparison. Details of these models are shown in Figure 3. These baseline models are updated by minimizing their own CTC loss without distillation learning.
Fig. 3. Models in experiments.
4.2 Implementation Specification
With a batch size of 100, the acoustic models are trained on 650,000 training samples with a learning rate of 0.0008 and the Adam optimizer [16]. The trained models are evaluated by their accuracy on 1,000 test, 1,000 dev, and 1,000 train samples. The computing server is equipped with two AMD EPYC 7742 CPUs and six NVIDIA GeForce RTX 3090 GPUs. The hyperparameter \(\lambda\) is set to 0.1. For all acoustic models, we use 10-dimensional MFCC features extracted from speech frames of length 32ms with a frame shift of 16ms, and redundant input data are truncated. The output dimension of the acoustic model is 5, covering the syllables ni3, hao3, mi3, ya3 and blank. The parameter settings of the training process and LF-CKSF are shown in Table 1.
4.3 Results
Quantitative Evaluation. Table 2 summarizes the accuracy, memory size, and MAC operations of our proposed model and other acoustic models on the CHM dataset. The accuracy shown in the table covers the train, dev, and test sets. The memory figures assume 32-bit weights and activations, which is sufficient to achieve the same accuracy as a full-precision network. The MAC operation column counts the total number of MACs in the entire acoustic model network, which reflects the model's computational complexity. Both are essential factors for power consumption. As shown in Table 2, BiLSTM and CNN have poor accuracy as their parameters decrease, due to the limited coding capacity of acoustic models with limited parameters. In contrast, DSCNN has higher accuracy, \(87.3\%\) on the test set, because it is designed to capture cross-channel and spatial correlations with few network parameters. When we combine our proposed feature-attention network with DSCNN to obtain Attention-DSCNN (Small) (i.e., the network of the acoustic model we propose, without distillation learning), it achieves better accuracy than DSCNN on the train, dev, and test sets, reaching \(95.3\%\), \(95.6\%\), and \(90.6\%\), respectively. More importantly, it has fewer network parameters and MAC operations than BiLSTM, CNN, and DSCNN while maintaining high accuracy, which fully validates the effectiveness of the proposed feature-attention network. Attention-DSCNN (Large) reaches high accuracy at the price of many more parameters and MAC operations.
| Neural Network Acoustic Model | Memory | MAC Operations | Train Acc. | Dev Acc. | Test Acc. |
|---|---|---|---|---|---|
| BiLSTM | 16.852KB | 12676 | 0.040 | 0.014 | 0.031 |
| CNN | 17.908KB | 8834 | 0.671 | 0.459 | 0.264 |
| DSCNN | 19.128KB | 9348 | 0.930 | 0.926 | 0.873 |
| Attention-DSCNN (Large) | 243.260KB | 120748 | 1.000 | 1.000 | 1.000 |
| Attention-DSCNN (Small) | 14.524KB | 7112 | 0.953 | 0.956 | 0.906 |
| LF-CKSF (ours) | 14.524KB | 7112 | 1.000 | 0.997 | 0.990 |
Table 2. Experimental Results of Our Proposed Model and Other Acoustic Models
Meanwhile, when Attention-DSCNN is trained through our designed distillation learning algorithm (i.e., our proposed LF-CKSF), it achieves higher accuracy than Attention-DSCNN (Small) under the same network parameters and MAC operations, reflecting the benefit of the designed distillation algorithm. In general, our proposed LF-CKSF shows better accuracy than the other small acoustic models with fewer network parameters and MAC operations, reaching \(99\%\) test accuracy with only 14.524KB of parameters, far below the 100KB parameter budget common for IoT devices. These gains come from LF-CKSF's feature-attention network, which extracts syllable features from MFCC features and effectively reduces the whole acoustic model's network parameters and MAC operations while maintaining good accuracy. Thus, the experimental results fully demonstrate the effectiveness of our proposed LF-CKSF.
Moreover, we evaluate the false positive rate (FPR, i.e., false alarm rate) and false negative rate (FNR, i.e., miss rate) of the different acoustic models on 200 data samples composed of 160 CHM positive samples and 40 negative samples from the Google speech commands dataset [30]. The evaluation results, shown in Table 3, demonstrate that our acoustic model LF-CKSF outperforms the other models in terms of FPR and FNR, achieving an FPR of \(3\%\) and an FNR of \(1.2\%\).
Furthermore, we evaluate the performance of the different CTC acoustic models under different decoding strategies, including greedy search and beam search [35]. We use six decoding strategies: greedy search and five beam searches with beam widths of \(1, 2, 3, 4, 5\), respectively. The mean test accuracy is shown in Figure 4. Our proposed LF-CKSF still achieves \(99.6\%\) mean test accuracy across the different decoding strategies on the 1,000-sample test set, clearly superior to BiLSTM, CNN, and DSCNN. This also reflects the stability and extensibility of our model.
Fig. 4. Mean test accuracy of different CTC acoustic models using different decoding strategies on 1000 test dataset.
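The two decoding strategies compared in Figure 4 can be sketched as follows: greedy search takes the per-frame argmax, while a naive beam search keeps the top-B partial paths by path probability; both then apply the CTC collapse. This is an illustrative sketch (symbol index 0 is assumed to be the blank), not the decoder of [35].

```python
# Greedy and naive beam-search decoding over a frame-by-symbol posterior
# matrix, followed by the CTC collapse (merge repeats, drop blanks).
import numpy as np

def collapse(path, blank=0):
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

def greedy_decode(probs):
    # Take the most likely symbol at every frame, then collapse.
    return collapse(probs.argmax(axis=1))

def beam_decode(probs, beam_width=3):
    # Keep the top-B full paths by product probability, then collapse
    # the best one. (A production decoder would merge collapsed prefixes.)
    beams = [((), 1.0)]
    for frame in probs:
        candidates = [(path + (s,), p * frame[s])
                      for path, p in beams for s in range(len(frame))]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return collapse(beams[0][0])

# Toy posterior matrix: 5 frames, 3 symbols (0 = blank).
probs = np.array([[0.6, 0.3, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
greedy, beam = greedy_decode(probs), beam_decode(probs)  # both → [1, 2]
```

With beam width 1 this beam search reduces exactly to greedy search, since the per-frame probabilities factorize.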
k Value Analysis. How many MFCC features should correspond to one syllable feature, i.e., what should the value of k be? We consider that k should be related to the speech signal length of a syllable: it can be neither too large nor too small. If k is too small, one syllable feature will not cover all the speech frames of a syllable, and the parameter compression of the model will be weakened. If k is too large, one syllable feature may span the speech signals of multiple syllables, which hinders the extraction of a single syllable's features. Since the value of k is significant, we analyze the mean test accuracy of the proposed LF-CKSF for different k values. The results are shown in Figure 5. The proposed model shows the highest test accuracy, \(99.9\%\), at \(k=8\). Therefore, the value of k affects the test accuracy of the proposed model and should be chosen according to the specific keyword. In addition, the value of k does not affect the number of parameters of the model network, but it does affect the number of MAC operations.
Fig. 5. Mean test accuracy of different k values.
Qualitative Evaluation. We qualitatively analyze and verify the effectiveness of our model by dissecting a speech signal transformation process. Figure 6 illustrates the transformation of a “HI-MIA” speech signal, including the speech signal, MFCC feature, syllable feature, and syllable posterior probability matrix of our wake-up words “HI-MIA”. In the transformation process, the speech signal is first processed into an MFCC feature with 56 frames, each of 10 dimensions. The MFCC feature is then transformed into a syllable feature with only 7 frames in the time dimension through the designed feature-attention network. After that, the syllable feature is mapped into the syllable posterior probability matrix space. The figure shows that the syllable feature is a refined version of the MFCC feature: it reduces the number of features in the time dimension while retaining the useful speech information. The syllable posterior probability matrix also indicates that the probability of the speech signal being “ni3 hao3 mi3 ya3” is very high, which can be confirmed through the decoding strategies. Moreover, we further analyze why compressing the acoustic model with the feature-attention network is reasonable. First, MFCC features contain redundant information about the speech signal due to the overlapping sliding windows between adjacent frames. Second, MFCC features cover almost the whole signal, including noise or mute segments, which are useless for speech recognition. Hence, the proposed feature-attention network is designed to attend to useful speech signals and remove useless, redundant ones, playing a major role in compressing the acoustic model.
Fig. 6. Illustration of a “HI-MIA” speech signal transformation process.
Experimental Results on NVIDIA Jetson TX2. To evaluate the latency and power consumption of the different models, we run them on the NVIDIA Jetson TX2 embedded AI platform [1] shown in Figure 7. The Jetson TX2 is an embedded AI computing device with exceptional speed and power efficiency: it brings AI computing to the edge with a 256-core NVIDIA Pascal GPU, up to 8 GB of memory with \(59.7\,GB/s\) of memory bandwidth, and a wide range of standard hardware interfaces. The results are shown in Table 4. The measured indicators include the average query latency of each model on a test instance (Latency/q (ms)), the energy per query (Energy/q (J)), and the peak power draw during the experimental run (Peak Power (W)). The results demonstrate that our proposed model outperforms the other models in terms of latency and power consumption: it spends \(17.9\,ms\) and consumes \(0.141\,J\) of energy per query at about \(8.32\,W\), giving a faster response time and lower power consumption than the other models. However, BiLSTM has the lowest peak power among these models, because it processes one frame at a time rather than a whole utterance, so less data moves between memory and the MAC units. Much lower-power application-specific integrated circuit (ASIC) and system-on-chip (SoC) designs do exist; for example, [29] draws 140 nW with 348 ms latency in 180 nm technology, and [25] draws 510 nW with 64 ms latency in 28 nm technology. Our work, however, deploys the model on an embedded platform rather than designing dedicated hardware. Even so, our latency is the smallest, even compared with ASIC or SoC chips that heavily optimize parameter reads, writes, and computation.
Fig. 7. Illustration of NVIDIA JETSON TX2.
5 CONCLUSION
In this paper, LF-CKSF is proposed to recognize speech signals of Chinese wake-up words. It achieves low power consumption and low latency by reducing model network parameters and MAC operations. Specifically, we design a feature-attention network to reduce the number of speech signal features, which dramatically decreases the parameters and MAC operations of the whole acoustic model. The feature-attention network extracts effective syllable features from a large number of MFCC features, refining them by selectively focusing on important speech signal features and removing invalid ones. Moreover, DSCNN, with fewer parameters and MAC operations, is adopted to extract effective high-dimensional features from the syllable features. Furthermore, a distillation learning algorithm is specially designed to train the proposed low-power acoustic model by utilizing the knowledge of the trained large acoustic model in syllable space. Experimental results show that our proposed model achieves better accuracy than other small acoustic models under low power consumption and low latency.
REFERENCES
- [1] 2021. Jetson TX2 module. https://developer.nvidia.com/embedded/jetson-tx2.
- [2] 2017. Convolutional recurrent neural networks for small-footprint keyword spotting. arXiv preprint arXiv:1703.05390 (2017).
- [3] 2021. Bi-directional long short-term memory model with semantic positional attention for the question answering system. Transactions on Asian and Low-Resource Language Information Processing 20, 5 (2021), 1–13.
- [4] 2000. Large vocabulary Mandarin speech recognition with different approaches in modeling tones. In Sixth International Conference on Spoken Language Processing.
- [5] 2014. Small-footprint keyword spotting using deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4087–4091.
- [6] 2015. Query-by-example keyword spotting using long short-term memory networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5236–5240.
- [7] 2021. Developing a Vietnamese tourism question answering system using knowledge graph and deep learning. Transactions on Asian and Low-Resource Language Information Processing 20, 5 (2021), 1–18.
- [8] 2007. An application of recurrent neural networks to discriminative keyword spotting. In International Conference on Artificial Neural Networks. Springer, 220–229.
- [9] 2004. Empirical mode decomposition as a filter bank. IEEE Signal Processing Letters 11, 2 (2004), 112–114.
- [10] 1993. Application of large vocabulary continuous speech recognition to topic and speaker identification using telephone speech. In 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2. IEEE, 471–474.
- [11] 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning. 369–376.
- [12] 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
- [13] 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
- [14] 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
- [15] 2016. A review on automatic speech recognition architecture and approaches. International Journal of Signal Processing, Image Processing and Pattern Recognition 9, 4 (2016), 393–404.
- [16] 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- [17] 2016. An end-to-end architecture for keyword spotting and voice activity detection. arXiv preprint arXiv:1611.09405 (2016).
- [18] 2020. A high-speed low-cost CNN inference accelerator for depthwise separable convolution. In 2020 IEEE International Conference on Integrated Circuits, Technologies and Applications (ICTA). IEEE, 63–64.
- [19] 2000. Mel frequency cepstral coefficients for music modeling. In Ismir, Vol. 270. Citeseer, 1–11.
- [20] 2019. HI-MIA: A far-field text-dependent speaker verification database and the baselines. arXiv:cs.SD/1912.01231.
- [21] 2014. High-performance query-by-example spoken term detection on the SWS 2013 evaluation. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7819–7823.
- [22] 1989. Continuous hidden Markov modeling for speaker-independent word spotting. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, 627–630.
- [23] 1990. A hidden Markov model based keyword recognition system. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, 129–132.
- [24] 2015. Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association.
- [25] 2020. 14.1 A 510nW 0.41V low-memory low-computation keyword-spotting chip using serial FFT-based MFCC and binarized depthwise separable convolutional neural network in 28nm CMOS. In 2020 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 230–232.
- [26] 2018. An investigation of a knowledge distillation method for CTC acoustic models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5809–5813.
- [27] 2016. Model compression applied to small-footprint keyword spotting. In Interspeech. 1878–1882.
- [28] 2018. Keyword spotting based on CTC and RNN for Mandarin Chinese speech. In 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 374–378.
- [29] 2021. 12.1 A 148nW general-purpose event-driven intelligent wake-up chip for AIoT devices using asynchronous spike-based feature extractor and convolutional neural network. In 2021 IEEE International Solid-State Circuits Conference (ISSCC), Vol. 64. IEEE, 436–438.
- [30] 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209 (2018).
- [31] 2007. Context dependent syllable acoustic model for continuous Chinese speech recognition. In Eighth Annual Conference of the International Speech Communication Association.
- [32] 2020. Depthwise separable convolutional ResNet with squeeze-and-excitation blocks for small-footprint keyword spotting. arXiv preprint arXiv:2004.12200 (2020).
- [33] 2020. CRNN-CTC based Mandarin keywords spotting. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7489–7493.
- [34] 2017. Recent progresses in deep learning based acoustic models. IEEE/CAA Journal of Automatica Sinica 4, 3 (2017), 396–409.
- [35] 2017. Comparison of decoding strategies for CTC acoustic models. arXiv preprint arXiv:1708.04469 (2017).
- [36] 2017. Hello edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128 (2017).
- [37] 2018. Deep learning for environmentally robust speech recognition: An overview of recent developments. ACM Transactions on Intelligent Systems and Technology (TIST) 9, 5 (2018), 1–28.