Abstract
Lipreading is the task of decoding the movements of a speaker's lip region into text. In recent years, lipreading methods based on deep neural networks have attracted widespread attention, and their accuracy has far surpassed that of experienced human lipreaders. Nevertheless, the visual differences between some phonemes are extremely subtle and pose a great challenge to lipreading. Most existing lipreading methods do not further process the extracted visual features, which leads to two main problems. First, the extracted features contain a great deal of useless information, such as noise caused by differences in speaking speed and lip shape. Second, the extracted features are not abstract enough to distinguish phonemes with similar pronunciations. Both problems degrade lipreading performance. To extract lip-region features that are more discriminative and more relevant to the speech content, this article proposes LCSNet, an end-to-end lipreading model based on a deep neural network. The proposed model extracts short-term spatio-temporal features and motion trajectory features from the lip region in video clips. The extracted features are filtered by a channel attention module to eliminate useless components and then fed into the proposed Selective Feature Fusion Module (SFFM) to extract high-level abstract features. Afterwards, these features are fed in temporal order into a bidirectional GRU network for temporal modeling to obtain long-term spatio-temporal features. Finally, a Connectionist Temporal Classification (CTC) decoder generates the output text. Experimental results show that the proposed model achieves a 1.0% character error rate (CER) and a 2.3% word error rate (WER) on the GRID corpus, relative improvements of 52% and 47%, respectively, over LipNet.
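To make the pipeline described above concrete, the following is a minimal PyTorch sketch of such an architecture. It is illustrative only: the internals of the channel attention module (shown here as squeeze-and-excitation-style gating), the SFFM (shown as softmax-weighted fusion of two feature streams), the use of frame differences as a stand-in motion trajectory stream, and all layer sizes are assumptions, not the authors' exact LCSNet design.

```python
# Minimal sketch of the described pipeline; module internals are assumptions.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel gating (assumed form)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                          # x: (B, C, T, H, W)
        w = self.fc(x.mean(dim=(2, 3, 4)))         # global pooling over T, H, W
        return x * w.view(x.size(0), -1, 1, 1, 1)  # re-weight channels


class SelectiveFeatureFusion(nn.Module):
    """Assumed SFFM: softmax-weighted fusion of two feature streams."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Linear(channels, 2)

    def forward(self, a, b):                       # two streams of equal shape
        s = self.gate((a + b).mean(dim=(2, 3, 4)))            # (B, 2)
        w = torch.softmax(s, dim=1).view(-1, 2, 1, 1, 1, 1)   # stream weights
        stacked = torch.stack([a, b], dim=1)       # (B, 2, C, T, H, W)
        return (w * stacked).sum(dim=1)            # weighted sum of streams


class LCSNetSketch(nn.Module):
    def __init__(self, vocab_size, channels=64, hidden=256):
        super().__init__()
        # short-term spatio-temporal (appearance) stream
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, channels, (3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 2, 2)))
        # motion trajectory stream (frame differences used as an approximation)
        self.motion = nn.Sequential(
            nn.Conv3d(3, channels, (3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 2, 2)))
        self.ca_app = ChannelAttention(channels)
        self.ca_mot = ChannelAttention(channels)
        self.sffm = SelectiveFeatureFusion(channels)
        self.gru = nn.GRU(channels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, frames):                     # frames: (B, 3, T, H, W)
        motion_in = frames - torch.roll(frames, shifts=1, dims=2)
        a = self.ca_app(self.stcnn(frames))        # filter appearance features
        b = self.ca_mot(self.motion(motion_in))    # filter motion features
        fused = self.sffm(a, b)                    # (B, C, T, H', W')
        seq = fused.mean(dim=(3, 4)).transpose(1, 2)   # (B, T, C) frame features
        out, _ = self.gru(seq)                     # long-term temporal modeling
        return self.fc(out).log_softmax(dim=-1)    # per-frame log-probabilities
```

For training, the per-frame log-probabilities returned by `forward` would be paired with `torch.nn.CTCLoss` against the target character sequence, and decoding (e.g., best-path or beam search over the CTC outputs) would produce the final text.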