LCSNet: End-to-end Lipreading with Channel-aware Feature Selection


Abstract

Lipreading is the task of decoding the movement of a speaker’s lip region into text. In recent years, lipreading methods based on deep neural networks have attracted widespread attention, and their accuracy has far surpassed that of experienced human lipreaders. The visual differences between some phonemes are extremely subtle and pose a great challenge to lipreading. Most existing lipreading methods do not further process the extracted visual features, which leads to two main problems. First, the extracted features contain a great deal of useless information, such as noise caused by differences in speech speed and lip shape. Second, the extracted features are not abstract enough to distinguish phonemes with similar pronunciations. These problems degrade lipreading performance. To extract features from the lip region that are more discriminative and more relevant to the speech content, this article proposes an end-to-end deep neural network-based lipreading model (LCSNet). The proposed model extracts short-term spatio-temporal features and motion trajectory features from the lip region in video clips. The extracted features are first filtered by a channel attention module to eliminate useless components and then fed into the proposed Selective Feature Fusion Module (SFFM) to extract high-level abstract features. Afterwards, these features are fed in temporal order into a bidirectional GRU network for temporal modeling to obtain long-term spatio-temporal features. Finally, a Connectionist Temporal Classification (CTC) decoder generates the output text. Experimental results show that the proposed model achieves a 1.0% CER and a 2.3% WER on the GRID corpus, an improvement of 52% and 47%, respectively, compared to LipNet.
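For readers who want a concrete picture of the pipeline the abstract describes, the following PyTorch sketch wires the stages together: a 3D-convolutional frontend for short-term spatio-temporal features, an SE-style channel attention module, a stand-in for the SFFM, a bidirectional GRU, and a linear classifier whose per-frame logits feed a CTC loss. Every layer size, kernel shape, and module internal here (including the `ChannelAttention` internals and the placeholder `sffm` block) is an illustrative assumption, not the paper's actual architecture.

```python
# A minimal, runnable PyTorch sketch of the pipeline described above.
# All layer sizes, kernel shapes, and module internals are assumptions;
# the abstract does not specify the paper's exact architecture.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """SE-style channel re-weighting to suppress useless feature channels
    (an assumed realization of the paper's channel attention module)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, C, T, H, W)
        w = x.mean(dim=(2, 3, 4))                  # global average pool -> (B, C)
        w = self.fc(w).view(x.size(0), -1, 1, 1, 1)
        return x * w                               # channel-wise gating


class LCSNetSketch(nn.Module):
    def __init__(self, vocab_size: int = 28, hidden: int = 256):
        super().__init__()
        # 3D-conv frontend: short-term spatio-temporal features (assumed shape).
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 5, 5),
                      stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.channel_attn = ChannelAttention(64)
        # Stand-in for the paper's SFFM: one conv block that collapses the
        # spatial dimensions while keeping the temporal axis intact.
        self.sffm = nn.Sequential(
            nn.Conv3d(64, 96, kernel_size=3, padding=1),
            nn.BatchNorm3d(96),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        self.gru = nn.GRU(96, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        # vocab_size is assumed to include the CTC blank symbol.
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, clips):                      # clips: (B, 3, T, H, W)
        x = self.frontend(clips)
        x = self.channel_attn(x)                   # filter useless channels
        x = self.sffm(x).squeeze(-1).squeeze(-1)   # (B, 96, T)
        x, _ = self.gru(x.transpose(1, 2))         # (B, T, 2*hidden)
        return self.classifier(x)                  # per-frame logits for CTC


# Illustrative usage: 75-frame GRID-style clips with an assumed 64x128 crop.
model = LCSNetSketch()
clips = torch.randn(2, 3, 75, 64, 128)
log_probs = model(clips).log_softmax(-1).transpose(0, 1)  # (T, B, V) for nn.CTCLoss
```

At training time the (T, B, V) log-probabilities would be fed to `torch.nn.CTCLoss` together with the target character sequences; at inference time a greedy or beam-search CTC decoder collapses repeated symbols and blanks into the output text.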



• Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1s (February 2023), 504 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3572859
• Editor: Abdulmotaleb El Saddik


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 23 January 2023
        • Online AM: 17 March 2022
        • Accepted: 8 March 2022
        • Revised: 10 January 2022
        • Received: 31 October 2021
Published in TOMM Volume 19, Issue 1s
