Improving Deep Learning based Automatic Speech Recognition for Gujarati

Published: 13 December 2021

Abstract

We present a novel approach for improving the performance of an end-to-end speech recognition system for the Gujarati language. We follow a deep learning-based approach comprising Convolutional Neural Network layers, Bi-directional Long Short-Term Memory layers, Dense layers, and Connectionist Temporal Classification as the loss function. To improve the performance of the system despite the limited size of the dataset, we present a prefix decoding technique based on a combined language model (a word-level and a character-level language model) and a post-processing technique based on Bidirectional Encoder Representations from Transformers. To gain key insights into our Automatic Speech Recognition (ASR) system, we analyze the system's inferences with several proposed analysis methods. These insights help us understand and improve the ASR system, and also provide intuition about the language the system is built for. We trained the model on the Microsoft Speech Corpus and observe a 5.87% decrease in Word Error Rate (WER) relative to the base model.
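The reported metric, Word Error Rate, is the word-level Levenshtein (edit) distance between the reference transcript and the system's hypothesis, divided by the number of reference words. A minimal, dependency-free sketch (the function name and the toy strings are illustrative, not from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# e.g., wer("a b c d", "a x c") -> 0.5
# (one substitution plus one deletion over four reference words)
```

The same edit-distance recurrence over characters instead of words yields the Character Error Rate, which is often the more informative metric for abugida scripts such as Gujarati.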


Published in: ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 3, May 2022, 413 pages. ISSN: 2375-4699. EISSN: 2375-4702. DOI: 10.1145/3505182.

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 November 2020
• Revised: 1 August 2021
• Accepted: 1 August 2021
• Published: 13 December 2021

Published in TALLIP Volume 21, Issue 3

      Qualifiers

      • research-article
      • Refereed