Abstract
We present a novel approach for improving the performance of an end-to-end speech recognition system for the Gujarati language. We follow a deep learning-based approach that combines Convolutional Neural Network (CNN) layers, bidirectional Long Short-Term Memory (LSTM) layers, and dense layers, with Connectionist Temporal Classification (CTC) as the loss function. To improve the performance of the system given the limited size of the dataset, we present a prefix decoding technique based on a combined language model (a word-level language model and a character-level language model) and a post-processing technique based on Bidirectional Encoder Representations from Transformers (BERT). To gain key insights into our Automatic Speech Recognition (ASR) system, we propose different analysis methods that use the inferences produced by the system. These insights help us understand and improve the ASR system and provide intuition into the language it is applied to. We trained the model on the Microsoft Speech Corpus and observe a 5.87% decrease in Word Error Rate (WER) compared with the base-model WER.
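As a point of reference for two terms used in the abstract, the sketch below shows how WER is commonly computed (word-level edit distance divided by the number of reference words) and how a frame-level CTC output is collapsed into a label sequence by greedy decoding. This is a minimal illustration, not the authors' implementation: the function names and the blank index are assumptions, and the greedy collapse is the simplest CTC decoder, not the language-model-based prefix decoding the paper proposes.

```python
from typing import List, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def ctc_greedy_collapse(frame_labels: Sequence[int], blank: int = 0) -> List[int]:
    """Merge consecutive repeated labels, then drop blanks (greedy CTC decoding).

    `frame_labels` is assumed to be the per-frame argmax of the network output;
    `blank` is the (assumed) index of the CTC blank symbol.
    """
    collapsed: List[int] = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


if __name__ == "__main__":
    # One substitution and one deletion against a five-word reference.
    print(word_error_rate("the cat sat on mat", "the cat sit on"))
    # The blank between the two 1s keeps them as separate labels.
    print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0]))
```

Because WER is normalized by the reference length, it can exceed 1.0 when the hypothesis contains many inserted words.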
Index Terms
Improving Deep Learning based Automatic Speech Recognition for Gujarati