Abstract
Accurate phonetic transcriptions are crucial for building robust acoustic models for both speech recognition and speech synthesis. Phonetic transcriptions are not usually provided with speech corpora, so a lexicon is used to generate phone-level transcriptions from the sentence-level transcriptions, with letter-to-sound (LTS) rules used when a lexical entry is not available. In either case the pronunciation rules are generic and may not match the spoken utterance, which can lead to transcription errors. The objective of this study is to address this mismatch between a transcription and its acoustic realisation, with a particular focus on vowel deletions. Group-delay-based segmentation is used to determine the insertion or deletion of vowels in the speech utterance, and the transcriptions in the training data are corrected accordingly. Automatic speech recognition (ASR) and text-to-speech synthesis (TTS) systems built with the corrected transcriptions show improved performance.
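The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the phone labels, the toy lexicon, the one-letter-per-phone `naive_lts` rule, and the function names are all hypothetical, and the acoustically detected syllable count is assumed to be supplied by a separate segmentation step (group-delay-based in the paper) that is not implemented here. The sketch only shows lexicon lookup with LTS fallback and how a mismatch between expected and observed vowel counts could be flagged.

```python
# Hypothetical vowel phone labels; a real phone set would be language specific.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "o"}


def naive_lts(word):
    """Toy letter-to-sound rule (assumption): one phone per letter."""
    return list(word)


def transcribe(word, lexicon, lts=naive_lts):
    """Return the lexicon pronunciation, falling back to LTS rules."""
    if word in lexicon:
        return list(lexicon[word])
    return lts(word)


def vowel_mismatch(phones, acoustic_syllable_count):
    """Expected vowel count minus the syllable count found in the audio.

    A positive value suggests vowels in the transcription that were not
    spoken (possible deletions); a negative value suggests extra vowels
    in the audio (possible insertions); zero means the counts agree.
    """
    expected = sum(1 for p in phones if p in VOWELS)
    return expected - acoustic_syllable_count


# Hypothetical lexicon with a single entry.
lexicon = {"kamala": ["k", "a", "m", "a", "l", "a"]}

phones = transcribe("kamala", lexicon)
# Suppose the segmentation step found only two syllable-like units.
mismatch = vowel_mismatch(phones, acoustic_syllable_count=2)
```

Here `mismatch` only flags the utterance for correction. Deciding which vowel to drop from or add to the transcription requires the segment boundaries themselves, which this sketch does not model.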
Importance of Signal Processing Cues in Transcription Correction for Low-Resource Indian Languages