
Importance of Signal Processing Cues in Transcription Correction for Low-Resource Indian Languages

Published: 10 August 2019

Abstract

Accurate phonetic transcriptions are crucial for building robust acoustic models for both speech recognition and speech synthesis. Speech corpora, however, usually come with sentence-level transcriptions only, not phonetic ones. Phone-level transcriptions are therefore generated from the sentence-level text using a lexicon, and letter-to-sound (LTS) rules are used when lexical entries are not available. In either case, the pronunciation rules are generic and may not match the spoken utterance, which can lead to transcription errors. The objective of this study is to address this mismatch between a transcription and its acoustic realisation, focusing in particular on vowel deletions. Group-delay-based segmentation is used to determine insertion/deletion of vowels in the speech utterance, and the transcriptions in the training data are corrected accordingly. Automatic speech recognition (ASR) and text-to-speech synthesis (TTS) systems built with the corrected transcriptions show improved performance.
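As a rough illustration of the pipeline the abstract describes, the sketch below generates phone-level transcriptions from a lexicon with an LTS fallback, then compares the number of vowels in the transcription against a count of syllable-like segments found in the audio. It is not the authors' method: the toy lexicon, the one-letter-per-phone LTS table, the vowel set, and the function names are all hypothetical, and the segment count is assumed to come from some external segmenter (for example, a group-delay-based one) that is not implemented here.

```python
from typing import Dict, List

# Hypothetical toy lexicon: word -> phone sequence (illustrative only).
LEXICON: Dict[str, List[str]] = {
    "rama": ["r", "aa", "m", "a"],
}

# Hypothetical one-letter-per-phone LTS table, a stand-in for real rules.
LTS_RULES: Dict[str, str] = {
    "a": "a", "k": "k", "l": "l", "m": "m", "r": "r",
}

# Assumed vowel inventory for this toy phone set.
VOWELS = {"a", "aa"}


def word_to_phones(word: str) -> List[str]:
    """Use the lexicon entry if one exists, otherwise fall back to LTS rules."""
    if word in LEXICON:
        return list(LEXICON[word])
    return [LTS_RULES[ch] for ch in word if ch in LTS_RULES]


def sentence_to_phones(sentence: str) -> List[str]:
    """Expand a sentence-level transcription into a flat phone sequence."""
    phones: List[str] = []
    for word in sentence.lower().split():
        phones.extend(word_to_phones(word))
    return phones


def vowel_count_mismatch(phones: List[str], n_acoustic_segments: int) -> int:
    """Vowels in the transcription minus syllable-like segments in the audio.

    Assumes one vowel per syllable-like segment. A positive value hints that
    the speaker may have deleted a vowel; a negative value hints at an
    inserted vowel. The segment count is supplied by the caller.
    """
    n_vowels = sum(1 for p in phones if p in VOWELS)
    return n_vowels - n_acoustic_segments


if __name__ == "__main__":
    phones = sentence_to_phones("rama")
    # Suppose the segmenter found a single syllable-like unit ("ram").
    print(phones, vowel_count_mismatch(phones, n_acoustic_segments=1))
```

A non-zero mismatch only flags an utterance for inspection. Deciding which vowel to remove or insert would need the segment boundaries themselves, which this sketch does not model.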

