
Approaches for Multilingual Phone Recognition in Code-switched and Non-code-switched Scenarios Using Indian Languages

Published: 20 July 2021

Abstract

In this study, we evaluate and compare two approaches for multilingual phone recognition in code-switched and non-code-switched scenarios. The first approach uses a front-end Language Identification (LID) system to switch to a monolingual phone recognizer (LID-Mono), with each recognizer trained individually on one of the languages present in the multilingual dataset. In the second approach, a common multilingual phone-set derived from the International Phonetic Alphabet (IPA) transcription of the multilingual dataset is used to develop a Multilingual Phone Recognition System (Multi-PRS). The bilingual code-switching experiments are conducted using the Kannada and Urdu languages. In the first approach, LID is performed using state-of-the-art i-vectors. Both monolingual and multilingual phone recognition systems are trained using Deep Neural Networks. The performance of the LID-Mono and Multi-PRS approaches is compared and analysed in detail. The Multi-PRS approach is found to outperform the more conventional LID-Mono approach in both code-switched and non-code-switched scenarios. For code-switched speech, the effect of the length of the segments used to perform LID on the performance of the LID-Mono system is studied by varying the window size from 500 ms to 5.0 s, and also using the full utterance. The LID-Mono approach depends heavily on the accuracy of the LID system, and LID errors cannot be recovered. The Multi-PRS system, by contrast, requires no front-end LID switching and is built on the common multilingual phone-set derived from several languages; it is therefore not constrained by the accuracy of an LID system and performs effectively on both code-switched and non-code-switched speech, offering lower Phone Error Rates than the LID-Mono system.
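
The abstract compares the two systems by Phone Error Rate but does not spell out how it is computed. As a minimal sketch, assuming the conventional definition (substitutions, deletions, and insertions from a Levenshtein alignment, divided by the number of reference phones), the metric could be computed as follows. The function name `phone_error_rate` and the toy phone sequences are illustrative assumptions, not taken from the article.

```python
def phone_error_rate(reference, hypothesis):
    """Return the Phone Error Rate in percent.

    Assumes the conventional definition: the minimum number of
    substitutions, deletions, and insertions needed to turn the
    reference phone sequence into the hypothesis, divided by the
    number of reference phones.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one phone")

    # prev[j] holds the edit distance between the first i-1 reference
    # phones and the first j hypothesis phones.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[m] / n


# Toy example with made-up phone labels (not data from the article).
ref = ["k", "a", "n", "a"]
hyp = ["k", "a", "m", "a"]   # one substitution out of four phones
print(phone_error_rate(ref, hyp))
```

In an LID-Mono pipeline, a segment routed to the wrong monolingual recognizer would be scored in the same way, which is one way to see how LID errors carry through to the final error rate.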

