
Model Generation of Accented Speech using Model Transformation and Verification for Bilingual Speech Recognition

Published: 20 April 2015

Abstract

In many real-world applications, bilingual and multilingual speech recognition suffers from accent-related errors caused by non-native speech. Accent modeling of non-native speech is challenging because the acoustic properties of highly accented speech produced by non-native speakers diverge widely. The aim of this study is to generate acoustic models of highly Mandarin-accented English for speakers whose mother tongue is Mandarin. First, a two-stage, state-based verification method is proposed to automatically extract state-level, highly accented speech segments. Acoustic features and then articulatory features are used for robust verification of the extracted speech segments. Second, the Gaussian components of the highly accented speech models are generated from the corresponding Gaussian components of the native speech models using a linear transformation function. To deal with data sparseness, a decision tree is constructed to categorize the transformation functions and is then used to retrieve an appropriate transformation function. Third, a discrimination function is applied to verify the generated accented acoustic models. Finally, the successfully verified accented English models are integrated into the native bilingual phone model set for Mandarin-English bilingual speech recognition. Experimental results show that the proposed approach effectively alleviates the recognition performance degradation caused by accents, with absolute improvements in word accuracy for bilingual speech recognition of 4.1%, 1.8%, and 2.7% over the traditional ASR approach, the MAP-adapted ASR method, and the MLLR-adapted ASR method, respectively.
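
The second step, generating an accented Gaussian component from a native one through a linear transformation, can be illustrated with a minimal sketch. The abstract does not give the exact form of the transformation, so the code below assumes a common affine parameterization: the mean is mapped as mu_acc = A * mu_nat + b and the covariance as Sigma_acc = A * Sigma_nat * A^T. The names `transform_gaussian`, `A`, and `b`, and all numeric values, are hypothetical and chosen only for illustration. They are not taken from the paper.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Multiply matrix a by column vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a: Matrix) -> Matrix:
    """Return the transpose of matrix a."""
    return [list(col) for col in zip(*a)]


def transform_gaussian(
    mean: Vector, cov: Matrix, a: Matrix, b: Vector
) -> Tuple[Vector, Matrix]:
    """Map a native Gaussian (mean, cov) to an accented one.

    Assumed form (not confirmed by the source):
        new_mean = a @ mean + b
        new_cov  = a @ cov @ a^T
    """
    new_mean = [m + offset for m, offset in zip(matvec(a, mean), b)]
    new_cov = matmul(matmul(a, cov), transpose(a))
    return new_mean, new_cov


# Toy 2-dimensional native Gaussian component (illustrative values only).
native_mean = [1.0, 2.0]
native_cov = [[1.0, 0.0], [0.0, 4.0]]

# Hypothetical transformation parameters.
A = [[2.0, 0.0], [0.0, 0.5]]
b = [0.5, -1.0]

accented_mean, accented_cov = transform_gaussian(native_mean, native_cov, A, b)
print(accented_mean)  # [2.5, 0.0]
print(accented_cov)   # [[4.0, 0.0], [0.0, 1.0]]
```

In a full system the parameters would be estimated from the verified accented segments rather than set by hand. The decision tree mentioned in the abstract would then supply a transformation for components that have too little accented data to estimate their own.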



    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 14, Issue 2
      March 2015
      96 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/2764912

      Copyright © 2015 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 20 April 2015
      • Accepted: 1 August 2014
      • Revised: 1 June 2014
      • Received: 1 August 2013


      Qualifiers

      • research-article
      • Research
      • Refereed
