Abstract
Bilingual and multilingual speech recognition systems deployed in real-world applications must cope with accented, non-native speech. Modeling the accents of non-native speech is challenging because the acoustic properties of highly accented speech produced by non-native speakers vary considerably. This study aims to generate highly Mandarin-accented English acoustic models for speakers whose mother tongue is Mandarin. First, a two-stage, state-based verification method is proposed to automatically extract state-level, highly accented speech segments; acoustic features and then articulatory features are used in succession to verify the extracted segments robustly. Second, the Gaussian components of the highly accented speech models are generated from the corresponding Gaussian components of the native speech models using a linear transformation function. To deal with data sparseness, a decision tree is constructed to categorize the transformation functions and is then used to retrieve a suitable transformation function. Third, a discrimination function is applied to verify the generated accented acoustic models. Finally, the successfully verified accented English models are integrated into the native bilingual phone model set for Mandarin-English bilingual speech recognition. Experimental results show that the proposed approach effectively alleviates the recognition performance degradation caused by accents, yielding absolute word-accuracy improvements of 4.1%, 1.8%, and 2.7% in bilingual speech recognition over a traditional ASR approach, MAP-adapted ASR, and MLLR-adapted ASR, respectively.
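The second step, generating an accented Gaussian component from a native one with a linear transformation, can be illustrated with a minimal sketch. The abstract does not give the exact form of the transformation, so the sketch assumes the standard affine mapping of a Gaussian, with mean A·μ + b and covariance A·Σ·Aᵀ. The function name `transform_gaussian` and all numeric values are hypothetical and serve only as illustration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mat_vec(a: Matrix, x: Vector) -> Vector:
    """Multiply matrix a by column vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a: Matrix) -> Matrix:
    """Return the transpose of matrix a."""
    return [list(col) for col in zip(*a)]


def transform_gaussian(
    mean: Vector, cov: Matrix, a: Matrix, b: Vector
) -> Tuple[Vector, Matrix]:
    """Map a native Gaussian (mean, cov) to an accented one.

    Assumed form (not taken from the paper): mean' = A mean + b,
    cov' = A cov A^T, which is how any Gaussian behaves under an
    affine map.
    """
    new_mean = [m + b_i for m, b_i in zip(mat_vec(a, mean), b)]
    new_cov = mat_mul(mat_mul(a, cov), transpose(a))
    return new_mean, new_cov


# Hypothetical 2-dimensional native Gaussian component.
native_mean = [1.0, 2.0]
native_cov = [[1.0, 0.0], [0.0, 4.0]]

# Hypothetical transformation, e.g. one retrieved from a decision-tree leaf.
A = [[2.0, 0.0], [0.0, 0.5]]
b = [0.5, -1.0]

accented_mean, accented_cov = transform_gaussian(native_mean, native_cov, A, b)
print(accented_mean)
print(accented_cov)
```

In a setup like the one the abstract describes, one such (A, b) pair would presumably be shared by all states that fall into the same decision-tree category, so that states with little accented data can still borrow a transformation.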
Index Terms
Model Generation of Accented Speech using Model Transformation and Verification for Bilingual Speech Recognition