Abstract
In isolating languages such as Chinese and Vietnamese, words are not separated by spaces, and a word may consist of one or more syllables. Word segmentation (WS) is therefore usually the first step in the machine translation pipeline. Because WS for the source and target languages is trained on different corpora and may use different approaches, the WS results in the two languages are often not homologous. As a result, word alignment produces many 1-n and n-1 alignment pairs in statistical machine translation, which degrades translation performance. In this article, we adjust the WS for Chinese and Vietnamese in particular, and for isolating language pairs in general, to make the word boundaries of the two languages more symmetric, thereby strengthening 1-1 alignments and improving machine translation performance. We tested this method on the Computational Linguistics Center’s corpus, which consists of 35,623 sentence pairs. The experimental results show that our method significantly improves machine translation performance compared with the baseline translation system, the WS translation system, and anchor language-based WS translation systems.
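To make the 1-n problem concrete, the following is a minimal illustrative sketch, not the method of this article. It assumes word alignments are given as a set of (source index, target index) links, counts how many links are 1-1, 1-n, n-1, or n-m, and then merges adjacent target tokens that align only to the same source word. The function names `classify_links` and `merge_one_to_many`, the underscore joiner, and the example sentence pair are all assumptions introduced here for illustration.

```python
from collections import Counter, defaultdict


def classify_links(links):
    """Count alignment links by type: 1-1, 1-n, n-1, or n-m."""
    tgts_of, srcs_of = defaultdict(set), defaultdict(set)
    for s, t in links:
        tgts_of[s].add(t)
        srcs_of[t].add(s)
    counts = Counter()
    for s, t in links:
        fan_out = len(tgts_of[s])  # targets linked to this source word
        fan_in = len(srcs_of[t])   # sources linked to this target word
        if fan_out == 1 and fan_in == 1:
            kind = "1-1"
        elif fan_in == 1:
            kind = "1-n"
        elif fan_out == 1:
            kind = "n-1"
        else:
            kind = "n-m"
        counts[kind] += 1
    return counts


def merge_one_to_many(tgt, links, joiner="_"):
    """Merge adjacent target tokens that align only to the same source word.

    Hypothetical helper: returns the re-segmented target tokens and the
    links re-indexed against the merged tokens.
    """
    srcs_of = defaultdict(set)
    for s, t in links:
        srcs_of[t].add(s)
    # owner[t] is defined only when target token t has exactly one source.
    owner = {t: next(iter(ss)) for t, ss in srcs_of.items() if len(ss) == 1}
    merged, new_index = [], {}
    for t, tok in enumerate(tgt):
        if t > 0 and t in owner and owner.get(t - 1) == owner[t]:
            merged[-1] = merged[-1] + joiner + tok
        else:
            merged.append(tok)
        new_index[t] = len(merged) - 1
    new_links = {(s, new_index[t]) for s, t in links}
    return merged, new_links


# Assumed toy example: Chinese words vs. Vietnamese syllable-level tokens.
src = ["我", "是", "学生"]
tgt = ["tôi", "là", "học", "sinh"]
links = {(0, 0), (1, 1), (2, 2), (2, 3)}

before = classify_links(links)
merged_tgt, merged_links = merge_one_to_many(tgt, links)
after = classify_links(merged_links)
```

In this toy pair, the single Chinese word aligned to two Vietnamese tokens produces two 1-n links; after merging, every link is 1-1. The same helper could be applied in the opposite direction by swapping the roles of source and target.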
Index Terms
Word Re-Segmentation in Chinese-Vietnamese Machine Translation