Word Re-Segmentation in Chinese-Vietnamese Machine Translation

Published: 04 November 2016

Abstract

In isolating languages such as Chinese and Vietnamese, words are not delimited by spaces, and a word may consist of one or more syllables. Word segmentation (WS) is therefore usually the first step in the machine translation process. WS for the source and target languages is based on different training corpora, and the WS approaches themselves may differ. As a result, the WS results for the two languages are often not homologous, and word alignment in statistical machine translation produces many 1-n and n-1 alignment pairs, which degrades translation performance. In this article, we adjust the WS for Chinese and Vietnamese in particular, and for isolating language pairs in general, to make the word boundaries of the two languages more symmetric, thereby strengthening 1-1 alignments and improving machine translation performance. We tested this method on the Computational Linguistics Center’s corpus, which consists of 35,623 sentence pairs. The experimental results show that our method significantly improves machine translation performance compared to the baseline translation system, the WS translation system, and anchor language-based WS translation systems.
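To make the idea of symmetric word boundaries concrete, the following is a minimal sketch, not the method proposed in the article. It assumes word alignments are available as (source index, target index) pairs and merges adjacent tokens on one side that all link to the same single token on the other side, turning an n-1 alignment into a 1-1 alignment. The function name `merge_many_to_one`, the underscore joiner, and the example sentence pair are illustrative assumptions.

```python
from collections import defaultdict


def merge_many_to_one(tokens, alignment):
    """Merge adjacent tokens that each link to the same single token on the other side.

    tokens:    list of tokens on the side to be re-segmented.
    alignment: iterable of (token_index, other_side_index) pairs.
    Hypothetical helper for illustration; not the article's algorithm.
    """
    # Collect, for each token, the set of other-side indices it is linked to.
    links = defaultdict(set)
    for i, j in alignment:
        links[i].add(j)

    merged = []
    prev_single = None  # other-side index of the previous token if it had exactly one link
    for i, tok in enumerate(tokens):
        targets = links.get(i, set())
        single = next(iter(targets)) if len(targets) == 1 else None
        if merged and single is not None and single == prev_single:
            # Same single counterpart as the previous token: join into one word.
            merged[-1] = merged[-1] + "_" + tok
        else:
            merged.append(tok)
        prev_single = single
    return merged


# Assumed toy example: Vietnamese segmented by syllable, Chinese by word.
vietnamese = ["tôi", "là", "sinh", "viên"]
chinese = ["我", "是", "学生"]
# (Vietnamese index, Chinese index); "sinh" and "viên" both link to "学生".
alignment = [(0, 0), (1, 1), (2, 2), (3, 2)]

resegmented = merge_many_to_one(vietnamese, alignment)
print(resegmented)
```

After merging, both sides of this toy pair have three tokens, so every link is 1-1. A real system would also need to handle splitting, non-adjacent links, and alignment noise, none of which this sketch attempts.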

