Abstract
Phrase-based machine translation (MT) systems require large bilingual corpora for training. Such corpora are unavailable for most of the world's language pairs, which creates a bottleneck for the development of MT. The Asian language pairs considered here (Japanese, Indonesian, and Malay, each paired with Vietnamese) are no exception: no large bilingual corpora exist for these low-resource pairs. Furthermore, although these languages are widely used, there is no prior work on MT for these pairs, which further hinders MT development for them. In this article, we conduct an empirical study of leveraging additional resources to improve MT for these Asian low-resource language pairs, translating from Japanese, Indonesian, and Malay to Vietnamese. We propose an approach based on two strategies: building bilingual corpora from comparable data, and phrase pivot translation using existing bilingual corpora that pair each language with English. Bilingual corpora were built from Wikipedia bilingual titles to enhance the bilingual data for the low-resource languages. Additionally, we introduce a model that combines these additional resources to improve MT on the Asian low-resource languages. Experimental results show the effectiveness of our systems, with improvements of +2 to +7 BLEU points. This work contributes to the development of MT on low-resource languages and opens a promising direction for MT on these Asian language pairs.
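To make the phrase pivot idea concrete, the following is a minimal sketch of phrase-table triangulation through English, the form commonly described in the pivot translation literature: the source-to-target probability is obtained by summing, over shared pivot phrases, the product of the source-to-pivot and pivot-to-target probabilities. It is not taken from the article. The function name `triangulate`, the nested-dictionary table layout, and the toy phrases and probabilities are all assumptions made for illustration, and the authors' actual system may differ.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine two phrase tables through shared pivot phrases (hypothetical helper).

    src_to_pivot: {source phrase: {pivot phrase: p(pivot | source)}}
    pivot_to_tgt: {pivot phrase: {target phrase: p(target | pivot)}}
    Returns {source phrase: {target phrase: p(target | source)}}, where
    p(target | source) = sum over pivots of p(target | pivot) * p(pivot | source).
    """
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                src_to_tgt[src][tgt] += p_pivot * p_tgt
    return {src: dict(tgts) for src, tgts in src_to_tgt.items()}


# Toy Japanese-English and English-Vietnamese tables (made-up probabilities).
ja_en = {"猫": {"cat": 0.9, "kitty": 0.1}}
en_vi = {"cat": {"con mèo": 0.8, "mèo": 0.2}, "kitty": {"mèo con": 1.0}}

ja_vi = triangulate(ja_en, en_vi)
for target, prob in sorted(ja_vi["猫"].items(), key=lambda kv: -kv[1]):
    print(f"猫 -> {target}: {prob:.2f}")
```

In this sketch, source phrases whose pivot translations never appear in the second table simply receive no entry, which is one reason a pivoted table is usually combined with a table trained on whatever direct bilingual data exists.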
Index Terms
Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages