Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages

Published: 17 June 2019

Abstract

Phrase-based machine translation (MT) systems require large bilingual corpora for training. Such corpora are unavailable for most of the world's language pairs, which creates a bottleneck for the development of MT. The Asian language pairs studied here (Japanese, Indonesian, and Malay, each paired with Vietnamese) are no exception: no large bilingual corpora exist for these low-resource pairs. Furthermore, although these languages are widely used, there is no prior work on MT for these pairs, which hinders the development of MT for them. In this article, we conduct an empirical study of leveraging additional resources to improve MT for these Asian low-resource language pairs, translating from Japanese, Indonesian, and Malay into Vietnamese. We propose an approach built on two strategies: building bilingual corpora from comparable data, and phrase pivot translation using existing bilingual corpora that pair each language with English. The bilingual corpora were built from Wikipedia bilingual titles to augment the bilingual data available for the low-resource languages. We also introduce a model that combines these additional resources to provide an effective way to improve MT on the Asian low-resource languages. Experimental results show the effectiveness of our systems, with improvements of +2 to +7 BLEU points. This work contributes to the development of MT for low-resource languages and opens a promising direction for MT on Asian language pairs.
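To make the phrase pivot idea concrete, the following is a minimal sketch of phrase-table triangulation through English, the general technique commonly used for pivot translation. The abstract does not give the exact formulation used in the article, so this should be read as an illustration under assumptions: the function name `triangulate`, the nested-dictionary table layout, and all probabilities in the toy tables are hypothetical. It estimates p(target | source) by summing p(pivot | source) × p(target | pivot) over the shared English pivot phrases.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Combine a source-pivot and a pivot-target phrase table.

    src_pivot: {source_phrase: {pivot_phrase: p(pivot | source)}}
    pivot_tgt: {pivot_phrase: {target_phrase: p(target | pivot)}}
    Returns   {source_phrase: {target_phrase: p(target | source)}}.
    """
    src_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_tgt.get(pivot, {}).items():
                src_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in src_tgt.items()}


# Toy tables with invented probabilities (Indonesian -> English -> Vietnamese).
id_en = {"terima kasih": {"thank you": 0.8, "thanks": 0.2}}
en_vi = {
    "thank you": {"cảm ơn": 0.9, "cám ơn": 0.1},
    "thanks": {"cảm ơn": 1.0},
}

id_vi = triangulate(id_en, en_vi)
```

In this toy case, both English pivots lead to the same Vietnamese phrase, so their contributions add up (0.8 × 0.9 + 0.2 × 1.0). A real system would also triangulate the inverse and lexical probabilities and prune low-scoring entries, which this sketch omits.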

