Research Article (Open Access)

Filtered Pseudo-parallel Corpus Improves Low-resource Neural Machine Translation

Published: 31 October 2019

Abstract

Large-scale parallel corpora are essential for training high-quality machine translation systems; however, such corpora are not freely available for many language pairs. Previously, training data have been augmented with pseudo-parallel corpora obtained by using machine translation models to translate monolingual corpora into the source language. However, for low-resource language pairs, in which only low-accuracy machine translation systems are available, translation quality degrades when a pseudo-parallel corpus is used naively. To improve machine translation performance for low-resource language pairs, we propose a method that effectively expands the training data by filtering the pseudo-parallel corpus using quality estimation based on sentence-level round-trip translation. In experiments on three language pairs with small, medium, and large parallel corpora, BLEU scores improved significantly for the low-resource language pairs. We also investigate the effect of iterative bootstrapping on translation quality and confirm that bootstrapping can further improve translation performance.
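To make the filtering step concrete, below is a minimal Python sketch of quality estimation by sentence-level round-trip translation as described in the abstract: each monolingual target sentence is back-translated into the source language, translated back into the target language, and the round-trip output is scored against the original sentence; only pairs whose score clears a threshold enter the pseudo-parallel training data. This is an illustrative assumption-laden sketch, not the paper's implementation: the `translate_t2s` and `translate_s2t` callables stand in for the two trained NMT models, the BLEU threshold is hypothetical (in practice it would be tuned on development data), and sacrebleu's sentence-level BLEU is used here as one plausible quality-estimation metric.

```python
from typing import Callable, List, Tuple

import sacrebleu


def filter_pseudo_parallel(
    mono_target: List[str],
    translate_t2s: Callable[[str], str],  # hypothetical target->source NMT model
    translate_s2t: Callable[[str], str],  # hypothetical source->target NMT model
    threshold: float = 20.0,              # assumed BLEU cutoff; tune on dev data
) -> List[Tuple[str, str]]:
    """Build a filtered pseudo-parallel corpus via round-trip translation."""
    kept: List[Tuple[str, str]] = []
    for y in mono_target:
        # Back-translate the monolingual target sentence into the source language.
        x_prime = translate_t2s(y)
        # Translate the synthetic source sentence back into the target language.
        y_prime = translate_s2t(x_prime)
        # Sentence-level BLEU between the round-trip output and the original
        # sentence serves as the quality estimate for the pair (x_prime, y).
        score = sacrebleu.sentence_bleu(y_prime, [y]).score
        if score >= threshold:
            kept.append((x_prime, y))  # keep as pseudo-parallel training data
    return kept
```

Under this reading, iterative bootstrapping would add the kept pairs to the parallel training data, retrain both translation models, and repeat the filtering with the improved models.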

