Abstract
Large-scale parallel corpora are essential for training high-quality machine translation systems; however, such corpora are not freely available for many language pairs. Previously, training data have been augmented with pseudo-parallel corpora obtained by using machine translation models to translate monolingual corpora into the source language. However, for low-resource language pairs, in which only low-accuracy machine translation systems are available, translation quality degrades when a pseudo-parallel corpus is used naively. To improve machine translation performance for low-resource language pairs, we propose a method that effectively expands the training data by filtering the pseudo-parallel corpus with quality estimation based on sentence-level round-trip translation. In experiments with three language pairs using small, medium, and large parallel corpora, BLEU scores improved significantly for the low-resource language pairs. We also investigate the effect of iterative bootstrapping on translation quality and confirm that bootstrapping can further improve translation performance.
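The filtering idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `src2tgt`/`tgt2src` translation functions and the score threshold are hypothetical placeholders, and a simple add-1-smoothed sentence-level BLEU stands in for the paper's exact quality-estimation metric.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-1 smoothing on the n-gram precisions."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_ng, hyp_ng = ngrams(ref, n), ngrams(hyp, n)
        overlap = sum((ref_ng & hyp_ng).values())  # clipped n-gram matches
        total = sum(hyp_ng.values())
        # add-1 smoothing keeps the score nonzero when an order has no match
        log_prec += math.log((overlap + 1) / (total + 1))
    # brevity penalty for hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec / max_n)

def round_trip_filter(source_sents, src2tgt, tgt2src, threshold=0.3):
    """Keep (source, pseudo-target) pairs whose round-trip translation
    scores highly against the original source sentence."""
    kept = []
    for src in source_sents:
        pseudo_tgt = src2tgt(src)         # forward MT model (placeholder)
        round_trip = tgt2src(pseudo_tgt)  # backward MT model (placeholder)
        if sentence_bleu(src, round_trip) >= threshold:
            kept.append((src, pseudo_tgt))
    return kept
```

Sentence pairs whose round-trip translation diverges from the original source are assumed to be poor pseudo-parallel data and are discarded; only the surviving pairs would be added to the training corpus.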
Filtered Pseudo-parallel Corpus Improves Low-resource Neural Machine Translation