Abstract
Data augmentation is a widely used approach for many text generation tasks. In machine translation, particularly in low-resource scenarios, numerous augmentation methods have been proposed; the most common generate pseudo data by omitting words, sampling words randomly, or replacing some words in the text. However, these methods rarely guarantee the quality of the augmented data. In this work, we construct augmented data using paraphrase embeddings and POS tagging. Specifically, we generate a pseudo monolingual corpus by replacing words carrying the four main POS tags (noun, adjective, adverb, and verb) based on both a paraphrase table and embedding similarity. We select the larger word-level paraphrase table, obtain the embedding of each word in the table, and compute the cosine similarity between these words and the tagged words in the original sequence. In addition, we exploit a ranking algorithm to choose highly similar words, which reduces semantic errors, and constrain replacements to the same POS tag, which mitigates syntactic errors to some extent. Experimental results show that our augmentation method consistently outperforms all previous state-of-the-art methods on seven low-resource language pairs from four corpora, by 1.16 to 2.39 BLEU points.
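The core replacement step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the paraphrase table, embeddings, POS dictionary, and similarity threshold below are all toy stand-ins for the real word-level paraphrase table and pre-trained embeddings the paper assumes.

```python
import math

# Hypothetical word-level paraphrase table and word embeddings (toy data).
PARAPHRASE_TABLE = {
    "big": ["large", "huge", "tall"],
    "quickly": ["rapidly", "slowly"],
}
EMBEDDINGS = {
    "big": [1.0, 0.1, 0.0], "large": [0.9, 0.2, 0.0],
    "huge": [0.8, 0.4, 0.1], "tall": [0.2, 0.9, 0.3],
    "quickly": [0.0, 1.0, 0.2], "rapidly": [0.1, 0.9, 0.2],
    "slowly": [0.5, 0.3, 0.9],
}
# POS tags for the four content-word classes the method targets.
POS = {"big": "ADJ", "quickly": "ADV", "dog": "NOUN", "runs": "VERB"}
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def augment(tokens, threshold=0.9):
    """Replace each content word with its top-ranked paraphrase,
    but only when the cosine similarity clears the threshold."""
    out = []
    for tok in tokens:
        if POS.get(tok) in CONTENT_TAGS and tok in PARAPHRASE_TABLE:
            # Rank paraphrase candidates by similarity to the original word.
            ranked = sorted(
                ((cosine(EMBEDDINGS[tok], EMBEDDINGS[c]), c)
                 for c in PARAPHRASE_TABLE[tok] if c in EMBEDDINGS),
                reverse=True,
            )
            if ranked and ranked[0][0] >= threshold:
                out.append(ranked[0][1])
                continue
        out.append(tok)  # no confident paraphrase: keep the original word
    return out

print(augment(["the", "big", "dog", "runs", "quickly"]))
# → ['the', 'large', 'dog', 'runs', 'rapidly']
```

Because only same-POS candidates from the paraphrase table are considered and low-similarity candidates are filtered out, the generated sentence keeps the original's syntactic shape while varying its surface words.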
Improving Data Augmentation for Low-Resource NMT Guided by POS-Tagging and Paraphrase Embedding