Abstract
Parallel sentence pairs play an important role in many natural language processing tasks, especially cross-lingual tasks such as machine translation. Many Asian language pairs still lack bilingual parallel sentences. Because collecting bilingual parallel data by hand is time-consuming and difficult, mining such data automatically is very important for many low-resource Asian language pairs. While existing methods have shown encouraging results, they either rely heavily on bilingual data or have drawbacks in an unsupervised setting. To address these issues, we propose a new unsupervised similarity calculation and a dynamic selection metric to obtain parallel sentence pairs in an unsupervised setting. First, our method maps bilingual word embeddings by postdoc adversarial training, which rotates the source space to match the target space without parallel data. Then, we introduce a new cross-domain similarity adaption to obtain parallel sentence pairs. Experimental results on real-world datasets show that our model obtains better accuracy and recall when mining parallel sentence pairs. We also show that the extracted bilingual sentence corpora can significantly improve the performance of neural machine translation.
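The abstract does not spell out the similarity measure, so the following is only an illustrative sketch of the general idea, not the method proposed in the paper. It assumes that sentence vectors have already been built in a shared space (for example, by averaging mapped word embeddings) and it scores candidate pairs with a CSLS-style cross-domain similarity, a measure known from the word-translation literature. The function names `csls_matrix` and `mine_pairs`, the neighbourhood size `k`, and the fixed `threshold` are all hypothetical; in particular, the fixed threshold stands in for the dynamic selection metric, whose definition is not given here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def csls_matrix(src, tgt, k=2):
    """CSLS-style scores: 2*cos(x, y) minus each side's mean top-k similarity.

    Subtracting the neighbourhood averages penalises "hub" vectors that are
    close to everything, which plain cosine similarity tends to over-select.
    """
    cos = [[cosine(s, t) for t in tgt] for s in src]

    def mean_top_k(values):
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top)

    # Average similarity of each source sentence to its k nearest targets.
    r_src = [mean_top_k(row) for row in cos]
    # Average similarity of each target sentence to its k nearest sources.
    r_tgt = [mean_top_k([cos[i][j] for i in range(len(src))])
             for j in range(len(tgt))]

    return [[2 * cos[i][j] - r_src[i] - r_tgt[j] for j in range(len(tgt))]
            for i in range(len(src))]


def mine_pairs(src, tgt, k=2, threshold=0.0):
    """Keep the best-scoring target per source sentence if it beats threshold.

    The fixed threshold is an assumption made for this sketch only.
    """
    scores = csls_matrix(src, tgt, k)
    pairs = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] > threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy sentence vectors: the target list is the source list in swapped order.
source_vectors = [[1.0, 0.0], [0.0, 1.0]]
target_vectors = [[0.0, 1.0], [1.0, 0.0]]
for i, j, score in mine_pairs(source_vectors, target_vectors):
    print(f"source {i} <-> target {j} (score {score:.2f})")
```

On the toy data, each source vector is paired with the target vector that points in the same direction. A real pipeline would replace the exhaustive similarity matrix with an approximate nearest-neighbour search, since the quadratic cost is impractical for large monolingual corpora.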