research-article

Unsupervised Parallel Sentences of Machine Translation for Asian Language Pairs

Published: 10 March 2023

Abstract

Parallel sentence pairs play an important role in many natural language processing tasks, especially cross-lingual tasks such as machine translation. Many Asian language pairs still lack bilingual parallel sentences. Because collecting bilingual parallel data is time-consuming and difficult, methods that obtain it automatically are important for many low-resource Asian language pairs. Existing methods have shown encouraging results, but they either rely heavily on bilingual data or have drawbacks in the unsupervised setting. To address these issues, we propose a new unsupervised similarity calculation and a dynamic selection metric for obtaining parallel sentence pairs without supervision. First, our method maps bilingual word embeddings by postdoc adversarial training, which rotates the source space to match the target space without parallel data. Then, we introduce a new cross-domain similarity adaption to obtain parallel sentence pairs. Experimental results on real-world datasets show that our model achieves better accuracy and recall when mining parallel sentence pairs. We also show that the extracted bilingual sentence corpora can significantly improve the performance of neural machine translation.
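
To make the mining step more concrete, the following is a minimal sketch of how sentence pairs might be selected once both languages share one embedding space. It does not reproduce the paper's similarity adaption or its dynamic selection metric, which the abstract does not specify. Instead it substitutes a well-known related score, cross-domain similarity local scaling (CSLS), followed by a mutual-best-match filter. The function names `csls_scores` and `mine_pairs`, the neighbourhood size `k`, and the assumption that sentence vectors are already mapped into a common space are all illustrative choices, not details from the paper.

```python
import numpy as np


def _normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def csls_scores(src: np.ndarray, tgt: np.ndarray, k: int = 2) -> np.ndarray:
    """Return the CSLS score matrix of shape (n_src, n_tgt).

    CSLS(x, y) = 2 * cos(x, y) - r_tgt(x) - r_src(y), where r_tgt(x) is the
    mean cosine between x and its k nearest target vectors, and r_src(y) is
    the mean cosine between y and its k nearest source vectors. Subtracting
    these terms penalizes "hub" vectors that are close to everything.
    """
    cos = _normalize(src) @ _normalize(tgt).T
    # Mean similarity of each source row to its k most similar targets.
    r_src_rows = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # Mean similarity of each target column to its k most similar sources.
    r_tgt_cols = np.sort(cos, axis=0)[-k:, :].mean(axis=0)
    return 2 * cos - r_src_rows[:, None] - r_tgt_cols[None, :]


def mine_pairs(src: np.ndarray, tgt: np.ndarray, k: int = 2) -> list[tuple[int, int]]:
    """Keep (source index, target index) pairs that are each other's best match.

    This mutual-best filter is a simple stand-in for a selection rule; it is
    an assumption of this sketch, not the paper's dynamic selection metric.
    """
    scores = csls_scores(src, tgt, k)
    best_tgt = scores.argmax(axis=1)  # best target for every source sentence
    best_src = scores.argmax(axis=0)  # best source for every target sentence
    return [(i, int(j)) for i, j in enumerate(best_tgt) if best_src[j] == i]
```

Calling `mine_pairs(src, tgt)` with two arrays of sentence vectors returns index pairs; in practice a score threshold would likely be applied as well, since a mutual best match alone does not guarantee that two sentences are translations.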



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3
March 2023
570 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3579816


Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 10 March 2023
• Online AM: 3 February 2022
• Accepted: 11 September 2021
• Revised: 1 September 2021
• Received: 7 April 2021
