research-article

Hybrid Pipeline for Building Arabic Tunisian Dialect-standard Arabic Neural Machine Translation Model from Scratch

Published: 14 April 2023

Abstract

Deep Learning is one of the most promising approaches to machine translation, and it has been shown to achieve impressive results when large amounts of parallel data are available for resource-rich languages. For low-resource languages such as the Arabic dialects, however, Deep Learning models perform poorly due to the lack of available parallel corpora. In this article, we present a method for creating a parallel corpus and using it to build an effective neural machine translation (NMT) model that translates Tunisian Dialect texts found on social networks into Modern Standard Arabic (MSA). To this end, we propose a set of data augmentation methods aimed at increasing the size of the state-of-the-art parallel corpus. Evaluating the impact of this step, we observed that it effectively boosted both the size and the quality of the corpus. Using the resulting corpus, we then compare the effectiveness of CNN, RNN, and transformer models for translating Tunisian Dialect into MSA. Experiments show that the transformer model achieves the best translation, with a BLEU score of 60, compared with 33.36 for the RNN model and 53.98 for the CNN model.
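The models above are compared by BLEU score, which measures n-gram overlap between system translations and reference translations. As a minimal, self-contained sketch of corpus-level BLEU (uniform weights over 1- to 4-grams, one reference per hypothesis; this is an illustration, not the evaluation code used in the article, and standard toolkits such as sacreBLEU additionally handle tokenization and smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), uniform n-gram weights, single reference."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # total hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = Counter(ngrams(hyp, n))
            ref_counts = Counter(ngrams(ref, n))
            totals[n - 1] += sum(hyp_counts.values())
            # Clip each n-gram's count by its count in the reference.
            clipped[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    if min(clipped) == 0:
        return 0.0  # no smoothing: any empty precision zeroes the score
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    # Brevity penalty: punish hypotheses shorter than the references.
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))
    return 100 * brevity * math.exp(log_precision)

hyp = ["the cat sat on the mat".split()]
ref = ["the cat sat on the mat".split()]
print(corpus_bleu(hyp, ref))  # perfect match → 100.0
```

A score of 60, as reported for the transformer model, thus indicates a high degree of n-gram overlap with the MSA references relative to the RNN and CNN baselines.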



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3
  March 2023, 570 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3579816


  Publisher

  Association for Computing Machinery, New York, NY, United States

  Publication History

  • Published: 14 April 2023
  • Online AM: 2 November 2022
  • Accepted: 5 October 2022
  • Received: 26 January 2022
