Abstract
Deep learning currently gives some of the strongest results in machine translation, and it performs impressively when large amounts of parallel data are available, as is the case for well-resourced languages. For low-resource languages such as the Arabic dialects, however, deep learning models perform poorly because parallel corpora are scarce. In this article, we present a method for creating a parallel corpus and use it to build an effective neural machine translation (NMT) model that translates Tunisian Dialect text from social networks into Modern Standard Arabic (MSA). To this end, we propose a set of data augmentation methods that increase the size of the state-of-the-art parallel corpus. Evaluating this step shows that it improves both the size and the quality of the corpus. Using the resulting corpus, we then compare the effectiveness of CNN, RNN, and transformer models for translating Tunisian Dialect into MSA. Experiments show that the transformer model achieves the best translation, with a BLEU score of 60, compared with 33.36 for the RNN model and 53.98 for the CNN model.
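The abstract reports BLEU scores but does not say which scoring tool or tokenization was used. As a rough illustration of what the metric measures, the sketch below computes an unsmoothed corpus-level BLEU from clipped n-gram precisions and a brevity penalty, assuming one reference per hypothesis and pre-tokenized input. The names `ngrams` and `corpus_bleu` are hypothetical and this is not the authors' evaluation script.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Inputs are lists of token lists. No smoothing is applied, so the
    score is 0 whenever some n-gram order has no matches at all.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Placeholder tokens, not data from the paper.
    hyps = [["a", "b", "c", "d"]]
    refs = [["a", "b", "c", "d", "e"]]
    print(corpus_bleu(hyps, refs))
```

In this toy case every n-gram of the hypothesis appears in the reference, so the score is lowered only by the brevity penalty. Published scores usually come from a standard tool with fixed tokenization, so values from a sketch like this are not directly comparable to the figures quoted above.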
Index Terms
Hybrid Pipeline for Building Arabic Tunisian Dialect-standard Arabic Neural Machine Translation Model from Scratch