Abstract
The evolution of information and communication technology has markedly influenced communication between correspondents. It has facilitated the transmission of information and has engendered new forms of written communication (email, chat, SMS, comments, etc.). Most of these messages and comments are written in Latin script, a form of writing also called Arabizi. Moreover, the language used in social media and SMS messaging is characterized by informal and non-standard vocabulary, such as repeated letters for emphasis, typos, non-standard abbreviations, and nonlinguistic content like emoticons. Since the Tunisian dialect suffers from a lack of basic tools and linguistic resources compared to Modern Standard Arabic, we use these written sources as a starting point to build large corpora automatically. In the context of natural language processing, and to benefit from social media data, transliterating from Arabizi to Arabic script is a necessary step because most recently available tools for processing the Tunisian dialect expect Arabic script input. Indeed, transliteration can help construct and enrich parallel corpora and dictionaries for the Tunisian dialect and can be useful for developing various natural language processing applications such as sentiment analysis, opinion mining, topic detection, and machine translation. In this article, we focus on converting Tunisian dialect text written in Latin script to Arabic script following the Conventional Orthography for Dialectal Arabic. We propose two models to transliterate Arabizi into Arabic script for the Tunisian dialect: a rule-based model and a discriminative model that treats transliteration as a sequence classification task based on conditional random fields. In the first model, we use a set of transliteration rules to convert Tunisian dialect Arabizi texts to Arabic script. In the second model, transliteration is performed at both the word and character levels.
Overall, our models achieve a character error rate of 10.47%.
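The abstract does not say how the character error rate was computed. A minimal sketch, assuming the common definition (total character-level edit distance divided by the total number of reference characters), could look as follows. The function names and the sample word pair are illustrative assumptions, not taken from the article.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn hyp into ref (standard dynamic-programming edit distance)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character missing in hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(references: Sequence[str],
                         hypotheses: Sequence[str]) -> float:
    """Assumed CER definition: summed edit distance / summed reference length."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


# Illustrative pair: one wrong final letter in a four-letter word.
cer = character_error_rate(["برشا"], ["برشة"])
print(f"{cer:.2%}")
```

Under this definition, a score of 10.47% would correspond to roughly one character edit for every ten reference characters.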
Index Terms
Transliteration of Arabizi into Arabic Script for Tunisian Dialect