
Transliteration of Arabizi into Arabic Script for Tunisian Dialect

Published: 28 November 2019

Abstract

The evolution of information and communication technology has markedly influenced communication between correspondents. This evolution has facilitated the transmission of information and has engendered new forms of written communication (email, chat, SMS, comments, etc.). Most of these messages and comments are written in Latin script, a practice also called Arabizi. Moreover, the language used in social media and SMS messaging is characterized by informal and non-standard vocabulary, such as repeated letters for emphasis, typos, non-standard abbreviations, and nonlinguistic content like emoticons. Since the Tunisian dialect lacks the basic tools and linguistic resources available for Modern Standard Arabic, we use these written sources as a starting point to build large corpora automatically. In the context of natural language processing, and to benefit from these networks’ data, transliterating from Arabizi to Arabic script is a necessary step because most recently available tools for processing the Tunisian dialect expect Arabic script input. The transliteration task can also help construct and enrich parallel corpora and dictionaries for the Tunisian dialect and can be useful for developing various natural language processing applications such as sentiment analysis, opinion mining, topic detection, and machine translation. In this article, we focus on converting Tunisian dialect text written in Latin script to Arabic script following the Conventional Orthography for Dialectal Arabic. We propose two models to transliterate Arabizi into Arabic script for the Tunisian dialect: a rule-based model and a discriminative model that treats transliteration as a sequence classification task based on conditional random fields. In the first model, we use a set of transliteration rules to convert Tunisian dialect Arabizi texts to Arabic script. In the second model, transliteration is performed at both the word and character levels. Our models achieved a character error rate of 10.47%.
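
For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate is commonly computed: the total character-level edit distance between system outputs and reference transliterations, divided by the total number of reference characters. The abstract does not state the exact formula or tooling the authors used, so the function names `edit_distance` and `character_error_rate` and the corpus-level averaging are assumptions made for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions that turn hypothesis into reference."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # character missing from hypothesis
                            curr[j - 1] + 1,      # extra character in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses) -> float:
    """Corpus-level CER (assumed definition): total edits / total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(ref) for ref in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(ref, hyp)
                      for ref, hyp in zip(references, hypotheses))
    return total_edits / total_chars
```

Under this definition, a score of 10.47% would mean roughly one character edit for every ten reference characters. Note that Python strings are compared by code point here, so any normalization of the Arabic script output (for example, handling of diacritics) would have to be applied before scoring.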

