
Automatic Diacritics Restoration for Tunisian Dialect

Published: 12 July 2019

Abstract

Modern Standard Arabic, as well as the Arabic dialects, is usually written without diacritics. The absence of these marks constitutes a real problem for the automatic processing of such data by NLP tools. Indeed, writing Arabic without diacritics introduces several types of ambiguity. First, a word without diacritics could have many possible meanings depending on its diacritization. Second, the undiacritized surface form of an Arabic word might have as many as 200 readings depending on the complexity of its morphology [12]. In fact, the agglutination property of Arabic might produce a problem that can only be resolved using diacritics. Third, without diacritics a word could have many possible parts of speech (POS) instead of one. This is the case with words that have the same spelling and POS tag but a different lexical sense, or words that have the same spelling but different POS tags and lexical senses [8]. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this article, we present the first work that investigates the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. Then, we propose two models: a statistical machine translation (SMT) model and a discriminative model that treats diacritization as a sequence classification task based on Conditional Random Fields (CRF). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The CRF system achieved the better score, with a Word Error Rate (WER) of 21.44% compared with 34.6% for SMT.
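To make the task framing concrete, the following is a minimal Python sketch of three ideas the abstract relies on: removing diacritics to obtain the undiacritized input, splitting a diacritized word into per-character labels (the shape of data a character-level sequence classifier such as a CRF would be trained on), and scoring a system output with a word-level error rate. The function names, the choice of the Unicode range U+064B to U+0652 as the diacritic set, and the WER definition used here (share of words whose diacritized form differs from the reference, with the two texts aligned word by word) are assumptions made for illustration; the abstract does not specify the authors' label set or their exact scoring procedure.

```python
# Assumption: the diacritic inventory is the Arabic tashkeel block
# U+064B (fathatan) through U+0652 (sukun). The article may use a different set.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Return the text with every diacritic mark removed."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def to_char_labels(word: str) -> list[tuple[str, str]]:
    """Split a diacritized word into (base character, diacritic label) pairs.

    The label is the string of diacritics that follows the base character,
    or the empty string when the character carries no diacritic.
    """
    pairs: list[list[str]] = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            pairs[-1][1] += ch      # attach the mark to the preceding letter
        else:
            pairs.append([ch, ""])  # start a new base character
    return [(base, label) for base, label in pairs]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Share of words whose diacritized form differs from the reference.

    Assumes both texts contain the same words in the same order, which holds
    when the system only adds diacritics to a fixed input.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same word count")
    if not ref_words:
        return 0.0
    errors = sum(r != h for r, h in zip(ref_words, hyp_words))
    return errors / len(ref_words)
```

For instance, if a two-word reference and a system output differ only in the final diacritic of the second word, `word_error_rate` returns 0.5, while a character-level measure would penalize just that one mark. This difference is one reason scores are often reported at both the word and the character level.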

References

  1. G. Abandah, A. Graves, B. Al-Shagoor, A. Arabiyat, F. Jamour, and M. Al-Taee. 2015. Automatic diacritization of Arabic text using recurrent neural networks. Int. J. Doc. Anal. Recogn. 18, 2 (June 2015), 183--197.
  2. A. Ahmed and M. Elaraby. 2000. A large-scale computational processor of the Arabic morphology and applications. PhD thesis, Faculty of Engineering, Cairo University, Giza, Egypt.
  3. A. Al-Taani and S. Abu Al-Rub. 2009. A rule-based approach for tagging non-vocalized Arabic words. Int. Arab J. Info. Technol. 6, 3 (2009), 320--328.
  4. Y. A. Alotaibi, A. H. Meftah, and S. A. Selouani. 2013. Diacritization, automatic segmentation and labeling for Levantine Arabic speech. In Proceedings of the Digital Signal Processing and Signal Processing Education Meeting (DSP/SPE’13).
  5. M. Al-Badrashiny, A. Hawwari, and M. Diab. 2017. A layered language model-based hybrid approach to automatic full diacritization of Arabic. In Proceedings of the 3rd Arabic Natural Language Processing Workshop (WANLP’17).
  6. M. Ameur, Y. Moulahoum, and A. Guessoum. 2015. Restoration of Arabic diacritics using a multilevel statistical model. In Proceedings of the International Federation for Information Processing (IFIP’15).
  7. A. Z. Ayman, M. Elmahdy, H. Husni, and J. Al Jaam. 2016. Automatic diacritics restoration for Arabic text. International J. Comput. Info. Sci. 12, 2 (Dec. 2016), 159--165.
  8. A. Azmi and R. Almajed. 2015. A survey of automatic Arabic diacritization techniques. Nat. Lang. Eng. 21, 477--495.
  9. T. Baccouche. 2003. L’arabe, d’une koinè dialectale à une langue de culture, Memoires de la societe linguistique de Paris, Tome XI (les langues de Communication), 87--93.
  10. T. Baccouche. 1994. L’emprunt en arabe moderne, Beit Elhikma et IBLV.
  11. Y. Belinkov and J. Glass. 2015. Arabic diacritization with recurrent neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  12. H. Bouamor, W. Zaghouani, M. Diab, O. Obeid, O. Kemal, M. Ghoneim, and A. Hawwari. 2015. A pilot study on Arabic multi-genre corpus diacritization annotation. In Proceedings of the 2nd Workshop on Arabic Natural Language Processing.
  13. R. Boujelbane, M. Mallek, M. Ellouze, and L. Belguith. 2014. Fine-grained (POS) tagging of spoken Tunisian dialect corpora. In Proceedings of the International Conference on Applications of Natural Language to Information Systems (NLDB’14).
  14. P. Brown, S. Pietra, V. Pietra, and R. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 19, 2 (1993), 263--311.
  15. M. Diab, M. Ghoneim, and N. Habash. 2007. Arabic diacritization in the context of statistical machine translation. In Proceedings of MTSummit.
  16. M. Elshafei, H. Al-muhtaseb, and M. Alghamdi. 2006. Statistical methods for automatic diacritization of Arabic text. In Proceedings of Saudi 18th National Computer Conference (NCC’06).
  17. A. Fashwan and S. Alansary. 2017. SHAKKIL: An automatic diacritization system for modern standard Arabic texts. In Proceedings of the 3rd Arabic Natural Language Processing Workshop (WANLP’17).
  18. Y. Gal. 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages (SEMITIC’02).
  19. M. L. Gibson. 1998. Dialect Contact in Tunisian Arabic: Sociolinguistic and Structural Aspects. PhD thesis. Department of linguistic science, University of Reading.
  20. M. Graja, M. Jaoua, and L. Belguith. 2015. Statistical framework with knowledge base integration for robust speech understanding of the Tunisian dialect. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23, 12 (2015), 2311--2321.
  21. N. Habash, A. Shahrour, and M. Al-Khalil. 2016. Exploiting Arabic diacritization for high-quality automatic annotation. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16).
  22. N. Habash, M. Diab, and O. Rambow. 2012. Conventional orthography for dialectal Arabic. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12).
  23. N. Habash and O. Rambow. 2007. Arabic diacritization through full morphological tagging. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.
  24. O. Hamed and T. Zesch. 2017. A survey and comparative study of Arabic diacritization tools. J. Lang. Technol. Computat. Linguist. 32, 1.
  25. S. Harrat, M. Abbas, K. Meftouh, K. Smaili, E. N. S. Bouzareah, and C. Loria. 2013. Diacritics restoration for Arabic dialect texts. In Proceedings of the 14th Annual Conference of the International Speech Communication.
  26. E. Hermena, D. Drieghe, S. Hellmuth, and P. Simon. 2015. Processing of Arabic diacritical marks: Phonological syntactic disambiguation of homographic verbs and visual crowding effects. J. Exper. Psychol. Hum. Percept. Perform. 41, 494--507.
  27. Y. Hifny. 2012. Higher order n-gram language models for Arabic diacritics restoration. In Proceedings of the 12th Conference on Language Engineering (ESOLEC’12).
  28. C. Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown University Press, Washington.
  29. M. Jarrar, N. Habash, D. Akra, N. Zalmout, and W. Bank. 2014. Building a corpus for Palestinian Arabic: A preliminary study. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP’14). 18--27.
  30. K. Kirchhoff and D. Vergyri. 2005. Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition. Speech Commun. 46, 37--51.
  31. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, and N. Bertoldi. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Association for Computational Linguistics (ACL’07).
  32. A. Kubra and G. Eryigit. 2014. Vowel and diacritic restoration for social media texts. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM’14).
  33. J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML’01). 282--289.
  34. M. Maamouri, A. Bies, and S. Kulick. 2006. Diacritization: A challenge to Arabic treebank annotation and parsing. In Proceedings of the British Computer Society Arabic NLP/MT Conference.
  35. M. Maamouri, A. Bies, and S. Kulick. 2008. Enhancing the Arabic treebank: A collaborative effort toward new annotation guidelines. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC’08).
  36. A. Masmoudi, F. Bougares, M. Khmekhem, Y. Estève, and L. Belguith. 2018. Automatic speech recognition system for Tunisian dialect. J. Lang. Resour. Eval. 52, 249--267.
  37. A. Masmoudi, M. Khmekhem, Y. Estève, F. Bougares, and L. Belguith. 2014. Phonetic tool for the Tunisian Arabic. In Proceedings of the 4th International Workshop on Spoken Language Technologies for Under-resourced Languages.
  38. A. Masmoudi, M. Khmekhem, Y. Estève, F. Bougares, L. Belguith, and N. Habash. 2014. A corpus and a phonetic dictionary for Tunisian Arabic speech recognition. In Proceedings of the 19th Language Resources and Evaluation Conference.
  39. A. Masmoudi, N. Habash, M. Khmekhem, Y. Estève, and L. Belguith. 2015. Arabic transliteration of romanized Tunisian dialect text: A preliminary investigation. In Proceedings of the 16th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing’15).
  40. S. Mejri, M. Said, and I. Sfar. 2009. Plurilinguisme et diglossie en Tunisie. Synerg. Tunisie 1, 53--74.
  41. F. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist. 29, 1, 19--52.
  42. M. Rashwan, A. Al Sallab, H. Raafat, and A. Rafea. 2015. Deep learning framework with confused sub set resolution architecture for automatic Arabic diacritization. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23, 3 (2015).
  43. H. Saadane and N. Habash. 2015. A conventional orthography for Algerian Arabic. In Proceedings of the 2nd Workshop on Arabic Natural Language Processing.
  44. I. Sfar. 2005. Morphologie des noms de professions : Incorporation et paraphrase, La terminologie, entre traduction et bilinguisme (2005), 15--16.
  45. K. Shaalan, M. Abo Bakr, and I. Ziedan. 2009. A hybrid approach for building Arabic diacritizer. In Proceedings of the Workshop on Computational Approaches to Semitic Languages (EACL’09).
  46. K. Shaalan, H. Abo Bakr, and I. Ziedan. 2008. A statistical method for adding case ending diacritics for Arabic text. In Proceedings of the Language Engineering Conference.
  47. T. Schlippe. 2008. Statistical methods for automatic diacritization of Arabic texts. Carnegie Mellon University, Pittsburgh, PA.
  48. A. Stolcke. 2002. SRILM an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’02).
  49. F. Talmoudi. 1980. A morphosyntactic study of Romance verbs in the Arabic dialects of Tunis, Sousa, and Sfax. Göteborg, Acta Universitatis Gothoburgensis.
  50. M. Tilmatine. 1999. Substrat Et Convergences: Le Berbère Et L’arabe Nord-Africain. In Estudios de Dialectologia Norteafricana Y Andalusi, M. Haak, R. Jong, K. De Versteegh (Eds.).
  51. D. Wang and S. King. 2011. Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Process. Lett. 18, 2 (2011), 122--125.
  52. W. Zaghouani, H. Bouamor, A. Hawwari, M. Diab, O. Obeid, M. Ghoneim, S. Alqahtani, and K. Oflazer. 2016. Guidelines and framework for a large-scale Arabic diacritized corpus. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16).
  53. W. Zaghouani, N. Habash, O. Obeid, B. Mohit, H. Bouamor, and K. Oflazer. 2016. Building an Arabic machine translation post-edited corpus: Guidelines and annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’16).
  54. W. Zaghouani, N. Habash, H. Bouamor, A. Rozovskaya, B. Mohit, A. Heider, and K. Oflazer. 2015. Correction annotation for non-native Arabic texts: Guidelines and corpus. In Proceedings of the Association for Computational Linguistics Fourth Linguistic Annotation Workshop.
  55. I. Zitouni, J. Sorensen, and R. Sarikaya. 2006. Maximum entropy-based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.
  56. I. Zitouni and R. Sarikaya. 2009. Arabic diacritic restoration approach based on maximum entropy models. J. Comput. Speech Lang. 23 (2009), 257--276.
  57. I. Zribi, R. Boujelbane, A. Masmoudi, M. Ellouze, L. Belguith, and N. Habash. 2014. A conventional orthography for Tunisian Arabic. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14).
  58. I. Zribi, M. Khmekhem, L. Belguith, and P. Blache. 2017. Morphological disambiguation of Tunisian dialect. J. King Saud Univ. Comput. Info. Sci. 29, 147--155.
  59. I. Zribi, M. Ellouze, L. H. Belguith, and P. Blache. 2015. Spoken Tunisian Arabic Corpus “STAC”: Transcription and Annotation. Res. Comput. Sci. 90 (2015), 123--135.
  60. I. Zribi, M. Graja, M. E. Khemakhem, M. Jaoua, and L. Belguith. 2013. Orthographic transcription for spoken Tunisian Arabic. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing (CICLing’13), A. Gelbukh (Ed.).
