Abstract
Modern Standard Arabic, like the Arabic dialects, is usually written without diacritics. The absence of these marks is a real problem for the automatic processing of such text by NLP tools, because writing Arabic without diacritics introduces several types of ambiguity. First, a word without diacritics can have many possible meanings, depending on its diacritization. Second, the undiacritized surface form of an Arabic word might have as many as 200 readings, depending on the complexity of its morphology [12]; the agglutinative nature of Arabic can produce ambiguity that only diacritics can resolve. Third, without diacritics a word can have several possible parts of speech (POS) instead of one. This is the case for words that have the same spelling and POS tag but different lexical senses, and for words that have the same spelling but different POS tags and lexical senses [8]. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this article, we present the first work to investigate the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. We then propose two models: a statistical machine translation (SMT) model and a discriminative model that treats diacritization as a sequence classification task based on Conditional Random Fields (CRF). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The CRF system achieved the better results, with a Word Error Rate (WER) of 21.44% against 34.6% for the SMT system.
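As a rough illustration of the character-level framing and of the WER figures quoted above, the following Python sketch turns a diacritized word into per-letter labels (the kind of input a sequence labeller such as a CRF could be trained on) and scores a hypothesis against a reference. It is not the authors' implementation. The helper names `to_char_labels`, `strip_diacritics`, and `word_error_rate` are hypothetical, the diacritic set is assumed to be the Unicode range U+064B to U+0652, and WER is assumed to mean the fraction of words whose diacritized form differs from the reference, which the abstract does not spell out.

```python
# Assumption: the diacritics of interest are the Arabic marks U+064B..U+0652
# (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Return the text with every diacritic mark removed."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def to_char_labels(word):
    """Split a diacritized word into (letter, diacritics) pairs.

    Each letter is paired with the (possibly empty) string of marks that
    follow it, so diacritization becomes a per-character labelling task.
    A mark with no preceding letter is ignored.
    """
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:
                letter, marks = pairs[-1]
                pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def word_error_rate(reference, hypothesis):
    """Fraction of whitespace-separated words that differ from the reference.

    Assumption: both strings contain the same number of words, which holds
    when the hypothesis is produced by adding marks to the reference letters.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same word count")
    if not ref_words:
        return 0.0
    errors = sum(ref != hyp for ref, hyp in zip(ref_words, hyp_words))
    return errors / len(ref_words)
```

For instance, `to_char_labels` applied to a three-letter word in which each letter carries a fatha yields three pairs, each holding one letter and the fatha mark. A POS tag for the containing word could be attached to each pair as an extra feature, in the spirit of the second approach described in the abstract.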
- G. Abandah, A. Graves, B. Al-Shagoor, A. Arabiyat, F. Jamour, and M. Al-Taee. 2015. Automatic diacritization of Arabic text using recurrent neural networks. Int. J. Doc. Anal. Recogn. 18, 2 (June 2015), 183--197.
- A. Ahmed and M. Elaraby. 2000. A large-scale computational processor of the Arabic morphology and applications. PhD thesis, Faculty of Engineering, Cairo University, Giza, Egypt.
- A. Al-Taani and S. Abu Al-Rub. 2009. A rule-based approach for tagging non-vocalized Arabic words. Int. Arab J. Info. Technol. 6, 3 (2009), 320--328.
- Y. A. Alotaibi, A. H. Meftah, and S. A. Selouani. 2013. Diacritization, automatic segmentation and labeling for Levantine Arabic speech. In Proceedings of the Digital Signal Processing and Signal Processing Education Meeting (DSP/SPE’13).
- M. Al-Badrashiny, A. Hawwari, and M. Diab. 2017. A layered language model-based hybrid approach to automatic full diacritization of Arabic. In Proceedings of the 3rd Arabic Natural Language Processing Workshop (WANLP’17).
- M. Ameur, Y. Moulahoum, and A. Guessoum. 2015. Restoration of Arabic diacritics using a multilevel statistical model. In Proceedings of the International Federation for Information Processing (IFIP’15).
- A. Z. Ayman, M. Elmahdy, H. Husni, and J. Al Jaam. 2016. Automatic diacritics restoration for Arabic text. Int. J. Comput. Info. Sci. 12, 2 (Dec. 2016), 159--165.
- A. Azmi and R. Almajed. 2015. A survey of automatic Arabic diacritization techniques. Nat. Lang. Eng. 21, 477--495.
- T. Baccouche. 2003. L’arabe, d’une koinè dialectale à une langue de culture. Mémoires de la Société de linguistique de Paris, Tome XI (Les langues de communication), 87--93.
- T. Baccouche. 1994. L’emprunt en arabe moderne. Beit Elhikma et IBLV.
- Y. Belinkov and J. Glass. 2015. Arabic diacritization with recurrent neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
- H. Bouamor, W. Zaghouani, M. Diab, O. Obeid, O. Kemal, M. Ghoneim, and A. Hawwari. 2015. A pilot study on Arabic multi-genre corpus diacritization annotation. In Proceedings of the 2nd Workshop on Arabic Natural Language Processing.
- R. Boujelbane, M. Mallek, M. Ellouze, and L. Belguith. 2014. Fine-grained (POS) tagging of spoken Tunisian dialect corpora. In Proceedings of the International Conference on Applications of Natural Language to Information Systems (NLDB’14).
- P. Brown, S. Pietra, V. Pietra, and R. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 19, 2 (1993), 263--311.
- M. Diab, M. Ghoneim, and N. Habash. 2007. Arabic diacritization in the context of statistical machine translation. In Proceedings of MTSummit.
- M. Elshafei, H. Al-muhtaseb, and M. Alghamdi. 2006. Statistical methods for automatic diacritization of Arabic text. In Proceedings of Saudi 18th National Computer Conference (NCC’06).
- A. Fashwan and S. Alansary. 2017. SHAKKIL: An automatic diacritization system for modern standard Arabic texts. In Proceedings of the 3rd Arabic Natural Language Processing Workshop (WANLP’17).
- Y. Gal. 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages (SEMITIC’02).
- M. L. Gibson. 1998. Dialect Contact in Tunisian Arabic: Sociolinguistic and Structural Aspects. PhD thesis, Department of Linguistic Science, University of Reading.
- M. Graja, M. Jaoua, and L. Belguith. 2015. Statistical framework with knowledge base integration for robust speech understanding of the Tunisian dialect. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23, 12 (2015), 2311--2321.
- N. Habash, A. Shahrour, and M. Al-Khalil. 2016. Exploiting Arabic diacritization for high-quality automatic annotation. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16).
- N. Habash, M. Diab, and O. Rambow. 2012. Conventional orthography for dialectal Arabic. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12).
- N. Habash and O. Rambow. 2007. Arabic diacritization through full morphological tagging. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.
- O. Hamed and T. Zesch. 2017. A survey and comparative study of Arabic diacritization tools. J. Lang. Technol. Computat. Linguist. 32, 1.
- S. Harrat, M. Abbas, K. Meftouh, K. Smaili, E. N. S. Bouzareah, and C. Loria. 2013. Diacritics restoration for Arabic dialect texts. In Proceedings of the 14th Annual Conference of the International Speech Communication.
- E. Hermena, D. Drieghe, S. Hellmuth, and P. Simon. 2015. Processing of Arabic diacritical marks: Phonological syntactic disambiguation of homographic verbs and visual crowding effects. J. Exper. Psychol. Hum. Percept. Perform. 41, 494--507.
- Y. Hifny. 2012. Higher order n-gram language models for Arabic diacritics restoration. In Proceedings of the 12th Conference on Language Engineering (ESOLEC’12).
- C. Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown University Press, Washington.
- M. Jarrar, N. Habash, D. Akra, N. Zalmout, and W. Bank. 2014. Building a corpus for Palestinian Arabic: A preliminary study. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP’14). 18--27.
- K. Kirchhoff and D. Vergyri. 2005. Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition. Speech Commun. 46, 37--51.
- P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, and N. Bertoldi. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Association for Computational Linguistics (ACL’07).
- A. Kubra and G. Eryigit. 2014. Vowel and diacritic restoration for social media texts. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM’14).
- J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML’01). 282--289.
- M. Maamouri, A. Bies, and S. Kulick. 2006. Diacritization: A challenge to Arabic treebank annotation and parsing. In Proceedings of the British Computer Society Arabic NLP/MT Conference.
- M. Maamouri, A. Bies, and S. Kulick. 2008. Enhancing the Arabic treebank: A collaborative effort toward new annotation guidelines. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC’08).
- A. Masmoudi, F. Bougares, M. Khmekhem, Y. Estève, and L. Belguith. 2018. Automatic speech recognition system for Tunisian dialect. J. Lang. Resour. Eval. 52, 249--267.
- A. Masmoudi, M. Khmekhem, Y. Estève, F. Bougares, and L. Belguith. 2014. Phonetic tool for the Tunisian Arabic. In Proceedings of the 4th International Workshop on Spoken Language Technologies for Under-resourced Languages.
- A. Masmoudi, M. Khmekhem, Y. Estève, F. Bougares, L. Belguith, and N. Habash. 2014. A corpus and a phonetic dictionary for Tunisian Arabic speech recognition. In Proceedings of the 19th Language Resources and Evaluation Conference.
- A. Masmoudi, N. Habash, M. Khmekhem, Y. Estève, and L. Belguith. 2015. Arabic transliteration of romanized Tunisian dialect text: A preliminary investigation. In Proceedings of the 16th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing’15).
- S. Mejri, M. Said, and I. Sfar. 2009. Plurilinguisme et diglossie en Tunisie. Synerg. Tunisie 1, 53--74.
- F. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist. 29, 1, 19--52.
- M. Rashwan, A. Al Sallab, H. Raafat, and A. Rafea. 2015. Deep learning framework with confused sub set resolution architecture for automatic Arabic diacritization. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23, 3 (2015).
- H. Saadane and N. Habash. 2015. A conventional orthography for Algerian Arabic. In Proceedings of the 2nd Workshop on Arabic Natural Language Processing.
- I. Sfar. 2005. Morphologie des noms de professions : Incorporation et paraphrase, La terminologie, entre traduction et bilinguisme (2005), 15--16.
- K. Shaalan, M. Abo Bakr, and I. Ziedan. 2009. A hybrid approach for building Arabic diacritizer. In Proceedings of the Workshop on Computational Approaches to Semitic Languages (EACL’09).
- K. Shaalan, H. Abo Bakr, and I. Ziedan. 2008. A statistical method for adding case ending diacritics for Arabic text. In Proceedings of the Language Engineering Conference.
- T. Schlippe. 2008. Statistical methods for automatic diacritization of Arabic texts. Carnegie Mellon University, Pittsburgh, PA.
- A. Stolcke. 2002. SRILM: An extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’02).
- F. Talmoudi. 1980. A morphosyntactic study of Romance verbs in the Arabic dialects of Tunis, Sousa, and Sfax. Göteborg, Acta Universitatis Gothoburgensis.
- M. Tilmatine. 1999. Substrat Et Convergences: Le Berbère Et L’arabe Nord-Africain. In Estudios de Dialectologia Norteafricana Y Andalusi, M. Haak, R. Jong, K. De Versteegh (Eds.).
- D. Wang and S. King. 2011. Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Process. Lett. 18, 2 (2011), 122--125.
- W. Zaghouani, H. Bouamor, A. Hawwari, M. Diab, O. Obeid, M. Ghoneim, S. Alqahtani, and K. Oflazer. 2016. Guidelines and framework for a large-scale Arabic diacritized corpus. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16).
- W. Zaghouani, N. Habash, O. Obeid, B. Mohit, H. Bouamor, and K. Oflazer. 2016. Building an Arabic machine translation post-edited corpus: Guidelines and annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’16).
- W. Zaghouani, N. Habash, H. Bouamor, A. Rozovskaya, B. Mohit, A. Heider, and K. Oflazer. 2015. Correction annotation for non-native Arabic texts: Guidelines and corpus. In Proceedings of the Association for Computational Linguistics Fourth Linguistic Annotation Workshop.
- I. Zitouni, J. Sorensen, and R. Sarikaya. 2006. Maximum entropy-based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.
- I. Zitouni and R. Sarikaya. 2009. Arabic diacritic restoration approach based on maximum entropy models. J. Comput. Speech Lang. 23 (2009), 257--276.
- I. Zribi, R. Boujelbane, A. Masmoudi, M. Ellouze, L. Belguith, and N. Habash. 2014. A conventional orthography for Tunisian Arabic. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14).
- I. Zribi, M. Khmekhem, L. Belguith, and P. Blache. 2017. Morphological disambiguation of Tunisian dialect. J. King Saud Univ. Comput. Info. Sci. 29, 147--155.
- I. Zribi, M. Ellouze, L. H. Belguith, and P. Blache. 2015. Spoken Tunisian Arabic Corpus “STAC”: Transcription and Annotation. Res. Comput. Sci. 90 (2015), 123--135.
- I. Zribi, M. Graja, M. E. Khemakhem, M. Jaoua, and L. Belguith. 2013. Orthographic transcription for spoken Tunisian Arabic. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing (CICLing’13), A. Gelbukh (Ed.).
Index Terms
Automatic Diacritics Restoration for Tunisian Dialect