Abstract
Words in Arabic consist of letters and short vowel symbols, called diacritics, written above or below the letters. Changing the diacritics may change the syntax and semantics of a word, turning it into another word. This makes it difficult to compare words based solely on string matching. Typically, Arabic NLP applications resort to morphological analysis to resolve the ambiguity originating from this and other challenges. In this article, we introduce three alternative algorithms to compare two words with possibly different diacritics. We propose the Subsume knowledge-based algorithm, the Imply rule-based algorithm, and the Alike machine-learning-based algorithm. We evaluated the soundness, completeness, and accuracy of the algorithms against a large dataset of 86,886 word pairs. Our evaluation shows that the accuracy is 100% for Subsume, 99.32% for Imply, and 99.53% for Alike. Although accurate, Subsume was able to judge only 75% of the data. Both Subsume and Imply are sound, while Alike is not. We demonstrate the utility of the algorithms in real-life use cases: lemma disambiguation and linking hundreds of Arabic dictionaries.
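To illustrate why plain string matching falls short, the following is a minimal Python sketch of a diacritic-aware comparison. It is not the Subsume, Imply, or Alike algorithm, whose details are not given in the abstract. The function names (`split_word`, `strip_diacritics`, `compatible`) and the compatibility rule are assumptions made for illustration: two words are treated as compatible when their letters match and, for each letter, the diacritics on one word are a subset of those on the other.

```python
# Illustrative sketch only; NOT the paper's Subsume, Imply, or Alike algorithm.
# Assumption: diacritics are the marks U+064B..U+0652
# (tanween forms, fatha, damma, kasra, shadda, sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_word(word):
    """Return a list of (letter, frozenset of diacritics attached to it)."""
    units = []
    for ch in word:
        if ch in DIACRITICS:
            if units:  # a diacritic with no preceding letter is ignored
                letter, marks = units[-1]
                units[-1] = (letter, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def strip_diacritics(word):
    """Remove all diacritics, keeping only the base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def compatible(a, b):
    """True if letters match and per-letter diacritics never conflict.

    Hypothetical rule: the marks on one letter must be a subset of the
    marks on the corresponding letter (an unmarked letter fits anything).
    """
    ua, ub = split_word(a), split_word(b)
    if len(ua) != len(ub):
        return False
    for (la, ma), (lb, mb) in zip(ua, ub):
        if la != lb:
            return False
        if not (ma <= mb or mb <= ma):
            return False
    return True


bare = "\u0639\u0644\u0645"                    # علم, no diacritics
knew = "\u0639\u064E\u0644\u0650\u0645\u064E"  # عَلِمَ "he knew"
knowledge = "\u0639\u0650\u0644\u0652\u0645"   # عِلْم "knowledge"

print(knew == knowledge)                                       # exact match
print(strip_diacritics(knew) == strip_diacritics(knowledge))   # stripped match
print(compatible(bare, knew), compatible(knew, knowledge))     # diacritic-aware
```

Exact matching treats every diacritic difference as a mismatch, while stripping diacritics conflates distinct words such as the two above. A per-letter check sits between the two extremes, though a real system would need linguistic rules well beyond this sketch.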
Index Terms
Diacritic-Based Matching of Arabic Words