Abstract
Diacritic restoration (also known as diacritization or vowelization) is the process of inserting the correct diacritical marks into a text. Modern Arabic, as found in newspapers for example, is typically written without diacritics. This absence often causes ambiguity, and although native readers are adept at resolving it, they sometimes fail. Diacritic restoration is a classical problem in computer science, but most existing work tackles full (heavy) diacritization of the text, whereas we are interested in diacritizing the text with fewer diacritics. Studies have shown that a fully diacritized text is visually displeasing and slows down reading. This article proposes a system that diacritizes homographs using the smallest number of diacritics, hence the name “light.” Many kinds of words fall under the homograph category; we deal with the class of words that share a spelling but not a meaning. With fewer diacritics, we expect no effect on reading speed and reduced eye strain. The system combines a morphological analyzer with context similarities. The morphological analyzer generates all candidate diacritizations of a word. The homographs are then resolved through a statistical approach and context similarities. Experimentally, the system shows very promising results, with a best accuracy of 85.6%.
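The idea of "light" diacritization can be illustrated with a small sketch. It is not the system described in the article: it assumes the candidate diacritizations have already been produced (for instance by a morphological analyzer) and that the intended reading has already been chosen, and it only shows one possible way to keep the fewest diacritics that still separate that reading from its rivals. The function names `split_marks` and `light_diacritize` and the brute-force search are illustrative assumptions.

```python
from itertools import combinations

# Arabic combining diacritics, U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_marks(word):
    """Split a diacritized word into (base letter, attached diacritics) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def light_diacritize(chosen, candidates):
    """Keep the fewest diacritics of `chosen` that rule out every other candidate.

    Assumption: all candidates share the same undiacritized spelling.
    """
    target = split_marks(chosen)
    rivals = [split_marks(c) for c in candidates if c != chosen]
    # Only positions where the chosen reading carries a mark can disambiguate.
    marked = [i for i, (_, marks) in enumerate(target) if marks]
    for size in range(len(marked) + 1):
        for keep in combinations(marked, size):
            # Each rival must disagree with the chosen reading on a kept position.
            if all(any(r[i][1] != target[i][1] for i in keep) for r in rivals):
                return "".join(
                    base + (marks if i in keep else "")
                    for i, (base, marks) in enumerate(target)
                )
    return chosen  # no subset separates the rivals; fall back to the full form


# Three readings of the same spelling k-t-b (escaped to avoid rendering issues).
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"  # "it was written"
kutub = "\u0643\u064F\u062A\u064F\u0628"         # "books"

light = light_diacritize(kutub, [kataba, kutiba, kutub])
print(light)  # a single damma on the middle letter is enough
```

In this example one diacritic is sufficient for each of the three readings, whereas full diacritization would place two or three. The exhaustive subset search is exponential in the number of marked letters, which is tolerable only because individual words are short.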
Light Diacritic Restoration to Disambiguate Homographs in Modern Arabic Texts