research-article

Light Diacritic Restoration to Disambiguate Homographs in Modern Arabic Texts

Published: 13 December 2021

Abstract

Diacritic restoration (also known as diacritization or vowelization) is the process of inserting the correct diacritical marks into a text. Modern Arabic, as in newspapers, is typically written without diacritics. The missing marks often cause ambiguity, and although native readers are adept at resolving it, they sometimes fail. Diacritic restoration is a classical problem in computer science. Most existing work tackles the full (heavy) diacritization of text; we are instead interested in diacritizing text with fewer diacritics. Studies have shown that fully diacritized text is visually displeasing and slows reading. This article proposes a system that diacritizes homographs using the fewest possible diacritics, hence the name “light.” Homographs form a large class of words, and we deal with those that share a spelling but differ in meaning. With fewer diacritics, we expect no effect on reading speed, while eye strain is reduced. The system combines a morphological analyzer with context similarities. The morphological analyzer generates all candidate diacritizations of a word. Then, through a statistical approach and context similarities, we resolve the homographs. Experimentally, the system shows very promising results, with a best accuracy of 85.6%.
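The two stages named in the abstract (candidate generation, then resolution by context similarity) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the `CANDIDATES` table is a hypothetical stand-in for a morphological analyzer, the cue words are English placeholders, Jaccard overlap stands in for the unspecified similarity measure, and `lighten` is one possible reading of "fewest diacritics" (keep the smallest set of marks that rules out every rival candidate). Words are written in Buckwalter transliteration, where `Elm` is the bare spelling shared by `Eilom` ("knowledge") and `Ealam` ("flag").

```python
from itertools import combinations

# Buckwalter symbols for short vowels, sukun, shadda, and tanween.
DIACRITICS = set("aiuo~FKN")

# Hypothetical analyzer output: bare word -> {diacritized form: context cue words}.
CANDIDATES = {
    "Elm": {
        "Eilom": {"study", "research", "science", "book"},  # "knowledge"
        "Ealam": {"country", "raise", "pole", "colors"},    # "flag"
    }
}


def split_marks(form):
    """Split a diacritized form into (letter, marks) pairs."""
    pairs = []
    for ch in form:
        if ch in DIACRITICS and pairs:
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def jaccard(a, b):
    """Jaccard overlap of two word sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lighten(chosen, rivals):
    """Keep the smallest set of marks on `chosen` that excludes every rival.

    Assumes all rivals share the same letter skeleton as `chosen`.
    """
    pairs = split_marks(chosen)
    rival_pairs = [split_marks(r) for r in rivals]
    marked = [i for i, (_, m) in enumerate(pairs) if m]
    for k in range(len(marked) + 1):
        for keep in combinations(marked, k):
            # A rival is excluded if it disagrees at some kept position.
            if all(any(rp[i][1] != pairs[i][1] for i in keep) for rp in rival_pairs):
                return "".join(
                    letter + (marks if i in keep else "")
                    for i, (letter, marks) in enumerate(pairs)
                )
    return chosen  # no subset separates the rivals; fall back to the full form


def diacritize_light(bare, context):
    """Pick the candidate closest to the context, then strip redundant marks."""
    cands = CANDIDATES.get(bare)
    if not cands:
        return bare  # not a known homograph: leave undiacritized
    best = max(cands, key=lambda form: jaccard(cands[form], context))
    return lighten(best, [form for form in cands if form != best])
```

For instance, `diacritize_light("Elm", {"science", "book"})` selects the "knowledge" reading and keeps only the first vowel, since that single mark already separates it from "flag". A real system would draw its candidates from an actual analyzer and its contexts from a corpus.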

      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 21, Issue 3
        May 2022
        413 pages
        ISSN:2375-4699
        EISSN:2375-4702
        DOI:10.1145/3505182

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 13 December 2021
        • Accepted: 1 September 2021
        • Revised: 1 July 2021
        • Received: 1 May 2020

        Qualifiers

        • research-article
        • Refereed