research-article

Diacritic-Based Matching of Arabic Words

Published: 14 December 2018
Abstract

Arabic words consist of letters and short-vowel symbols called diacritics, written above or below the letters. Changing the diacritics may change the syntax and semantics of a word, turning it into a different word. This makes it difficult to compare words by string matching alone. Arabic NLP applications typically resort to morphological analysis to resolve the ambiguity arising from this and other challenges. In this article, we introduce three alternative algorithms for comparing two words that may carry different diacritics: the knowledge-based Subsume algorithm, the rule-based Imply algorithm, and the machine-learning-based Alike algorithm. We evaluated the soundness, completeness, and accuracy of the algorithms on a large dataset of 86,886 word pairs. The evaluation shows accuracies of 100% for Subsume, 99.32% for Imply, and 99.53% for Alike. Although accurate, Subsume was able to judge only 75% of the data. Both Subsume and Imply are sound, while Alike is not. We demonstrate the utility of the algorithms in a real-life use case: lemma disambiguation and the linking of hundreds of Arabic dictionaries.
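To illustrate why plain string matching falls short, the following minimal Python sketch compares words letter by letter and treats a missing diacritic as "unspecified". It is not the Subsume, Imply, or Alike algorithm from the article; the function names (`split_letters`, `strip_diacritics`, `compatible`) and the chosen set of diacritic code points are assumptions made for illustration only.

```python
# Illustrative sketch only: NOT the article's Subsume, Imply, or Alike algorithms.
# Assumption: diacritics are the Unicode marks U+064B..U+0652 (tanween, short
# vowels, shadda, sukun) plus the dagger alif U+0670.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def split_letters(word):
    """Return a list of (letter, frozenset of diacritics attached to that letter)."""
    units = []
    for ch in word:
        if ch in DIACRITICS:
            if units:  # a stray leading diacritic has no letter to attach to
                letter, marks = units[-1]
                units[-1] = (letter, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def strip_diacritics(word):
    """Remove all diacritics, keeping only the base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def compatible(w1, w2):
    """Naive check: same letters, and no letter carries conflicting diacritics.

    A letter without diacritics is treated as unspecified, so it never conflicts.
    """
    u1, u2 = split_letters(w1), split_letters(w2)
    if len(u1) != len(u2):
        return False
    for (l1, m1), (l2, m2) in zip(u1, u2):
        if l1 != l2:
            return False
        if m1 and m2 and m1 != m2:
            return False
    return True


kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf+fatha, ta+fatha, ba+fatha
kutub = "\u0643\u064F\u062A\u064F\u0628"         # kaf+damma, ta+damma, ba
bare = "\u0643\u062A\u0628"                      # kaf, ta, ba without diacritics

print(kataba == bare)             # exact string matching fails
print(compatible(kataba, bare))   # undiacritized form is compatible with both
print(compatible(kataba, kutub))  # conflicting diacritics on the same letters
```

Stripping every diacritic before comparing would make all three forms equal and lose the distinction between the two diacritized words, whereas exact matching rejects even the undiacritized spelling of the same word. The sketch sits between those extremes and ignores the linguistic knowledge, rules, and learned models that the article's algorithms rely on.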



      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 2
        June 2019
        208 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3300146

        Copyright © 2018 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 14 December 2018
        • Accepted: 1 July 2018
        • Revised: 1 May 2018
        • Received: 1 January 2018
        Qualifiers

        • research-article
        • Research
        • Refereed
