The Impact of Arabic Diacritization on Word Embeddings

Published: 16 June 2023

Abstract

Word embeddings are used to represent words for text analysis. They play an essential role in many Natural Language Processing (NLP) studies and have contributed substantially to the field's rapid progress in recent years. In Arabic, diacritic marks are a vital feature for the readability and understandability of the language, yet current Arabic word embeddings are non-diacritized. In this article, we develop and compare word embedding models trained on diacritized and non-diacritized corpora to study the impact of Arabic diacritization on word embeddings. We evaluate the models in four ways: clustering of the nearest words; morphological semantic analysis; part-of-speech (POS) tagging; and semantic analysis. For a more rigorous evaluation, we created three new datasets from scratch for the three downstream tasks. We conducted the downstream tasks with eight machine learning algorithms and two deep learning algorithms. Experimental results show that the diacritized model better captures syntactic and semantic relations and better clusters words of similar categories; overall, it outperforms the non-diacritized model. We also report further findings: the morphological semantic analysis shows that the advantage of the diacritized model grows as the number of target words increases, and diacritic marks matter more in POS tagging than in the other tasks.
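The core preprocessing distinction the article studies, i.e. training on diacritized versus non-diacritized text, can be illustrated with a minimal sketch. Arabic diacritic marks (fatha, kasra, damma, sukun, etc.) are Unicode combining marks, so stripping them yields the bare consonantal form that most existing Arabic embeddings are trained on. This sketch is illustrative only and is not taken from the article; the embedding training itself (e.g. word2vec over each corpus variant) is assumed, not shown.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks (Unicode combining marks, category 'Mn').

    NFC-normalize first so any decomposed characters are handled consistently.
    """
    normalized = unicodedata.normalize("NFC", text)
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")

# "kataba" (he wrote), fully vowelled with three fatha marks:
diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"   # كَتَبَ
bare = strip_diacritics(diacritized)                    # كتب

# A diacritized corpus keeps both surface forms distinct; a non-diacritized
# corpus collapses them, which is the contrast the two models are built on.
print(bare)
```

Under this setup, two parallel corpora (one raw, one passed through `strip_diacritics`) would feed two otherwise identical embedding training runs, so any downstream difference can be attributed to the diacritic marks.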


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6, June 2023, 635 pages
ISSN: 2375-4699; EISSN: 2375-4702
DOI: 10.1145/3604597

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 7 June 2022
• Revised: 3 February 2023
• Accepted: 30 March 2023
• Online AM: 19 April 2023
• Published: 16 June 2023
