Abstract
Word embeddings represent words as vectors for text analysis. They play an essential role in many Natural Language Processing (NLP) studies and have contributed substantially to the field's rapid progress in the last few years. In Arabic, diacritic marks are vital to the readability and understandability of the language, yet current Arabic word embeddings are non-diacritized. In this article, we develop and compare word embedding models trained on diacritized and non-diacritized corpora to study the impact of Arabic diacritization on word embeddings. We evaluate the models in four ways: clustering of the nearest words, morphological semantic analysis, part-of-speech (POS) tagging, and semantic analysis. To support the evaluation, we created three new datasets from scratch for the three downstream tasks. We conducted the downstream tasks with eight machine learning algorithms and two deep learning algorithms. Experimental results show that the diacritized model is better at capturing syntactic and semantic relations and at clustering words of similar categories. Overall, the diacritized model outperforms the non-diacritized model. Two further findings stand out: in the morphological semantic analysis, the advantage of the diacritized model becomes more pronounced as the number of target words increases, and diacritic marks matter more in POS tagging than in the other tasks.
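To make the diacritized versus non-diacritized contrast concrete, the sketch below strips diacritic marks from three Arabic words that differ only in their diacritics. It is an illustration, not the authors' preprocessing: the function name `strip_diacritics`, the choice of Unicode range (U+064B to U+0652 plus the superscript alef U+0670), and the example words are assumptions made here.

```python
import re

# Assumed set of marks to remove: fathatan..sukun (U+064B-U+0652)
# and the superscript alef (U+0670). Other schemes may include more marks.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritic marks listed above removed."""
    return DIACRITICS.sub("", text)


# Three distinct diacritized words built from the same letters (ain, lam, meem),
# written with escapes so the combining marks are explicit.
words = [
    "\u0639\u0650\u0644\u0652\u0645",        # 'ilm: knowledge
    "\u0639\u064E\u0644\u064E\u0645",        # 'alam: flag
    "\u0639\u064E\u0644\u0650\u0645\u064E",  # 'alima: he knew
]

# After stripping, all three collapse to the same undiacritized token.
stripped = {strip_diacritics(w) for w in words}
```

A model trained on the stripped text sees a single vocabulary entry for these three words, whereas a model trained on diacritized text can assign each its own vector.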
The Impact of Arabic Diacritization on Word Embeddings