Abstract
Word clustering is a serious challenge in low-resource languages. Since words that share semantics are expected to cluster together, it is common to use feature-vector representations generated by a distributional theory-based word embedding method. The goal of this work is to leverage Modern Standard Arabic (MSA) for better clustering of the low-resource Iraqi dialect vocabulary. We begin with a new Dialect Fast Stemming Algorithm (DFSA) that utilizes MSA data; the proposed algorithm achieves an F1 score of 0.85. We then test both a distributional theory-based word embedding method and a new simple, yet effective, representation named Wasf-Vec, which exploits a word's topological features. The difference between the two is that Wasf-Vec captures relations that are not contextually based. The embedding step is followed by an analysis of how dialect words cluster among MSA words. The analysis draws on word semantic relations grounded in well-established linguistic theories to shed light on the strong and weak word-relation representations identified by each embedding method, and is carried out by visualizing the feature vectors in two-dimensional (2D) space: the distributional embedding vectors are projected into 2D using the t-SNE algorithm, while the Wasf-Vec vectors are plotted directly. Each word's nearest neighbors and the distance histograms of the plotted words are examined. To validate the word classification used in this article, the produced classes are employed in Class-based Language Modeling (CBLM). The Wasf-Vec CBLM achieves 7% lower perplexity (pp) than the CBLM built on the distributional theory-based embedding. This result is significant when working with low-resource languages.
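The visualization and nearest-neighbor inspection described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses scikit-learn's t-SNE on toy random vectors, and the vocabulary words and 100-dimensional embeddings are placeholders standing in for the real word2vec / Wasf-Vec feature vectors.

```python
# Sketch: project high-dimensional word vectors to 2D with t-SNE and
# inspect each word's nearest neighbor in the original embedding space.
# Toy random vectors stand in for the actual embeddings (assumption).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vocab = ["kitab", "qalam", "madrasa", "bayt", "shams", "qamar"]  # hypothetical words
vectors = rng.normal(size=(len(vocab), 100))  # toy 100-dim embeddings

# Reduce the 100-dim vectors to 2D coordinates for plotting.
coords = TSNE(n_components=2, perplexity=2.0, random_state=0).fit_transform(vectors)

# Cosine-similarity nearest neighbor of each word in the original space,
# as examined when analyzing word relations.
norms = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = norms @ norms.T
np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
for i, word in enumerate(vocab):
    print(word, "->", vocab[int(np.argmax(sim[i]))])
```

With real embeddings, the 2D `coords` would be scatter-plotted to compare how dialect words cluster among MSA words under each representation.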
Wasf-Vec: Topology-based Word Embedding for Modern Standard Arabic and Iraqi Dialect Ontology