research-article

Wasf-Vec: Topology-based Word Embedding for Modern Standard Arabic and Iraqi Dialect Ontology

Published: 12 December 2019

Abstract

Word clustering is a serious challenge for low-resource languages. Since words that share semantics are expected to cluster together, a common approach is to represent each word by a feature vector generated with a distributional-theory-based word embedding method. The goal of this work is to exploit Modern Standard Arabic (MSA) to improve clustering of the low-resource Iraqi-dialect vocabulary. We begin with a new Dialect Fast Stemming Algorithm (DFSA) that utilizes MSA data; the proposed algorithm achieves an F1 score of 0.85. We then test both a distributional-theory-based word embedding method and a new, simple yet effective feature-vector representation named Wasf-Vec, which exploits a word's topological features. The difference between the two is that Wasf-Vec captures relations that are not contextually based. The embedding step is followed by an analysis of how dialect words cluster among other MSA words. The analysis draws on word semantic relations that are well supported by solid linguistic theories, shedding light on the strong and weak relation representations identified by each embedding method, and is carried out by visualizing the feature vectors in two-dimensional (2D) space: vectors from the distributional method are projected into 2D with the t-SNE algorithm, while Wasf-Vec feature vectors are plotted in 2D directly. Each word's nearest neighbors and the distance histograms of the plotted words are examined. To validate the word classification used in this article, the produced classes are employed in Class-based Language Modeling (CBLM). Wasf-Vec CBLM achieves a 7% lower perplexity (pp) than CBLM based on the distributional-theory-based embedding. This result is significant when working with low-resource languages.
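The nearest-neighbor inspection described above can be illustrated with a minimal sketch. The vocabulary and vectors below are hypothetical toy values, not the paper's actual embeddings; the sketch only shows how cosine similarity ranks a word's neighbors in any feature-vector space.

```python
import numpy as np

# Hypothetical toy vocabulary and 3-dimensional feature vectors,
# chosen so that "city" and "town" lie close together in the space.
vocab = ["city", "town", "river", "walk"]
vectors = np.array([
    [0.9, 0.1, 0.0],   # city
    [0.8, 0.2, 0.1],   # town
    [0.1, 0.9, 0.2],   # river
    [0.0, 0.1, 0.9],   # walk
])

def nearest_neighbor(word):
    """Return the vocabulary word with the highest cosine similarity."""
    i = vocab.index(word)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[i]
    sims[i] = -1.0          # exclude the query word itself
    return vocab[int(np.argmax(sims))]

print(nearest_neighbor("city"))  # -> "town"
```

The same ranking, applied to real embedding vectors, is what a 2D t-SNE plot makes visible at a glance: semantically related words end up as each other's nearest neighbors.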
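The CBLM validation step can likewise be sketched. In a class-based bigram model (in the style of Brown et al.), the word bigram probability factors as P(w_i | w_{i-1}) = P(class(w_i) | class(w_{i-1})) × P(w_i | class(w_i)), and perplexity is 2 to the power of the negative average log2 probability. The word-to-class mapping and probabilities below are invented toy values for illustration only.

```python
import math

# Hypothetical toy model: classes, class-transition probabilities,
# and word-emission probabilities within each class.
word2class = {"baghdad": "CITY", "basra": "CITY", "eats": "VERB"}
class_bigram = {("CITY", "VERB"): 0.5}       # P(c_i | c_{i-1})
word_given_class = {"baghdad": 0.5, "basra": 0.5, "eats": 1.0}

def sentence_perplexity(words):
    """Perplexity of a word sequence under the class-based bigram model."""
    log_prob, n = 0.0, 0
    for prev, cur in zip(words, words[1:]):
        p = class_bigram.get((word2class[prev], word2class[cur]), 1e-9)
        p *= word_given_class[cur]
        log_prob += math.log2(p)
        n += 1
    return 2 ** (-log_prob / n)

print(sentence_perplexity(["baghdad", "eats"]))  # -> 2.0
```

A better word classification concentrates probability mass on the correct classes, which is exactly what the reported 7% perplexity reduction measures.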

