research-article

The Impact of Weighting Schemes and Stemming Process on Topic Modeling of Arabic Long and Short Texts

Published: 12 November 2020

Abstract

In this article, we first present a comprehensive study of the impact of term weighting schemes on topic modeling performance (i.e., LDA and DMM) on Arabic long and short texts. We investigate six term weighting methods: the word count method (standard topic models), TFIDF, PMI, BDC, CLPB, and CEW. Moreover, we propose a novel combined term weighting scheme, CmTLB. It builds on mTFIDF, which takes into account missing terms and the number of documents in which a term appears when calculating the term weight. For a more robust term weight, we combine mTFIDF with two further weighting methods. We evaluate CmTLB against the studied weighting schemes on the quality of the learned topics (topic visualization and topic coherence) and on classification and clustering tasks. We apply the weighting schemes to Latent Dirichlet allocation (LDA) and the Dirichlet multinomial mixture (DMM) on eight Arabic long and short document datasets, respectively. The experimental results show that appropriate weighting schemes can effectively improve topic modeling performance on Arabic texts; more importantly, our proposed CmTLB significantly outperforms the other weighting schemes. Second, we investigate whether the Arabic stemming process can improve topic modeling performance. We study the three main approaches to Arabic stemming: root-based, stem-based, and statistical. We then train topic models with weighting schemes on documents processed by four stemmers representing these approaches. The results show that stemming not only reduces the dimensionality of the term-document matrix, leading to a faster estimation process, but also enhances topic modeling performance on both short and long Arabic documents. Moreover, the Farasa stemmer achieves the highest performance in most cases, since it avoids the ambiguity that can arise from the blind removal of affixes in root-based or stem-based stemmers.
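As background, the standard TFIDF baseline studied in the abstract can be sketched as follows. This is a minimal illustration of replacing raw word counts with term weights before topic modeling; the paper's mTFIDF variant (which additionally accounts for missing terms) and the combined CmTLB scheme are not reproduced here.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Standard TF-IDF: weight(t, d) = tf(t, d) * log(N / df(t)).

    Terms concentrated in few documents receive higher weights than
    terms spread across the corpus, which is the intuition behind
    using weighted (rather than count-based) topic models.
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within the document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy tokenized corpus (illustrative only)
docs = [["topic", "model", "arabic"],
        ["arabic", "stemmer", "root"],
        ["topic", "coherence"]]
w = tfidf_weights(docs)
# "arabic" occurs in two of three documents, so in document 2 its
# weight is lower than that of "stemmer", which occurs in only one.
```

In a weighted topic model such as those evaluated in the article, these per-document term weights replace the integer counts in the term-document matrix used during estimation.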

