Abstract
In this article, we first present a comprehensive study of the impact of term weighting schemes on the performance of topic models (namely LDA and DMM) on Arabic long and short texts. We investigate six term weighting methods: word count (the standard topic models), TFIDF, PMI, BDC, CLPB, and CEW. Moreover, we propose a novel combined term weighting scheme, namely CmTLB. It builds on mTFIDF, which takes into account the missing terms and the number of documents in which the term appears when calculating the term weight. To make the term weight more robust, we combine mTFIDF with two weighting methods. We evaluate CmTLB against the studied weighting schemes on the quality of the learned topics (topic visualization and topic coherence) and on classification and clustering tasks. We apply the weighting schemes to Latent Dirichlet allocation (LDA) and Dirichlet multinomial mixture (DMM) on eight Arabic long and short document datasets, respectively. The experimental results show that appropriate weighting schemes can effectively improve topic modeling performance on Arabic texts. More importantly, our proposed CmTLB significantly outperforms the other weighting schemes. Second, we investigate whether the Arabic stemming process can improve topic modeling performance. We study three approaches to Arabic stemming: root-based, stem-based, and statistical. We also train topic models with weighting schemes on documents after applying four stemmers that represent the different stemming approaches. The results show that stemming not only reduces the dimensionality of the term-document matrix, leading to a faster estimation process, but also enhances topic modeling performance on both short and long Arabic documents. Moreover, the Farasa stemmer achieves the highest performance in most cases, since it avoids the ambiguity that can arise from the blind removal of affixes in root-based or stem-based stemmers.
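As a rough illustration of what a term weighting scheme computes before a topic model is trained, the sketch below derives standard TF-IDF weights from tokenized documents. It is not the mTFIDF or CmTLB scheme proposed in the article, whose formulas are not given in this abstract. The function name `tfidf_weights` and the toy corpus are hypothetical, and the idf form `log(N / df)` is one common variant assumed here.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document using plain TF-IDF.

    Assumption: weight = raw term count * log(N / document frequency).
    This is standard TF-IDF, not the article's mTFIDF or CmTLB.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted


# Hypothetical toy corpus (already tokenized; English words for readability).
docs = [
    ["topic", "model", "arabic", "text"],
    ["arabic", "stemmer", "arabic", "root"],
    ["topic", "coherence", "arabic"],
]
weights = tfidf_weights(docs)
```

In this toy corpus the term "arabic" occurs in every document, so its weight is zero, while a term confined to one document, such as "stemmer", receives the largest idf. A weighted topic model would use such weights in place of raw word counts.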
The Impact of Weighting Schemes and Stemming Process on Topic Modeling of Arabic Long and Short Texts