Abstract
Topic models are among the most effective stochastic models for summarizing large text collections and have been highly successful in text analysis and text summarization. They can be applied to a set of documents represented as bags of words, without regard to grammar or word order. We modeled topics for a corpus of Gujarati news articles. Because Gujarati is morphologically diverse and inflectionally rich, processing Gujarati text is comparatively complex. Vocabulary size plays an important role in both the inference process and the quality of the topics: as the vocabulary grows, inference becomes slower and the semantic coherence of topics decreases. Reducing the vocabulary can therefore accelerate topic inference and may also improve topic quality. In this work, we prepared a list of suffixes that occur very frequently with words in Gujarati text and used it to reduce inflectional forms to their root words. We also removed Gujarati single-letter words to speed up inference and improve topic quality. Experiments showed that reducing inflectional forms to their root words shrinks the vocabulary significantly and speeds up topic formation. Moreover, inflectional form reduction and single-letter word removal enhanced the interpretability of the topics, which was assessed in terms of semantic coherence, word length, and topic size. The results showed improvements in the topical semantic coherence score. Topic size also grew notably as the number of tokens assigned to the topics increased.
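The two preprocessing steps described in the abstract, suffix-based reduction of inflectional forms and removal of single-letter words, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The suffix list `SUFFIXES`, the minimum stem length, and the definition of a "letter" as a non-combining Unicode character are all assumptions made for illustration; the paper's actual suffix list and rules are not given in the abstract.

```python
import unicodedata

# Hypothetical suffix list (common Gujarati case markers), for illustration only.
# The list actually prepared in the paper is not reproduced here.
SUFFIXES = ["માં", "થી", "ના", "ની", "નો", "ને"]


def count_letters(token):
    """Count base characters, ignoring combining marks such as vowel signs.

    Assumption: a "letter" is any character whose Unicode category
    does not start with "M" (mark).
    """
    return sum(1 for ch in token if not unicodedata.category(ch).startswith("M"))


def strip_suffix(token, suffixes=SUFFIXES, min_stem_letters=2):
    """Remove the longest matching suffix if the remaining stem is long enough."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            if count_letters(stem) >= min_stem_letters:
                return stem
    return token


def preprocess(tokens):
    """Reduce inflectional forms, then drop tokens with a single letter."""
    reduced = (strip_suffix(t) for t in tokens)
    return [t for t in reduced if count_letters(t) > 1]


if __name__ == "__main__":
    tokens = ["શહેરમાં", "શહેરથી", "શહેર", "જ"]
    cleaned = preprocess(tokens)
    print(cleaned)
    print("vocabulary before:", len(set(tokens)), "after:", len(set(cleaned)))
```

In this toy input, three surface forms of one word collapse to a single vocabulary entry and the single-letter particle is dropped, which is the kind of vocabulary reduction the abstract reports. A real pipeline would need a curated suffix list and safeguards against over-stemming.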
Improving Semantic Coherence of Gujarati Text Topic Model Using Inflectional Forms Reduction and Single-letter Words Removal