research-article

Improving Semantic Coherence of Gujarati Text Topic Model Using Inflectional Forms Reduction and Single-letter Words Removal

Published: 10 March 2021

Abstract

Topic models are among the most effective stochastic models for summarizing large text collections, and they have been highly successful in both text analysis and text summarization. A topic model can be applied to a set of documents represented as bags of words, ignoring grammar and word order. We modeled the topics of a corpus of Gujarati news articles. Because Gujarati has a diverse morphological structure and is inflectionally rich, processing Gujarati text is more complex. Vocabulary size plays an important role in both the inference process and the quality of the topics: as the vocabulary grows, inference becomes slower and the semantic coherence of the topics decreases. Reducing the vocabulary can therefore accelerate topic inference, and it may also improve topic quality. In this work, we prepared a list of suffixes that occur very frequently with words in Gujarati text and reduced inflectional forms to their root words by stripping the suffixes in this list. We also removed Gujarati single-letter words for faster inference and better topic quality. Our experiments showed that reducing inflectional forms to their root words shrinks the vocabulary significantly and speeds up topic formation. Moreover, inflectional form reduction and single-letter word removal improved the interpretability of the topics, which we assessed in terms of semantic coherence, word length, and topic size. The results showed improved topic semantic coherence scores, and topic size grew notably as the number of tokens assigned to the topics increased.
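The two preprocessing steps described in the abstract, suffix-based reduction of inflectional forms and removal of single-letter words, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the suffix list, the function names, and the minimum stem length are all assumptions made for the example, since the abstract does not give the actual list or rules. A "letter" is approximated here as a base character, so vowel signs and other combining marks are not counted.

```python
import unicodedata

# Hypothetical suffix list, for illustration only. The paper's actual
# list of frequent Gujarati suffixes is not given in the abstract.
SUFFIXES = ["માં", "થી", "ની", "નો", "ના", "ને"]


def letter_count(word):
    """Count base characters, skipping combining marks (Unicode category M*),
    so that a consonant plus its vowel sign counts as one letter."""
    return sum(1 for ch in word if not unicodedata.category(ch).startswith("M"))


def strip_suffix(word, suffixes=SUFFIXES, min_stem_letters=2):
    """Remove the longest matching suffix, provided the remaining stem keeps
    at least min_stem_letters letters (an assumed safeguard)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if letter_count(stem) >= min_stem_letters:
                return stem
    return word


def preprocess(tokens):
    """Reduce inflectional forms, then drop single-letter words."""
    reduced = (strip_suffix(token) for token in tokens)
    return [token for token in reduced if letter_count(token) > 1]


tokens = ["ઘરમાં", "ઘર", "છે", "શહેરની"]
cleaned = preprocess(tokens)
print(cleaned)
print(len(set(tokens)), "->", len(set(cleaned)))  # distinct word forms before and after
```

In this toy input, two inflected forms collapse onto their roots and one single-letter word is dropped, which is the mechanism by which the vocabulary passed to the topic model shrinks.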



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 1
Special issue on Deep Learning for Low-Resource Natural Language Processing, Part 1 and Regular Papers
January 2021
332 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3439335

Copyright © 2021 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 10 March 2021
• Accepted: 1 October 2020
• Revised: 1 May 2020
• Received: 1 December 2018


Qualifiers

• research-article
• Research
• Refereed
