
Improved Word Sense Determination in Malayalam using Latent Dirichlet Allocation and Semantic Features

Published: 3 November 2021

Abstract

Recent years have witnessed remarkable progress in Natural Language Processing (NLP) worldwide, but Indian regional languages have seen comparatively little of it. This work is a step towards building a target-word sense disambiguation system for Malayalam, the regional language of the state of Kerala, India. Word Sense Disambiguation/Determination (WSD) is the task of correctly identifying the sense of an ambiguous word from its context, and it is considered an AI-complete problem in NLP. For this purpose, an exclusive corpus of 1,147 contexts of target ambiguous words has been created, which to the best of our knowledge is the first such resource for Malayalam. This work describes how the performance of an unsupervised LDA-based approach to WSD can be improved using semantic features such as synonyms and co-occurrence information.
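The abstract describes combining a topic-model signal with synonym and co-occurrence features to pick a word sense. The following is a minimal illustrative sketch of that general idea, not the authors' actual method: the sense inventory, synonym lists, and co-occurrence sets below are hypothetical stand-ins, and an English ambiguous word is used in place of Malayalam for readability. Each sense is scored by weighted overlap between the context and the sense's semantic features, optionally mixed with a topic-model prior.

```python
from collections import Counter

# Hypothetical sense inventory for an ambiguous word: each sense carries
# a set of synonyms and typical co-occurring words. These are illustrative
# stand-ins for the semantic features the paper describes.
SENSES = {
    "bank/finance": {"synonyms": {"treasury", "depository"},
                     "cooccur": {"money", "loan", "account", "deposit"}},
    "bank/river":   {"synonyms": {"shore", "embankment"},
                     "cooccur": {"river", "water", "flood", "boat"}},
}

def disambiguate(context_words, topic_prior=None):
    """Return the sense whose synonym/co-occurrence sets best overlap
    the context; synonyms count double as stronger evidence."""
    counts = Counter(context_words)
    best_sense, best_score = None, float("-inf")
    for sense, feats in SENSES.items():
        score = sum(2 * counts[w] for w in feats["synonyms"])
        score += sum(counts[w] for w in feats["cooccur"])
        # Optionally add a prior weight for the sense, e.g. derived from
        # the dominant LDA topic of the surrounding document.
        if topic_prior:
            score += topic_prior.get(sense, 0.0)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

context = "she rowed the boat to the river bank before the flood".split()
print(disambiguate(context))  # → bank/river
```

In a full LDA-based pipeline, `topic_prior` would come from inferring the topic distribution of the context document and mapping topics to senses; here it is left as an optional hook to keep the sketch self-contained.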



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 2
March 2022, 413 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3494070


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 November 2021
      • Accepted: 1 July 2021
      • Revised: 1 May 2021
      • Received: 1 August 2020


      Qualifiers

      • research-article
      • Refereed
