A Lemmatizer for Low-resource Languages: WSD and Its Role in the Assamese Language

Published: 17 May 2022

Abstract

The morphological variations of highly inflected languages impede computer processing of text and make root-word determination difficult, for example when extracting an abstract. To address this difficulty, a lemmatization algorithm is developed, and its effectiveness is evaluated for Word Sense Disambiguation (WSD). Given its usefulness, a lemmatizer is a natural component when developing Natural Language Processing tools for languages rich in morphological variations. Assamese, spoken by over 14 million people in the North-Eastern region of India, is one such highly inflected Indian language. In the present work, after a detailed study of the transformations through which surface words are created from lemmas, we have designed an Assamese lemmatizer that applies suitable reverse transformations to a surface word to recover the corresponding lemma. The lemmatizer handles both inflectional and derivational morphology in Assamese. It was evaluated on Assamese articles extracted from the Assamese Corpus, comprising 50,000 surface words (excluding proper nouns), and achieved 82% accuracy. This result is encouraging because Assamese is a low-resource language and no prior research on the lemmatization of Assamese words has been carried out. Given this result, the lemmatizer is then evaluated for Assamese WSD. For this purpose, 10 highly polysemous Assamese words are considered for sense disambiguation. We have also considered several WSD systems and observed that the lemmatizer improves the effectiveness of all of them, and that the improvement is statistically significant.
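The reverse-transformation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not list the actual Assamese rules or resources, so everything below is an assumption made for illustration: the rule table `REVERSE_RULES`, the toy `LEXICON`, and the function names `lemmatize` and `accuracy` are hypothetical, and English-like suffixes stand in for Assamese ones. Each rule strips a surface suffix and appends a lemma suffix, and a candidate is accepted only if it appears in the lexicon.

```python
# Hypothetical reverse transformations: (surface suffix, lemma suffix).
# English-like toy rules stand in for the Assamese rules, which the
# abstract does not list. Rules are tried in the order given.
REVERSE_RULES = [
    ("ies", "y"),
    ("ing", "e"),
    ("ing", ""),
    ("es", ""),
    ("s", ""),
]

# Hypothetical lemma lexicon used to validate candidate lemmas.
LEXICON = {"city", "make", "walk", "box", "cat"}


def lemmatize(word, lexicon=LEXICON, rules=REVERSE_RULES):
    """Return the first lexicon-validated lemma, or the word unchanged."""
    if word in lexicon:
        return word
    for surface_suffix, lemma_suffix in rules:
        # Require a non-empty stem so a rule never consumes the whole word.
        if word.endswith(surface_suffix) and len(word) > len(surface_suffix):
            candidate = word[: -len(surface_suffix)] + lemma_suffix
            if candidate in lexicon:
                return candidate
    return word  # no rule produced a known lemma


def accuracy(pairs, lexicon=LEXICON):
    """Fraction of (surface word, gold lemma) pairs lemmatized correctly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(lemmatize(surface, lexicon) == gold for surface, gold in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    gold = [("cities", "city"), ("making", "make"),
            ("geese", "goose"), ("cats", "cat")]
    print(lemmatize("walking"))
    print(accuracy(gold))
```

Validating candidates against a lexicon is one possible way to choose among competing reverse transformations; the accuracy function mirrors the kind of word-level evaluation the abstract reports, computed here on toy data only.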

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 4
      July 2022
      464 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3511099

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 May 2022
      • Accepted: 1 November 2021
      • Revised: 1 September 2021
      • Received: 1 November 2020

      Qualifiers

      • research-article
      • Refereed
