BenLem (A Bengali Lemmatizer) and Its Role in WSD

Published: 26 February 2016

Abstract

A lemmatization algorithm for Bengali has been developed and evaluated, and its effectiveness for word sense disambiguation (WSD) has been investigated. One of the key challenges in the computer processing of highly inflected languages is dealing with the frequent morphological variations of the root words that appear in text. A lemmatizer is therefore essential for developing natural language processing (NLP) tools for such languages. In this experiment, Bengali, which is the national language of Bangladesh and the second most widely spoken language in the Indian subcontinent, is taken as a reference. To design the Bengali lemmatizer (named BenLem), the possible transformations through which surface words are formed from lemmas are studied, so that appropriate reverse transformations can be applied to a surface word to recover the corresponding lemma. BenLem is found to be capable of handling both inflectional and derivational morphology in Bengali. It is evaluated on a set of 18 news articles taken from the FIRE Bengali News Corpus, comprising 3,342 surface words (excluding proper nouns), and is found to be 81.95% accurate. The role of the lemmatizer is then investigated for Bengali WSD. Ten highly polysemous Bengali words are considered for sense disambiguation, and the WSD dataset is created from the FIRE corpus and a collection of Tagore's short stories. Several WSD systems are considered for this experiment; BenLem improves the performance of all of them, and the improvements are statistically significant.
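The idea of applying reverse transformations to a surface word can be illustrated with a minimal Python sketch. This is not BenLem's actual rule set or algorithm, which the abstract does not specify. The rule table, the romanized word forms, the tiny lexicon, and the function name `lemmatize` are all hypothetical and chosen only to show the general technique: undo a suffix transformation and accept the result if it is a known lemma.

```python
# Toy illustration of reverse-transformation lemmatization.
# ASSUMPTIONS: the rules, romanized forms, and lexicon below are hypothetical
# examples for illustration; they are not taken from BenLem.

# Each rule maps a surface suffix back to what the lemma ends with.
REVERSE_RULES = [
    ("gulo", ""),  # plural marker: "boigulo" (books) -> "boi" (book)
    ("ra", ""),    # plural marker: "chhelera" (boys) -> "chhele" (boy)
    ("er", ""),    # genitive marker: "gharer" (of the house) -> "ghar" (house)
]

# A tiny stand-in for a lemma dictionary.
LEXICON = {"boi", "chhele", "ghar"}


def lemmatize(surface, rules=REVERSE_RULES, lexicon=LEXICON):
    """Return a lemma for `surface`, or `surface` itself if no rule applies."""
    if surface in lexicon:
        return surface  # already a known lemma

    candidates = []
    for suffix, replacement in rules:
        # Only strip a suffix if something is left over as a stem.
        if surface.endswith(suffix) and len(surface) > len(suffix):
            candidate = surface[: -len(suffix)] + replacement
            if candidate in lexicon:
                candidates.append(candidate)

    # Prefer the longest candidate, i.e. the smallest change to the word.
    if candidates:
        return max(candidates, key=len)
    return surface  # unknown word: leave unchanged


for word in ["boigulo", "chhelera", "gharer", "kolkata"]:
    print(word, "->", lemmatize(word))
```

Validating each candidate against a lexicon is one simple way to avoid over-stripping; a real system would need a far richer rule inventory to cover both inflectional and derivational forms.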

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 3
      March 2016
      220 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/2876004

      Copyright © 2016 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 26 February 2016
      • Revised: 1 October 2015
      • Accepted: 1 October 2015
      • Received: 1 August 2014

      Qualifiers

      • note
      • Research
      • Refereed
