Abstract
A lemmatization algorithm for Bengali has been developed and evaluated, and its effectiveness for word sense disambiguation (WSD) is investigated. A key challenge in computer processing of highly inflected languages is handling the frequent morphological variations of root words that appear in text. A lemmatizer is therefore essential for developing natural language processing (NLP) tools for such languages. In this experiment, Bengali, which is the national language of Bangladesh and the second most widely spoken language in the Indian subcontinent, is taken as the reference language. To design the Bengali lemmatizer (named BenLem), the possible transformations through which surface words are formed from lemmas are studied, so that the appropriate reverse transformations can be applied to a surface word to recover the corresponding lemma. BenLem is found to be capable of handling both inflectional and derivational morphology in Bengali. It is evaluated on a set of 18 news articles taken from the FIRE Bengali News Corpus, comprising 3,342 surface words (excluding proper nouns), and is found to be 81.95% accurate. The role of the lemmatizer in Bengali WSD is then investigated. Ten highly polysemous Bengali words are considered for sense disambiguation, and the WSD dataset is created from the FIRE corpus and a collection of Tagore's short stories. Several WSD systems are considered in this experiment; BenLem improves the performance of all of them, and the improvements are statistically significant.
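The idea of applying reverse transformations to a surface word can be illustrated with a minimal Python sketch. This is not BenLem's actual rule set or algorithm: the three suffix rules, the tiny lexicon, and the function name `lemmatize` are all assumptions chosen for illustration. Each rule strips a surface suffix, optionally appends a replacement, and the candidate is accepted only if it appears in the lexicon.

```python
# Toy reverse-transformation lemmatizer.
# ASSUMPTION: the rules and lexicon below are illustrative only; they are
# not taken from BenLem.

# Each rule is (surface_suffix, replacement): strip the suffix from the
# surface word, then append the replacement to form a candidate lemma.
REVERSE_RULES = [
    ("গুলো", ""),  # plural marker, e.g. "বইগুলো" (the books)
    ("ের", ""),    # genitive marker after a consonant, e.g. "মানুষের"
    ("রা", ""),    # plural marker on animate nouns, e.g. "ছেলেরা"
]

# Hypothetical lexicon of known lemmas.
LEXICON = {"বই", "ছেলে", "মানুষ"}


def lemmatize(surface, rules=REVERSE_RULES, lexicon=LEXICON):
    """Return the lemma of `surface`, or `surface` itself if no rule applies."""
    if surface in lexicon:
        return surface
    # Try longer suffixes first so the most specific rule wins.
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if surface.endswith(suffix) and len(surface) > len(suffix):
            candidate = surface[: -len(suffix)] + replacement
            # Accept the candidate only if it is a known lemma.
            if candidate in lexicon:
                return candidate
    return surface


if __name__ == "__main__":
    for word in ["বইগুলো", "ছেলেরা", "মানুষের"]:
        print(word, "->", lemmatize(word))
```

Checking candidates against a lexicon keeps the sketch from over-stripping words that merely end in a suffix-like string; a realistic system would need a far larger rule set and lexicon, and would also have to handle derivational morphology, which this sketch does not attempt.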
BenLem (A Bengali Lemmatizer) and Its Role in WSD