Abstract
Morphological variation in the text of highly inflected languages impedes computer processing and root-word determination in tasks such as extracting an abstract. As a remedy, a lemmatization algorithm is developed, and its effectiveness is evaluated for Word Sense Disambiguation (WSD). Given its usefulness, a lemmatizer is a natural component when developing Natural Language Processing tools for morphologically rich languages. Assamese, spoken by over 14 million people in the North-Eastern region of India, is one such highly inflected Indian language. In this work, after a detailed study of the possible transformations through which surface words are created from lemmas, we designed an Assamese lemmatizer that applies suitable reverse transformations to a surface word to recover the corresponding lemma. The lemmatizer handles both inflectional and derivational morphology in Assamese. It was evaluated on Assamese articles extracted from the Assamese Corpus, comprising 50,000 surface words (excluding proper nouns), and achieved 82% accuracy. This result is encouraging because Assamese is a low-resource language and no prior research work has addressed the lemmatization of Assamese words. Given this result, the lemmatizer is then evaluated for Assamese WSD, using 10 highly polysemous Assamese words for sense disambiguation. We considered several WSD systems and observed that the lemmatizer improves the effectiveness of all of them, and the improvement is statistically significant.
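The reverse-transformation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule list, the toy lexicon, and the names `lemmatize`, `accuracy`, `LEXICON`, and `REVERSE_RULES` are assumptions made only for illustration, and the suffix rules cover just a few common Assamese markers rather than the transformations studied in the paper.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical toy lexicon of known lemmas (not taken from the paper).
LEXICON = {"মানুহ", "কিতাপ", "ঘৰ"}

# Hypothetical reverse transformations: (suffix to remove, text to put back).
# An empty replacement means the suffix is simply stripped.
REVERSE_RULES = [("বোৰ", ""), ("খন", ""), ("ৰ", ""), ("ত", "")]


def lemmatize(word: str, max_depth: int = 3) -> Optional[str]:
    """Undo suffix transformations until a lexicon entry is reached."""
    if word in LEXICON:
        return word
    if max_depth == 0:
        return None
    # Try longer suffixes first so a long marker is not split into pieces.
    for suffix, replacement in sorted(REVERSE_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: -len(suffix)] + replacement
            lemma = lemmatize(candidate, max_depth - 1)
            if lemma is not None:
                return lemma
    return None  # no sequence of reverse transformations reached a lemma


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (surface word, gold lemma) pairs lemmatized correctly."""
    pairs = list(pairs)
    correct = sum(1 for surface, gold in pairs if lemmatize(surface) == gold)
    return correct / len(pairs)
```

In this sketch a stacked form such as `মানুহবোৰৰ` is resolved by undoing one suffix at a time, and `accuracy` mirrors the kind of surface-word evaluation the abstract reports, here on made-up pairs only. A lexicon check at every step is one possible way to stop over-stripping; the paper may use a different criterion.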
Index Terms
A Lemmatizer for Low-resource Languages: WSD and Its Role in the Assamese Language