Abstract
Hindi is the third most-spoken language in the world (615 million speakers) and has the fourth-largest number of native speakers (341 million). It is an inflectionally rich language with relatively free word order and a very large vocabulary. Despite its global prominence, very few Natural Language Processing (NLP) applications and tools have been developed for it. Moreover, most existing tools perform poorly because they lack semantic information (contextual knowledge). Hindi grammar derives most of its rules from Paninian grammar, which emphasizes the role of karaka theory in free word-order languages. In this article, we present an application that extracts all possible karakas from simple Hindi sentences with an accuracy of 84.2% and an F1 score of 88.5%. We use features such as part-of-speech tags, post-position markers (vibhaktis), semantic tags for nouns, and syntactic structure to capture context in word windows of different sizes within a sentence. Using these features, we built a rule-based inference engine that extracts karakas from a sentence. The application takes as input a text file of clean (punctuation-free) simple Hindi sentences and writes the karaka-tagged sentences to a separate text file.
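To make the idea of vibhakti-driven rules concrete, the following is a minimal sketch, not the rule set of the application described above. It assumes whitespace-separated, punctuation-free input and uses only the conventional school-grammar pairing of post-position markers with karakas. The table `VIBHAKTI_TO_KARAKAS` and the function `candidate_karakas` are hypothetical names introduced for illustration. Ambiguous markers return every candidate, since resolving them would need the part-of-speech and semantic-tag features that this sketch leaves out.

```python
# Hypothetical illustration only: the conventional vibhakti-to-karaka pairing
# from Hindi school grammar, NOT the rule set of the application in the abstract.
VIBHAKTI_TO_KARAKAS = {
    "ने": ["karta"],
    "को": ["karma", "sampradana"],   # ambiguous without further context
    "से": ["karana", "apadana"],     # ambiguous without further context
    "के लिए": ["sampradana"],        # two-word marker
    "का": ["sambandha"],
    "के": ["sambandha"],
    "की": ["sambandha"],
    "में": ["adhikarana"],
    "पर": ["adhikarana"],
}


def candidate_karakas(sentence: str) -> list[tuple[str, str, list[str]]]:
    """Return (word, marker, candidate karakas) for each marked word.

    Assumes a clean, punctuation-free sentence whose tokens are separated
    by whitespace. The word immediately before a marker receives the tags.
    """
    tokens = sentence.split()
    results = []
    i = 1  # a marker needs a preceding word, so start at the second token
    while i < len(tokens):
        two_word = " ".join(tokens[i:i + 2])
        if two_word in VIBHAKTI_TO_KARAKAS:
            marker = two_word  # prefer the longer marker, e.g. "के लिए" over "के"
        elif tokens[i] in VIBHAKTI_TO_KARAKAS:
            marker = tokens[i]
        else:
            i += 1
            continue
        results.append((tokens[i - 1], marker, VIBHAKTI_TO_KARAKAS[marker]))
        i += len(marker.split())  # skip past every token of the marker
    return results


if __name__ == "__main__":
    for word, marker, karakas in candidate_karakas("राम ने रावण को तीर से मारा"):
        print(f"{word} + {marker}: {', '.join(karakas)}")
```

In this sketch, the markers को and से each yield two candidates. Narrowing them to one karaka is the kind of decision for which contextual features such as semantic noun tags would be needed.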
Index Terms
Development of Automatic Rule-based Semantic Tagger and Karaka Analyzer for Hindi