Abstract
Konkani is one of the languages included in the Eighth Schedule of the Indian Constitution. It is the official language of Goa and is spoken mainly in Goa and in parts of Karnataka and Kerala. The Konkani WordNet, also referred to as Konkani Shabdamalem (kōṁkanī śabdamālēṁ), was developed under the Indradhanush WordNet Project Consortium from August 2010 to October 2013. The project was funded by Technology Development for Indian Languages (TDIL), Department of Electronics & Information Technology (DeitY), Ministry of Communication and Information Technology (MCIT). Work on the Konkani WordNet has been halted since the project ended. The Konkani WordNet currently contains around 32,370 synsets. To make it a powerful resource for NLP applications in Konkani, further research is needed on enhancing the Konkani WordNet through community involvement. Crowdsourcing is a technique in which the knowledge of the crowd is used to accomplish a particular task.
In this article, we present the details of a crowdsourcing platform named “Konkani Shabdarth” (kōṁkanī śabdārth). Konkani Shabdarth draws on the knowledge of Konkani speakers to create new synsets and thereby enhance the wordnet quantitatively. It also aims to improve the overall quality of the Konkani WordNet by validating existing synsets and adding missing words to them. A text corpus named the “Konkani Shabdarth Corpus” was created from Konkani literature while implementing the Konkani Shabdarth tool. Using this corpus, we identified 572 root words that are missing from the Konkani WordNet; these are given as input to Konkani Shabdarth. So far, a total of 94 users have registered on the platform, of whom 25 have actually played the game. To date, 71 new synsets have been obtained for 21 words. For some of the words, multiple entries for the concept definition have been received. This overlap is essential for automating the process of validating the synsets. Owing to the pandemic, it has been difficult to train players and get them to actually play the game and contribute. We also studied the impact on the coverage of the Konkani WordNet of adding missing words from other existing Konkani text corpora. The expected increase in the percentage coverage of the Konkani WordNet is in the range 20–27 after adding the missing words from the Konkani Shabdarth Corpus, compared with an increase in the range 1–10 for the other corpora.
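The abstract does not define how coverage is computed. As a minimal sketch, assuming coverage means the percentage of distinct corpus root words that already have an entry in the wordnet, the missing-word and coverage figures could be derived as follows. The function names and the toy word lists are hypothetical placeholders, not the authors' implementation or data.

```python
def missing_roots(corpus_roots, wordnet_lemmas):
    """Return the distinct corpus root words that have no wordnet entry."""
    return sorted(set(corpus_roots) - set(wordnet_lemmas))


def coverage_percent(corpus_roots, wordnet_lemmas):
    """Percentage of distinct corpus root words found in the wordnet.

    Assumed definition: |corpus ∩ wordnet| / |corpus| * 100.
    """
    corpus = set(corpus_roots)
    if not corpus:
        return 0.0
    return 100.0 * len(corpus & set(wordnet_lemmas)) / len(corpus)


# Toy placeholder tokens (not real Konkani data).
corpus_roots = ["word_a", "word_b", "word_c", "word_d", "word_a"]
wordnet_lemmas = {"word_a", "word_b"}

before = coverage_percent(corpus_roots, wordnet_lemmas)
missing = missing_roots(corpus_roots, wordnet_lemmas)

# Simulate the wordnet after the missing roots have been added as synsets.
after = coverage_percent(corpus_roots, wordnet_lemmas | set(missing))
increase = after - before
```

Under this assumed definition, the reported ranges would correspond to `increase` measured against each corpus in turn.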
Konkani WordNet: Corpus-Based Enhancement using Crowdsourcing