Konkani WordNet: Corpus-Based Enhancement using Crowdsourcing

Published: 04 March 2022

Abstract

Konkani is one of the languages included in the Eighth Schedule of the Indian Constitution. It is the official language of Goa and is spoken mainly in Goa and in parts of Karnataka and Kerala. The Konkani WordNet, also referred to as Konkani Shabdamalem (kōṁkanī śabdamālēṁ), was developed under the Indradhanush WordNet Project Consortium between August 2010 and October 2013. The project was funded by Technology Development for Indian Languages (TDIL), Department of Electronics & Information Technology (DeitY), and Ministry of Communication and Information Technology (MCIT). Work on the Konkani WordNet has been halted since the project ended. The Konkani WordNet currently contains around 32,370 synsets. To make it a powerful resource for NLP applications in Konkani, further research is needed on enhancing the Konkani WordNet through community involvement. Crowdsourcing is a technique in which the knowledge of the crowd is used to accomplish a particular task.

In this article, we present the details of a crowdsourcing platform named “Konkani Shabdarth” (kōṁkanī śabdārth). Konkani Shabdarth attempts to use the knowledge of Konkani-speaking people to create new synsets and thereby enhance the wordnet quantitatively. It also aims to improve the overall quality of the Konkani WordNet by validating the existing synsets and adding missing words to them. A text corpus named the “Konkani Shabdarth Corpus” was created from Konkani literature while implementing the Konkani Shabdarth tool. Using this corpus, 572 root words missing from the Konkani WordNet were identified and given as input to Konkani Shabdarth. To date, a total of 94 users have registered on the platform, of whom 25 have actually played the game. So far, 71 new synsets have been obtained for 21 words. For some words, multiple entries for the concept definition have been received. This overlap is essential for automating the validation of synsets. Because of the pandemic, it has been difficult to train players and get them to actually play the game and contribute. We also studied how adding missing words from other existing Konkani text corpora affects the coverage of the Konkani WordNet. The expected increase in the percentage coverage of the Konkani WordNet is in the range 20–27 after adding the missing words from the Konkani Shabdarth Corpus, compared with an increase in the range 1–10 for the other corpora.
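The abstract does not say how coverage is computed, so the following is only a minimal sketch under one assumption: coverage is the percentage of distinct corpus root words that already appear in the wordnet. The function names `missing_words` and `coverage` and the toy word lists are hypothetical placeholders, not the authors' implementation or data.

```python
def missing_words(corpus_roots, wordnet_words):
    """Return the distinct corpus root words absent from the wordnet, sorted."""
    return sorted(set(corpus_roots) - set(wordnet_words))


def coverage(corpus_roots, wordnet_words):
    """Percentage of distinct corpus root words found in the wordnet.

    Assumed definition (type-level coverage); the article may define it differently.
    """
    distinct = set(corpus_roots)
    if not distinct:
        return 0.0
    return 100.0 * len(distinct & set(wordnet_words)) / len(distinct)


# Toy placeholder data: "w1".."w4" stand in for root words.
corpus = ["w1", "w2", "w3", "w4", "w2"]
wordnet = {"w1", "w2"}

missing = missing_words(corpus, wordnet)
before = coverage(corpus, wordnet)
after = coverage(corpus, wordnet | set(missing))
print(missing, before, after)
```

Under this definition, the increase in coverage for a given corpus is simply `after - before`, which is one way to compare the effect of adding missing words drawn from different corpora.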


    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 21, Issue 4
      July 2022
      464 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3511099

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 4 March 2022
      • Accepted: 1 November 2021
      • Revised: 1 October 2021
      • Received: 1 September 2020


      Qualifiers

      • research-article
      • Refereed