short-paper

Named Entity Recognition and Classification for Punjabi Shahmukhi

Published: 17 April 2020

Abstract

Named entity recognition (NER) refers to identifying proper nouns in natural language text and classifying them into named entity types, such as person, location, and organization. Owing to the widespread applications of NER, numerous NER techniques and benchmark datasets have been developed for both Western and Asian languages. Even though the Shahmukhi script of the Punjabi language is used by nearly three-fourths of Punjabi speakers worldwide, Gurmukhi has been the main focus of research activities. Specifically, no benchmark NER corpus exists for Shahmukhi, which has held back NER research for the script. To fill this gap, this article presents the development and specifications of the first-ever NER corpus for Shahmukhi. The newly developed corpus is composed of 318,275 tokens and 16,300 named entities, including 11,147 persons, 3,140 locations, and 2,013 organizations. To establish the strength of the corpus, we compare its specifications with those of its Gurmukhi counterparts. Furthermore, we demonstrate the usability of the corpus using five supervised learning techniques, including two state-of-the-art deep learning techniques. The results are compared, and insights into the behavior of the most effective technique are discussed.
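The corpus figures quoted in the abstract can be checked with a little arithmetic. The following is a minimal sketch in Python, not part of the article: the counts are taken from the abstract, while the helper names `class_shares` and `entities_per_100_tokens` are hypothetical and introduced only for illustration. Because the abstract does not say how many tokens each entity spans, the second figure is entities per 100 tokens rather than the fraction of tokens that are tagged.

```python
# Counts as reported in the abstract.
TOTAL_TOKENS = 318_275
TOTAL_ENTITIES = 16_300
ENTITY_COUNTS = {"person": 11_147, "location": 3_140, "organization": 2_013}


def class_shares(counts):
    """Return each entity type's share of all entities, in percent."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def entities_per_100_tokens(n_entities, n_tokens):
    """Return the number of named entities per 100 corpus tokens."""
    return 100.0 * n_entities / n_tokens


# The per-type counts should add up to the reported total.
assert sum(ENTITY_COUNTS.values()) == TOTAL_ENTITIES

shares = class_shares(ENTITY_COUNTS)
density = entities_per_100_tokens(TOTAL_ENTITIES, TOTAL_TOKENS)

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}% of entities")
print(f"{density:.2f} entities per 100 tokens")
```

The per-type counts sum to the reported total, and persons make up roughly two-thirds of all entities, so the corpus is noticeably skewed toward the person type.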

