Abstract
Named entity recognition (NER) is the task of identifying proper nouns in natural language text and classifying them into named entity types, such as person, location, and organization. Because NER has widespread applications, numerous NER techniques and benchmark datasets have been developed for both Western and Asian languages. Although the Shahmukhi script of the Punjabi language is used by nearly three-fourths of Punjabi speakers worldwide, research has focused mainly on Gurmukhi. In particular, no benchmark NER corpus exists for Shahmukhi, which has held back NER research for the script. To this end, this article presents the development and specifications of the first-ever NER corpus for Shahmukhi. The newly developed corpus comprises 318,275 tokens and 16,300 named entities: 11,147 persons, 3,140 locations, and 2,013 organizations. To establish the strength of the corpus, we compare its specifications with those of its Gurmukhi counterparts. Furthermore, we demonstrate its usability using five supervised learning techniques, including two state-of-the-art deep learning techniques. The results are compared, and insights into the behavior of the most effective technique are discussed.
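The corpus figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the article: the counts are taken from the abstract, while the variable and function names are illustrative assumptions. It computes each entity type's share of all named entities and the number of named entities per 100 tokens (a count of entities, not of tokens inside entities, since an entity may span several tokens).

```python
# Counts as stated in the abstract; all names below are illustrative.
TOTAL_TOKENS = 318_275
ENTITY_COUNTS = {"person": 11_147, "location": 3_140, "organization": 2_013}


def entity_shares(counts):
    """Return each entity type's share of all named entities, in percent."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# The per-type counts should add up to the reported total of 16,300.
total_entities = sum(ENTITY_COUNTS.values())
shares = entity_shares(ENTITY_COUNTS)

# Named entities per 100 tokens (entity mentions, not entity tokens).
entities_per_100_tokens = round(100 * total_entities / TOTAL_TOKENS, 2)

print(total_entities)
print(shares)
print(entities_per_100_tokens)
```

Under these figures, persons account for roughly two-thirds of all named entities, which suggests that per-type scores are likely to be more informative than a single aggregate score when techniques are compared on this corpus.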
Index Terms
Named Entity Recognition and Classification for Punjabi Shahmukhi