Abstract
Named entity recognition (NER) is the task of identifying proper nouns in natural language text and classifying them into types such as location, person, and organization. Because NER underpins many natural language processing (NLP) tasks, numerous NER approaches and benchmark datasets have been proposed. However, the development of NER techniques for low-resource languages remains limited by the absence of substantial training datasets. Punjabi is a classic example of a low-resource language: although several researchers have worked on Punjabi NER, they have focused on the Gurmukhi script. To address the challenges of developing NER for the Shahmukhi script, this paper presents an improved Punjabi NER technique for Shahmukhi. We first extend the existing dataset with new NER classes by leveraging a novel Pool of Words data augmentation strategy. The extended dataset contains 1,131,509 tokens and 125,789 labeled entities, with more named entities (NEs) than the earlier version. We then fine-tune the Bidirectional Encoder Representations from Transformers (BERT) model for the NER task. Experiments with the proposed approach on both the new and the earlier dataset versions show that our method achieves competitive results.
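The abstract names a "Pool of Words" augmentation strategy without detailing it. As an illustration only, the sketch below shows one common mention-replacement form of such augmentation, in which an entity token is swapped for another mention of the same class drawn from a per-class pool; the pool contents, tag scheme (IOB2), and function names here are assumptions, not the paper's actual procedure.

```python
import random

# Hypothetical per-class pools of single-token entity mentions.
POOLS = {
    "LOC": ["Lahore", "Multan", "Karachi"],
    "PER": ["Ahmed", "Fatima"],
}

def augment(sentence, pools, rng):
    """Return a copy of an IOB2-tagged sentence (list of (token, tag) pairs)
    in which each B-<class> token whose class has a pool is replaced by a
    randomly drawn mention of the same class. Tags are left untouched, so
    the label sequence of the augmented sentence stays valid."""
    out = []
    for token, tag in sentence:
        if tag.startswith("B-") and tag[2:] in pools:
            token = rng.choice(pools[tag[2:]])
        out.append((token, tag))
    return out

rng = random.Random(0)  # seeded for reproducibility
sent = [("Ali", "B-PER"), ("visited", "O"), ("Lahore", "B-LOC")]
aug = augment(sent, POOLS, rng)
```

Because only surface forms change while the tag sequence is preserved, each pass over the corpus yields additional labeled sentences at no extra annotation cost.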
Using Data Augmentation and Bidirectional Encoder Representations from Transformers for Improving Punjabi Named Entity Recognition