Research Article

Using Data Augmentation and Bidirectional Encoder Representations from Transformers for Improving Punjabi Named Entity Recognition

Published: 16 June 2023

Abstract

Named entity recognition (NER) is the task of identifying proper nouns in natural language text and classifying them into types such as location, person, and organization. Because NER underpins many natural language processing (NLP) tasks, numerous NER approaches and benchmark datasets have been proposed. However, developing NER techniques for low-resource languages remains limited by the absence of substantial training datasets. Punjabi is a classic example of a low-resource language: although various researchers have worked on Punjabi NER, they have focused on the Gurmukhi script. To overcome the challenges of developing NER for the Shahmukhi script, we present an improved Punjabi NER technique for Shahmukhi in this paper. We first extend the existing dataset, adding new NER classes by leveraging a novel Pool of Words data augmentation strategy. The extended dataset has 1,131,509 tokens and 125,789 labeled entities, with more named entities (NEs) than the older dataset. We then fine-tune a transformer model, Bidirectional Encoder Representations from Transformers (BERT), for the NER task. Experiments with the proposed approach on both the new and the older versions of the dataset show that our method achieves competitive results.
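The abstract names two technical components: a Pool of Words augmentation strategy and BERT fine-tuning. The paper's exact Pool of Words procedure is not described in this abstract, so the following is only a minimal sketch of one plausible reading, assuming BIO-tagged data and that mentions of the same entity class are interchangeable: entity mentions are first collected into per-class pools, then augmented sentences are generated by swapping mentions within a class.

```python
# Sketch of a pool-of-words style augmentation for BIO-tagged NER data.
# Assumptions (not specified in the abstract): the dataset is a list of
# (tokens, tags) pairs, and same-class entity mentions are interchangeable.
import random
from collections import defaultdict

def build_entity_pools(dataset):
    """Collect every entity mention, grouped by class, into a pool."""
    pools = defaultdict(list)
    for tokens, tags in dataset:
        mention, label = [], None
        for tok, tag in zip(tokens, tags):
            if tag.startswith("B-"):
                if mention:
                    pools[label].append(mention)   # flush previous mention
                mention, label = [tok], tag[2:]
            elif tag.startswith("I-") and mention:
                mention.append(tok)
            else:
                if mention:
                    pools[label].append(mention)
                mention, label = [], None
        if mention:
            pools[label].append(mention)
    return pools

def augment(tokens, tags, pools, p=0.5):
    """Replace each entity span, with probability p, by another mention
    of the same class drawn from the pool."""
    new_tokens, new_tags, i = [], [], 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-") and random.random() < p:
            label = tag[2:]
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{label}":
                j += 1                             # find end of entity span
            repl = random.choice(pools[label])
            new_tokens.extend(repl)
            new_tags.extend([f"B-{label}"] + [f"I-{label}"] * (len(repl) - 1))
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tag)
            i += 1
    return new_tokens, new_tags
```

For the second step, a minimal fine-tuning sketch using the Hugging Face Transformers Trainer is shown below. The checkpoint name, label set, and hyperparameters are illustrative assumptions, since the abstract does not specify them; the subword/label alignment is the standard recipe for BERT token classification.

```python
# Minimal BERT fine-tuning sketch for NER with Hugging Face Transformers.
# "bert-base-multilingual-cased" and the label set are assumptions; the
# paper's actual checkpoint and tag inventory may differ.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

def encode(tokens, tags):
    """Tokenize a pre-split sentence and align BIO tags with subword
    pieces; special tokens and continuation subwords get ignore index -100."""
    enc = tokenizer(tokens, is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for wid in enc.word_ids():
        if wid is None:
            aligned.append(-100)                  # [CLS] / [SEP]
        elif wid != prev:
            aligned.append(labels.index(tags[wid]))
        else:
            aligned.append(-100)                  # continuation subword
        prev = wid
    enc["labels"] = aligned
    return enc

args = TrainingArguments(output_dir="punjabi-ner", num_train_epochs=3,
                         per_device_train_batch_size=16)
# train_dataset would hold the encoded (augmented) sentences:
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```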



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6, June 2023, 635 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3604597

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 15 September 2022
• Revised: 8 April 2023
• Accepted: 19 April 2023
• Online AM: 4 May 2023
• Published: 16 June 2023

