short-paper

Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications

Published: 20 June 2019
Abstract

Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question-answering systems. Recognizing this importance, researchers have developed a plethora of NER techniques for Western and Asian languages. However, although Urdu has over 490 million speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this article makes four key contributions. First, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated NEs. The developed corpus contains at least twice as many manually tagged NEs as any existing Urdu NER corpus. Second, we have generated six new word embeddings using three different techniques, fastText, Word2vec, and GloVe, on two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language, besides the recently released Urdu word embeddings by Facebook. Third, we have pioneered the application of deep learning techniques, NN and RNN, to Urdu named entity recognition. Finally, we have run 32 different experiments, each over 10 folds, using combinations of a traditional supervised learning technique and deep learning techniques, seven types of word embeddings, and two different Urdu NER datasets. Based on an analysis of the results, we provide several valuable insights into the effectiveness of deep learning techniques, the impact of word embeddings, and the effect of dataset variations.
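The abstract does not spell out how the 10 folds were formed. As an illustration only, the sketch below assumes standard 10-fold cross-validation over sentences, where each sentence lands in the test split exactly once. The function name `k_fold_indices` and the placeholder sentence list are hypothetical and are not taken from the article.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    Assumption: plain k-fold cross-validation without shuffling or
    stratification; the article may have used a different protocol.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    # Spread the remainder over the first `extra` folds so fold sizes
    # differ by at most one item.
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# Hypothetical toy corpus: 25 placeholder sentence identifiers.
sentences = [f"sentence_{i}" for i in range(25)]
folds = list(k_fold_indices(len(sentences), k=10))

for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"fold {fold_number}: {len(train)} train / {len(test)} test sentences")
```

Splitting at the sentence level rather than the token level keeps each tagged sentence intact, which matters for sequence models such as an RNN; whether the authors split this way is not stated in the abstract.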

