Abstract
Named Entity Recognition (NER) plays a pivotal role in many natural language processing tasks, such as machine translation and automatic question answering. Recognizing this importance, researchers have developed a plethora of NER techniques for Western and Asian languages. However, although Urdu has over 490 million speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this article makes four key contributions. First, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated NEs. This corpus contains at least twice as many manually tagged NEs as any existing Urdu NER corpus. Second, we have generated six new word embeddings by applying three techniques (fastText, Word2vec, and GloVe) to two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language besides the Urdu word embeddings recently released by Facebook. Third, we have pioneered the application of deep learning techniques, NN and RNN, to Urdu named entity recognition. Finally, we have performed 32 different experiments, each run over 10 folds, using combinations of a traditional supervised learning technique and deep learning techniques, seven types of word embeddings, and two different Urdu NER datasets. Based on an analysis of the results, we provide several valuable insights into the effectiveness of deep learning techniques, the impact of word embeddings, and the effect of dataset variations.
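The evaluation protocol mentioned in the abstract can be illustrated with a minimal sketch, assuming that "10 folds" refers to standard 10-fold cross-validation over the annotated sentences. The abstract does not say how the folds were built, so the contiguous, unshuffled split below is an assumption, and `k_fold_splits`, `cross_validate`, and the `train_and_score` callback are hypothetical names introduced only for illustration.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_splits(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumption: folds are contiguous and unshuffled; the source does not
    describe its splitting strategy.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional item each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def cross_validate(
    sentences: Sequence,
    train_and_score: Callable[[list, list], float],
    k: int = 10,
) -> float:
    """Average the score returned by a (hypothetical) callback over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(sentences), k):
        train = [sentences[i] for i in train_idx]
        test = [sentences[i] for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)
```

In such a setup, each of the 32 experiment configurations (one model, one embedding, one dataset) would call `cross_validate` once, with `train_and_score` training the chosen tagger on the training sentences and returning a metric such as F1 on the held-out fold.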
Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications