Abstract
Named Entity Recognition and Classification (NERC) is the process of identifying named entities in text and classifying them into categories such as person names, location names, and organization names. In this article, we discuss the development of an Urdu Named Entity (NE) corpus, called the Kamran-PU-NE (KPU-NE) corpus, annotated for three entity types (Person, Organization, and Location), with all remaining tokens marked as Others (O). We use two supervised learning algorithms, Hidden Markov Model (HMM) and Artificial Neural Network (ANN), to develop the Urdu NERC system. The annotated corpus contains 652,852 tokens drawn from 15 different genres, with a total of 44,480 NEs. The inter-annotator agreement between the two annotators, measured by the Kappa (κ) statistic, is 73.41%. With HMM, the highest recorded precision, recall, and F-measure values are 55.98%, 83.11%, and 66.90%, respectively; with ANN, they are 81.05%, 87.54%, and 84.17%, respectively.
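As a quick consistency check on the reported scores, the sketch below recomputes the F-measure from the precision and recall values given in the abstract. It assumes the F-measure is the balanced harmonic mean of precision and recall (often written F1); the abstract does not state the weighting, so this is an assumption, and the helper name `f_measure` is purely illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: equal weighting of precision and recall (beta = 1).
    Inputs and output share the same units (here, percent).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) in percent, taken from the abstract.
reported = {
    "HMM": (55.98, 83.11, 66.90),
    "ANN": (81.05, 87.54, 84.17),
}

for model, (p, r, f_reported) in reported.items():
    f = f_measure(p, r)
    print(f"{model}: computed F = {f:.2f}%, reported F = {f_reported:.2f}%")
```

Under this assumption, the recomputed values agree with the reported F-measures to two decimal places, which suggests the balanced form is the one used.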
Index Terms
Urdu Named Entity Recognition and Classification System Using Artificial Neural Network