short-paper

Urdu Named Entity Recognition and Classification System Using Artificial Neural Network

Published: 15 September 2017

Abstract

Named Entity Recognition and Classification (NERC) is the process of identifying named entities in text and classifying them into categories such as person names, location names, and organization names. In this article, we discuss the development of an Urdu Named Entity (NE) corpus, called the Kamran-PU-NE (KPU-NE) corpus, annotated for three entity types (Person, Organization, and Location), with all remaining tokens marked as Others (O). We use two supervised learning algorithms, Hidden Markov Model (HMM) and Artificial Neural Network (ANN), to develop the Urdu NERC system. The corpus contains 652,852 tokens taken from 15 different genres, with a total of 44,480 annotated NEs. The inter-annotator agreement between the two annotators, measured with the kappa (κ) statistic, is 73.41%. With HMM, the highest recorded precision, recall, and F-measure values are 55.98%, 83.11%, and 66.90%, respectively; with ANN, they are 81.05%, 87.54%, and 84.17%, respectively.
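The abstract does not give the formulas behind these figures. The sketch below assumes the usual definitions: the balanced F-measure (harmonic mean of precision and recall) and Cohen's kappa for two annotators. Under that assumption, the reported F-measure values can be reproduced from the reported precision and recall. The function names and the six-token label sequences are hypothetical and serve only as illustration; they are not taken from the KPU-NE corpus.

```python
from collections import Counter


def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same tag by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own tag distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[tag] * counts_b[tag] for tag in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical tag throughout
    return (observed - expected) / (1 - expected)


# Precision and recall as reported in the abstract (percentages).
print(f"HMM F-measure: {f_measure(55.98, 83.11):.2f}")  # 66.90
print(f"ANN F-measure: {f_measure(81.05, 87.54):.2f}")  # 84.17

# Hypothetical toy annotations (assumption, not corpus data).
annotator_1 = ["PER", "O", "LOC", "O", "ORG", "O"]
annotator_2 = ["PER", "O", "O", "O", "ORG", "LOC"]
print(f"Toy kappa: {cohen_kappa(annotator_1, annotator_2):.2f}")  # 0.50
```

Both computed F-measure values match the abstract to two decimal places, which is consistent with the balanced F-measure being the metric reported.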

