Abstract
In this article, we propose a word embedding-based approach to named entity recognition (NER). NER is commonly treated as a sequence labeling task and addressed with methods such as conditional random fields (CRFs). However, for low-resource languages that lack sufficiently large training data, methods such as CRFs do not perform well. In our work, we use the proximity of the vector embeddings of words to approach the NER problem. The hypothesis is that the vectors of words belonging to the same name category, such as person names, lie close to one another in the embedding space. Assuming that this clustering hypothesis holds, we apply a standard classification approach to the word vectors to learn a decision boundary between the NER classes. Our NER experiments are conducted on Bengali, a morphologically rich and low-resource language. Our approach significantly outperforms standard baseline CRF approaches that use cluster labels of word embeddings and gazetteers constructed from Wikipedia. Further, we propose an unsupervised approach that, in the absence of training data, uses a named entity (NE) gazetteer created automatically from Wikipedia. For a low-resource language, the word vectors obtained from Wikipedia are not sufficient to train a classifier. We therefore use the distance between the vector embeddings of words to expand the set of Wikipedia training examples with additional NEs extracted from a monolingual corpus, which yields a significant improvement in unsupervised NER performance. In fact, our expansion method performs better than the traditional supervised CRF-based approach (F-score of 65.4% vs. 64.2%). Finally, we compare our proposed approach to the official submissions for the IJCNLP-2008 Bengali NER shared task and achieve an overall F-score improvement of 11.26% with respect to the best official system.
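The two ideas in the abstract, classifying word vectors directly and expanding a small seed gazetteer by embedding distance, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the 2-dimensional toy vectors, the transliterated tokens, the labels, the cosine threshold of 0.95, and the nearest-centroid classifier, which stands in for whatever "standard classification approach" the article actually uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seeds(seeds, embeddings, threshold=0.95):
    """Label an unlabeled word with the class of its most similar seed,
    but only if that similarity reaches the (assumed) threshold."""
    expanded = dict(seeds)
    for word, vec in embeddings.items():
        if word in seeds:
            continue
        best_label, best_sim = None, threshold
        for seed, label in seeds.items():
            if seed not in embeddings:
                continue
            sim = cosine(vec, embeddings[seed])
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label is not None:
            expanded[word] = best_label
    return expanded


def train_centroids(labeled, embeddings):
    """Average the vectors of each class (a stand-in classifier)."""
    sums, counts = {}, {}
    for word, label in labeled.items():
        if word not in embeddings:
            continue
        vec = embeddings[word]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def classify(vec, centroids):
    """Return the class whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy embeddings; real ones would be learned from a corpus.
embeddings = {
    "rahim": [0.90, 0.10],
    "karim": [0.85, 0.15],
    "dhaka": [0.10, 0.90],
    "kolkata": [0.15, 0.88],
    "river": [0.60, 0.60],
}
# Hypothetical seed gazetteer, e.g. entries harvested from Wikipedia.
seeds = {"rahim": "PERSON", "dhaka": "LOCATION"}

expanded = expand_seeds(seeds, embeddings)
centroids = train_centroids(expanded, embeddings)
```

In this toy setting, the expansion step adds the words that lie very close to a seed and leaves the ambiguous one unlabeled; the classifier is then trained on the enlarged set. The threshold controls the trade-off between adding more training examples and adding noisy ones.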
Named Entity Recognition with Word Embeddings and Wikipedia Categories for a Low-Resource Language