Abstract
Named entity extraction is a fundamental task for many natural language processing applications on the web. Existing studies rely on annotated training data, which is expensive to obtain in large quantities, and this limits recognition effectiveness. In this research, we propose a semisupervised learning approach for constructing web named entity recognition (NER) models via automatic labeling and tri-training. The former uses structured resources containing known named entities to label text automatically, while the latter uses unlabeled examples to improve extraction performance. Because the automatically labeled training data may contain noise, a follow-up self-testing procedure removes low-confidence annotations to produce higher-quality training data. Furthermore, we modify tri-training for sequence labeling and derive a proper initialization for training on large datasets to improve entity recognition. Finally, we apply this semisupervised learning framework to person name, business organization name, and location name recognition. For Chinese NER, the framework achieves F-measures of 0.911, 0.849, and 0.845 for person, business organization, and location names, respectively. The same framework is also applied to English and Japanese business organization name recognition, yielding models with F-measures of 0.832 and 0.803, respectively.
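The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `auto_label` and `agreed_examples`, the BIO tag scheme, the longest-match lookup, and the whole-sequence agreement rule are all assumptions made for illustration, since the abstract does not specify how tri-training is modified for sequence labeling.

```python
from typing import Callable, List, Sequence, Tuple

# A tagger maps a token sequence to one label per token (hypothetical interface).
Tagger = Callable[[Sequence[str]], List[str]]


def auto_label(tokens: Sequence[str],
               known_entities: Sequence[Sequence[str]],
               tag: str = "ORG") -> List[str]:
    """Assign BIO labels by longest-match lookup against known entities.

    Assumption: entities are given as token sequences; the paper's actual
    matching against structured resources may differ.
    """
    # Try longer entities first so "Acme Corp" wins over "Acme".
    entities = sorted((tuple(e) for e in known_entities), key=len, reverse=True)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for ent in entities:
            n = len(ent)
            if n and tuple(tokens[i:i + n]) == ent:
                labels[i] = f"B-{tag}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{tag}"
                i += n
                break
        else:
            # No entity starts at this position; move on.
            i += 1
    return labels


def agreed_examples(tagger_a: Tagger,
                    tagger_b: Tagger,
                    unlabeled: Sequence[Sequence[str]]
                    ) -> List[Tuple[List[str], List[str]]]:
    """Collect pseudo-labeled sentences for a third tagger.

    Assumption: a sentence is accepted only when the two other taggers
    agree on every token label (one possible sequence-level agreement rule).
    """
    accepted = []
    for tokens in unlabeled:
        labels_a, labels_b = tagger_a(tokens), tagger_b(tokens)
        if labels_a == labels_b:
            accepted.append((list(tokens), labels_a))
    return accepted
```

In a full tri-training loop, each of three taggers would in turn be retrained on its original data plus the sentences returned by `agreed_examples` for the other two, repeating until no tagger changes.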
Boosted Web Named Entity Recognition via Tri-Training