Boosted Web Named Entity Recognition via Tri-Training

Published: 14 October 2016

Abstract

Named entity extraction is a fundamental task for many natural language processing applications on the web. Existing studies rely on annotated training data, which are expensive to obtain in large quantities, and this limits recognition effectiveness. In this research, we propose a semisupervised learning approach for constructing web named entity recognition (NER) models via automatic labeling and tri-training. The former uses structured resources containing known named entities to label training data automatically, while the latter uses unlabeled examples to improve extraction performance. Since the automatically labeled training data may contain noise, a self-testing procedure follows to remove low-confidence annotations and prepare higher-quality training data. Furthermore, we modify tri-training for sequence labeling and derive a proper initialization for training on large datasets to improve entity recognition. Finally, we apply this semisupervised learning framework to person name recognition, business organization name recognition, and location name extraction. For Chinese NER, F-measures of 0.911, 0.849, and 0.845 are achieved for person, business organization, and location names, respectively. The same framework is also applied to English and Japanese business organization name recognition, yielding models with F-measures of 0.832 and 0.803, respectively.
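
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The first function labels tokens by longest-match lookup against a list of known entities and emits BIO tags; the BIO scheme and the matching strategy are assumptions made here. The second function shows the agreement step at the core of tri-training: sentences on which two taggers agree become candidate training data for the third. Requiring agreement on the whole tag sequence is also an assumption, the error-rate conditions of the original tri-training algorithm are omitted, and the names `auto_label` and `agreed_examples` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[str]
Tags = List[str]
Tagger = Callable[[Tokens], Tags]  # any function mapping tokens to tags


def auto_label(tokens: Tokens, known_entities: Sequence[Tokens],
               entity_type: str) -> Tags:
    """Tag tokens with BIO labels by longest-match lookup (assumed scheme)."""
    tags = ["O"] * len(tokens)
    # Try longer entities first so a long name wins over its own prefix.
    candidates = sorted(known_entities, key=len, reverse=True)
    i = 0
    while i < len(tokens):
        for entity in candidates:
            n = len(entity)
            if n > 0 and tokens[i:i + n] == list(entity):
                tags[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + entity_type
                i += n  # skip past the matched entity
                break
        else:
            i += 1  # no entity starts at this position
    return tags


def agreed_examples(tagger_a: Tagger, tagger_b: Tagger,
                    unlabeled: Sequence[Tokens]) -> List[Tuple[Tokens, Tags]]:
    """Return sentences on which two taggers output identical tag sequences.

    In tri-training these would be offered to the third tagger as new
    training examples; whole-sequence agreement is an assumption here.
    """
    agreed = []
    for tokens in unlabeled:
        tags_a = tagger_a(tokens)
        if tags_a == tagger_b(tokens):
            agreed.append((tokens, tags_a))
    return agreed
```

In a full loop, each of the three taggers would in turn receive the examples agreed on by the other two, be retrained, and the process would repeat until no tagger changes.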
