Abstract
Named entity recognition (NER) is an important task in natural language understanding because it extracts the key entities (person, organization, location, date, number, etc.) and objects (product, song, movie, activity name, etc.) mentioned in text. However, existing natural language processing (NLP) tools (such as Stanford NER) either recognize only general named entities or require annotated training examples and feature engineering to build a supervised model. Since not all languages or entity types have public NER support, a tool for training NER models is essential for information extraction in low-resource languages or for low-resource entity types. In this article, we study the problem of developing a tool that prepares a training corpus from the Web, using known seed entities, for custom NER model training via distant supervision. The major challenges of automatic labeling are the long labeling time caused by a large corpus and a large set of seed entities, and the need to avoid the false positive and false negative examples caused by short and long seeds. To address these challenges, we adopt locality-sensitive hashing (LSH) for seed entities of various lengths. We conduct experiments on five types of entity recognition tasks, namely Chinese person names, food names, locations, points of interest (POIs), and activity names, to demonstrate the improvements obtained with the proposed Web NER model construction tool. Because the training corpus is obtained by automatically labeling sentences related to the seed entities, one can use either the entire corpus or only the positive sentences for model training. Based on the experimental results, we found that this decision should depend on whether a traditional linear-chain conditional random field (CRF) or a deep neural network-based CRF is used for model training, as well as on the completeness of the provided seed list.
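The automatic labeling step described in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: it assumes character-level BIO tags and uses exact longest-match dictionary lookup as a simple stand-in for the LSH-based matching, and the names `label_sentence`, `build_corpus`, and the tag `ENT` are hypothetical. It shows how preferring the longest seed avoids labeling a short seed inside a longer one, and how the "entire corpus" and "positive only" training sets differ.

```python
from typing import Iterable, List, Set, Tuple

Labeled = List[Tuple[str, str]]  # (character, BIO tag) pairs


def label_sentence(sentence: str, seeds: Set[str], tag: str = "ENT") -> Labeled:
    """Label one sentence by exact longest-match against the seed list.

    Assumption: exact matching replaces the article's LSH-based matching.
    """
    max_len = max((len(s) for s in seeds), default=0)
    labels = ["O"] * len(sentence)
    i = 0
    while i < len(sentence):
        matched = 0
        # Try the longest candidate first so a long seed wins over
        # any shorter seed that is a prefix of it.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + n] in seeds:
                matched = n
                break
        if matched:
            labels[i] = "B-" + tag
            for j in range(i + 1, i + matched):
                labels[j] = "I-" + tag
            i += matched
        else:
            i += 1
    return list(zip(sentence, labels))


def build_corpus(sentences: Iterable[str], seeds: Set[str],
                 positive_only: bool = False) -> List[Labeled]:
    """Build a training corpus; optionally keep only sentences with a match."""
    corpus = []
    for sentence in sentences:
        labeled = label_sentence(sentence, seeds)
        has_entity = any(lab != "O" for _, lab in labeled)
        if has_entity or not positive_only:
            corpus.append(labeled)
    return corpus


if __name__ == "__main__":
    seeds = {"台北", "台北車站"}  # illustrative seed list, not from the article
    sentences = ["我在台北車站等你", "今天天氣很好"]
    for labeled in build_corpus(sentences, seeds, positive_only=True):
        print(" ".join(ch + "/" + lab for ch, lab in labeled))
```

With `positive_only=False` the second sentence is kept with all-`O` tags. Whether those negative sentences help depends, as the abstract notes, on the model type and on how complete the seed list is.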
On the Construction of Web NER Model Training Tool based on Distant Supervision