ABSTRACT

We focus on knowledge base construction (KBC) from richly formatted data. In contrast to KBC from text or tabular data, KBC from richly formatted data aims to extract relations conveyed jointly via textual, structural, tabular, and visual expressions. We introduce Fonduer, a machine-learning-based KBC system for richly formatted data. Fonduer presents a new data model that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer uses a new deep-learning model to automatically capture the representation (i.e., features) needed to learn how to extract relations from richly formatted data. Finally, Fonduer provides a new programming model that enables users to convert domain expertise, based on multiple modalities of information, to meaningful signals of supervision for training a KBC system. Fonduer-based KBC systems are in production for a range of use cases, including at a major online retailer. We compare Fonduer against state-of-the-art KBC approaches in four different domains. We show that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base (and in some cases produces up to 1.87x the number of correct entries) compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer's new programming model. We show that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve on average 23 F1 points higher quality than traditional machine-learning-based KBC approaches.
Fonduer: Knowledge Base Construction from Richly Formatted Data