DOI: 10.1145/3183713.3183729

Fonduer: Knowledge Base Construction from Richly Formatted Data

Published: 27 May 2018

ABSTRACT

We focus on knowledge base construction (KBC) from richly formatted data. In contrast to KBC from text or tabular data, KBC from richly formatted data aims to extract relations conveyed jointly via textual, structural, tabular, and visual expressions. We introduce Fonduer, a machine-learning-based KBC system for richly formatted data. Fonduer presents a new data model that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer uses a new deep-learning model to automatically capture the representation (i.e., features) needed to learn how to extract relations from richly formatted data. Finally, Fonduer provides a new programming model that enables users to convert domain expertise, based on multiple modalities of information, into meaningful supervision signals for training a KBC system. Fonduer-based KBC systems are in production for a range of use cases, including at a major online retailer. We compare Fonduer against state-of-the-art KBC approaches in four different domains. We show that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base (and in some cases produces up to 1.87x the number of correct entries) compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer's new programming model. We show that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve, on average, 23 F1 points higher quality than traditional machine-learning-based KBC approaches.
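The programming model mentioned in the abstract can be pictured with a minimal sketch. It assumes that domain expertise is written as small heuristic functions that vote on candidate relations, each drawing on a different modality (tabular, visual, textual). Everything here is hypothetical and for illustration only: the `Mention` and `Candidate` classes, the function names, and the invented parts-datasheet example are not Fonduer's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical label values: vote positive, vote negative, or abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


@dataclass
class Mention:
    """A hypothetical text span with a few multimodal attributes."""
    text: str
    in_table: bool    # tabular cue: does the span sit in a table cell?
    row_header: str   # structural cue: header of its table row ("" if none)
    page: int         # visual cue: page on which the span is rendered


@dataclass
class Candidate:
    """A hypothetical candidate relation between a part and a value."""
    part: Mention
    value: Mention


def lf_row_header_mentions_current(c: Candidate) -> int:
    # Tabular heuristic: a value in a row labeled "... current" looks promising.
    if c.value.in_table and "current" in c.value.row_header.lower():
        return POSITIVE
    return ABSTAIN


def lf_far_apart_pages(c: Candidate) -> int:
    # Visual heuristic: mentions many pages apart are unlikely to be related.
    if abs(c.part.page - c.value.page) > 2:
        return NEGATIVE
    return ABSTAIN


def lf_value_not_numeric(c: Candidate) -> int:
    # Textual heuristic: a value that does not parse as a number is rejected.
    try:
        float(c.value.text)
    except ValueError:
        return NEGATIVE
    return ABSTAIN


def apply_lfs(lfs: List[Callable[[Candidate], int]],
              candidates: List[Candidate]) -> List[List[int]]:
    """Build a label matrix: one row per candidate, one column per heuristic."""
    return [[lf(c) for lf in lfs] for c in candidates]


lfs = [lf_row_header_mentions_current, lf_far_apart_pages, lf_value_not_numeric]
candidates = [
    Candidate(Mention("BC546", False, "", 1),
              Mention("200", True, "Collector Current", 1)),
    Candidate(Mention("BC546", False, "", 1),
              Mention("N/A", False, "", 9)),
]
labels = apply_lfs(lfs, candidates)
print(labels)  # [[1, 0, 0], [0, -1, -1]]
```

The resulting matrix of noisy votes is the kind of supervision signal that could then be combined and used to train an extraction model. How Fonduer itself combines such signals is not described in this abstract.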

        Published in

          SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
          May 2018
          1874 pages
          ISBN: 9781450347037
          DOI: 10.1145/3183713

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • research-article

          Acceptance Rates

          SIGMOD '18 paper acceptance rate: 90 of 461 submissions, 20%. Overall acceptance rate: 785 of 4,003 submissions, 20%.

        This article is provided by ACM and the author Luke Hsiao through the ACM Author-Izer service.