research-article

FlashExtract: a framework for data extraction by examples

Published: 09 June 2014

Abstract

Various document types that combine model and view (e.g., text files, webpages, spreadsheets) make it easy to organize (possibly hierarchical) data, but make it difficult to extract raw data for any further manipulation or querying. We present FlashExtract, a general framework for extracting relevant data from semi-structured documents using examples. It includes (a) an interaction model that allows end users to give examples to extract various fields and to relate them in a hierarchical organization using structure and sequence constructs, and (b) an inductive synthesis algorithm that synthesizes the intended program from a few examples in any underlying domain-specific language for data extraction that has been built using our specified algebra of a few core operators (map, filter, merge, and pair). We describe instantiations of our framework for three different domains: text files, webpages, and spreadsheets. On our benchmark comprising 75 documents, FlashExtract is able to extract intended data using an average of 2.36 examples in 0.84 seconds per field.
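To give a rough feel for what an extraction program composed from the four core operators might look like, here is a minimal Python sketch. It is an illustration only and not the paper's DSL or synthesis algorithm: the function names `map_op`, `filter_op`, `merge_op`, `pair_op`, and `extract_names`, the sample document, and the exact semantics chosen for each operator are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def map_op(f: Callable[[A], B], seq: Sequence[A]) -> List[B]:
    """Apply f to every element of a sequence (assumed semantics)."""
    return [f(x) for x in seq]


def filter_op(pred: Callable[[A], bool], seq: Sequence[A]) -> List[A]:
    """Keep only the elements that satisfy pred (assumed semantics)."""
    return [x for x in seq if pred(x)]


def merge_op(*seqs: Sequence[int]) -> List[int]:
    """Combine several sequences of positions into one ordered sequence.

    Ordering by value is an assumption made for this sketch.
    """
    return sorted(x for s in seqs for x in s)


def pair_op(a: A, b: B) -> Tuple[A, B]:
    """Combine two extracted values into one pair (assumed semantics)."""
    return (a, b)


def extract_names(doc: str) -> List[str]:
    """Hypothetical extraction program: select 'name:' lines, then map
    each selected line to the text after the colon."""
    lines = doc.splitlines()
    records = filter_op(lambda line: line.startswith("name:"), lines)
    return map_op(lambda line: line.split(":", 1)[1].strip(), records)


# Hypothetical sample document, not taken from the paper's benchmark.
SAMPLE_DOC = "name: Ada\n# comment\nname: Alan\n"
```

In this sketch, `extract_names(SAMPLE_DOC)` filters the lines down to the two records and maps each to its value. In the framework described in the abstract, a program of this kind would be synthesized from user-provided examples rather than written by hand.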



  • Published in

    ACM SIGPLAN Notices  Volume 49, Issue 6
    PLDI '14
    June 2014
    598 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2666356
    • Editor: Andy Gill
    • PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation
      June 2014
      619 pages
      ISBN:9781450327848
      DOI:10.1145/2594291

    Copyright © 2014 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 9 June 2014
