Abstract
Various document types that combine model and view (e.g., text files, webpages, spreadsheets) make it easy to organize (possibly hierarchical) data, but make it difficult to extract raw data for any further manipulation or querying. We present a general framework, FlashExtract, to extract relevant data from semi-structured documents using examples. It includes: (a) an interaction model that allows end-users to give examples to extract various fields and to relate them in a hierarchical organization using structure and sequence constructs; and (b) an inductive synthesis algorithm that synthesizes the intended program from a few examples in any underlying domain-specific language for data extraction that has been built using our specified algebra of a few core operators (map, filter, merge, and pair). We describe instantiations of our framework for three different domains: text files, webpages, and spreadsheets. On our benchmark comprising 75 documents, FlashExtract is able to extract the intended data using an average of 2.36 examples in 0.84 seconds per field.
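To make the four core operators more concrete, the following is a minimal Python sketch of how map, filter, merge, and pair might compose into a small extraction program over a text file. It is an illustration only, not the paper's DSL or its synthesis algorithm: the function names (`map_op`, `filter_op`, `merge_op`, `pair_op`, `line_spans`, `extract_field`), the representation of a region as a `(start, end)` character span, and the toy document are all assumptions introduced here.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")
Span = Tuple[int, int]  # hypothetical region type: (start, end) character offsets


def map_op(f: Callable[[T], U], seq: List[T]) -> List[U]:
    """Apply f to every element of a sequence."""
    return [f(x) for x in seq]


def filter_op(pred: Callable[[T], bool], seq: List[T]) -> List[T]:
    """Keep only the elements that satisfy pred."""
    return [x for x in seq if pred(x)]


def merge_op(*seqs: List[Span]) -> List[Span]:
    """Union several span sequences, ordered by position in the document."""
    return sorted(set().union(*seqs))


def pair_op(start: int, end: int) -> Span:
    """Combine a start and an end position into one region."""
    return (start, end)


def line_spans(text: str) -> List[Span]:
    """Split a document into one span per line (newline excluded)."""
    spans, start = [], 0
    for line in text.splitlines(keepends=True):
        spans.append((start, start + len(line.rstrip("\n"))))
        start += len(line)
    return spans


def extract_field(text: str, label: str) -> List[Span]:
    """Filter lines starting with label, then map each to its value span."""
    lines = filter_op(lambda s: text[s[0]:s[1]].startswith(label), line_spans(text))
    return map_op(lambda s: pair_op(s[0] + len(label), s[1]), lines)


# Toy semi-structured document (invented for this sketch).
doc = "Name: Ada\nAge: 36\nName: Alan\nAge: 41\n"

names = extract_field(doc, "Name: ")
ages = extract_field(doc, "Age: ")
all_values = merge_op(names, ages)

print([doc[s:e] for s, e in names])
print([doc[s:e] for s, e in all_values])
```

In this sketch the extraction program is written by hand. In the framework described above, a program of this shape would instead be synthesized from the user's examples.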