Abstract
With hundreds of millions of users, spreadsheets are one of the most important end-user applications. Spreadsheets are easy to use and allow users great flexibility in storing data. This flexibility comes at a price: users often treat spreadsheets as a poor man's database, leading to creative solutions for storing high-dimensional data. The trouble arises when users need to answer queries with their data. Data manipulation tools make strong assumptions about data layouts and cannot read these ad-hoc databases. Converting data into the appropriate layout requires programming skills or a major investment in manual reformatting. The effect is that a vast amount of real-world data is "locked-in" to a proliferation of one-off formats. We introduce FlashRelate, a synthesis engine that lets ordinary users extract structured relational data from spreadsheets without programming. Instead, users extract data by supplying examples of output relational tuples. FlashRelate uses these examples to synthesize a program in Flare. Flare is a novel extraction language that extends regular expressions with geometric constructs. An interactive user interface on top of FlashRelate lets end users extract data by point-and-click. We demonstrate that correct Flare programs can be synthesized in seconds from a small set of examples for 43 real-world scenarios. Finally, our case study demonstrates FlashRelate's usefulness in addressing the widespread problem of data trapped in corporate and government formats.
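To make the idea of "regular expressions with geometric constructs" concrete, the following is a minimal Python sketch of a hand-written extractor that pairs a regex constraint on each cell with a spatial constraint relating cells. It is not Flare syntax and not FlashRelate's synthesis algorithm: the grid, the patterns, and the helper names `nearest` and `extract` are all assumptions made for illustration. FlashRelate would synthesize a program of this general kind from example tuples rather than require the user to write it.

```python
import re

# Hypothetical ad-hoc layout: years across the top, countries down the side.
GRID = [
    ["",        "2012", "2013"],
    ["Albania", "1.0",  "1.1"],
    ["Austria", "2.5",  "2.7"],
]

def nearest(grid, row, col, d_row, d_col, pattern):
    """Walk from (row, col) in direction (d_row, d_col) and return the
    first cell whose whole content matches pattern, or None."""
    r, c = row + d_row, col + d_col
    while 0 <= r < len(grid) and 0 <= c < len(grid[r]):
        if re.fullmatch(pattern, grid[r][c]):
            return grid[r][c]
        r, c = r + d_row, c + d_col
    return None

def extract(grid):
    """Return (country, year, value) tuples.

    Cell constraint: a value cell looks like a decimal number.
    Spatial constraints: its country is the nearest alphabetic cell to
    the left; its year is the nearest four-digit cell above.
    """
    tuples = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if not re.fullmatch(r"\d+\.\d+", cell):
                continue
            country = nearest(grid, r, c, 0, -1, r"[A-Za-z ]+")
            year = nearest(grid, r, c, -1, 0, r"\d{4}")
            if country is not None and year is not None:
                tuples.append((country, year, cell))
    return tuples

for t in extract(GRID):
    print(t)
```

Each output tuple corresponds to one row of the relational table the user wants. In the setting the abstract describes, a user would supply a few such tuples as examples instead of writing the patterns and directions by hand.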
FlashRelate: extracting relational data from semi-structured spreadsheets using examples
PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation