skip to main content
research-article

FlashRelate: extracting relational data from semi-structured spreadsheets using examples

Published:03 June 2015Publication History
Skip Abstract Section

Abstract

With hundreds of millions of users, spreadsheets are one of the most important end-user applications. Spreadsheets are easy to use and allow users great flexibility in storing data. This flexibility comes at a price: users often treat spreadsheets as a poor man's database, leading to creative solutions for storing high-dimensional data. The trouble arises when users need to answer queries with their data. Data manipulation tools make strong assumptions about data layouts and cannot read these ad-hoc databases. Converting data into the appropriate layout requires programming skills or a major investment in manual reformatting. The effect is that a vast amount of real-world data is "locked-in" to a proliferation of one-off formats. We introduce FlashRelate, a synthesis engine that lets ordinary users extract structured relational data from spreadsheets without programming. Instead, users extract data by supplying examples of output relational tuples. FlashRelate uses these examples to synthesize a program in Flare. Flare is a novel extraction language that extends regular expressions with geometric constructs. An interactive user interface on top of FlashRelate lets end users extract data by point-and-click. We demonstrate that correct Flare programs can be synthesized in seconds from a small set of examples for 43 real-world scenarios. Finally, our case study demonstrates FlashRelate's usefulness addressing the widespread problem of data trapped in corporate and government formats.

References

  1. D. Angluin. Learning regular sets from queries and counterexamples. Inf. Comput., 75(2):87–106, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. J. Cafarella, A. Halevy, and J. Madhavan. Structured data on the web. CACM, 54(2):72–79, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. C.-H. Chang and S.-C. Lui. Iepad: information extraction based on pattern discovery. In WWW, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Z. Chen and M. Cafarella. Automatic web spreadsheet dat extraction. In SSW’13, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Z. Chen, M. Cafarella, J. Chen, D. Prevo, and J. Zhuang. Senbazuru: a prototype spreadsheet database management system. PVLDB, 6(12):1202–1205, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. V. Crescenzi, G. Mecca, and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In VLDB, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Cunha, J. Saraiva, and J. Visser. From spreadsheets to relational databases and back. In PEPM 2009, pp. 179–188. ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. E. Ferrara, P. De Meo, G. Fiumara, and R. Baumgartner. Web data extraction, applications and techniques: a survey. arXiv preprint arXiv:1207.0246, 2012.Google ScholarGoogle Scholar
  9. K. Fisher and D. Walker. The PADS project: an overview. In ICDT, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. I. Fisher and G. Rothermel. The EUSES Spreadsheet Corpus: A shared resource for supporting experimentation with spreadsheet dependability mechanisms. In 1st WEUSE, pp. 47–51, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Gulwani. Automating string processing in spreadsheets using inputoutput examples. In POPL, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Gulwani. Synthesis from examples: Interaction models and algorithms. In SYNASC, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. W. R. Harris and S. Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. F. Hermans, M. Pinzger, and A. van Deursen. Automatically extracting class diagrams from spreadsheets. In ECOOP 2010 - Object-Oriented Programming, 24th European Conference, Maribor, Slovenia, June 21- 25, 2010. Proceedings, pp. 52–75, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Inf. Syst., 23(9), 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In CHI, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. B. Kruskal. On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem. Proceedings of the American Mathematical Society, 7(1):48–50, Feb. 1956.Google ScholarGoogle ScholarCross RefCross Ref
  18. N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In IJCAI (1), 1997.Google ScholarGoogle Scholar
  19. V. Le and S. Gulwani. FlashExtract: A Framework for Data Extraction by Examples. In PLDI, pp. 542–553, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. H. Lieberman. Your Wish Is My Command: Programming by Example. Morgan Kaufmann, 2001.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. E. Lu, R. Bodik, and B. Hartmann. Quicksilver: Automatic Synthesis of Relational Queries. Tech. Rep. UCB/EECS-2013-68, UC-Berkeley, May 2013.Google ScholarGoogle Scholar
  22. I. Muslea, S. Minton, and C. A. Knoblock. A hierarchical approach to wrapper induction. In Agents, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. E. Oro and M. Ruffolo. Sila: a spatial instance learning approach for deep webpages. In CIKM 2011, pp. 2329–2332. ACM, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. E. Oro, M. Ruffolo, and S. Staab. Sxpath: extending xpath towards spatial querying on web documents. PVLDB, 4(2):129–140, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. ProPublica. Tabula: Extract tables from pdfs, 2014.Google ScholarGoogle Scholar
  26. T. Register. Microsoft feeds excel to supercomputer, Nov. 2009.Google ScholarGoogle Scholar
  27. R. Verborgh and M. De Wilde. Using OpenRefine. Packt Publishing, Sept. 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. P. Wegner. A technique for counting ones in a binary computer. Commun. ACM, 3(5):322–, May 1960. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. J. Weiss. How news organizations are using tabula for data journalism, Sept. 2013.Google ScholarGoogle Scholar
  30. K. Q. Zhu, K. Fisher, and D. Walker. Learnpads++: Incremental inference of ad hoc data formats. In Proceedings of the 14th International Conference on Practical Aspects of Declarative Languages, PADL’12, pp. 168–182, Berlin, Heidelberg, 2012. Springer-Verlag. Introduction Flare Language FlashRelate Synthesis Algorithm Definitions Algorithm Step 1: Determine Cell Constraints Step 2: Determine Spatial Constraints Step 3: Find a Satisfying Set of Constraints Implementation Details Complexity Analysis Flare Run-Time FlashRelate Run-Time Evaluation Benchmark Spreadsheets and Tasks Experimental Setup Results Case Study Related Work Conclusion Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. FlashRelate: extracting relational data from semi-structured spreadsheets using examples

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image ACM SIGPLAN Notices
              ACM SIGPLAN Notices  Volume 50, Issue 6
              PLDI '15
              June 2015
              630 pages
              ISSN:0362-1340
              EISSN:1558-1160
              DOI:10.1145/2813885
              • Editor:
              • Andy Gill
              Issue’s Table of Contents
              • cover image ACM Conferences
                PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
                June 2015
                630 pages
                ISBN:9781450334686
                DOI:10.1145/2737924

              Copyright © 2015 ACM

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 3 June 2015

              Check for updates

              Qualifiers

              • research-article

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader
            About Cookies On This Site

            We use cookies to ensure that we give you the best experience on our website.

            Learn more

            Got it!