research-article
Open Access
Artifacts Evaluated & Functional

WebRelate: integrating web data with spreadsheets using examples

Published: 27 December 2017

Abstract

Integrating web sources with relational data is a key challenge for data scientists and spreadsheet users. Programmatically joining web data with relational data poses two main difficulties. First, most websites do not expose a direct interface for obtaining tabular data, so the user must devise logic to reach a different webpage for each input row in the relational table. Second, after reaching the desired webpage, the user must write complex scripts to extract the relevant data, which is often conditioned on the input data. Since many data scientists and end-users come from diverse backgrounds, writing such complex regular-expression-based scripts for data integration tasks is often beyond their programming expertise.

We present WebRelate, a system that allows users to join semi-structured web data with relational data in spreadsheets using input-output examples. WebRelate decomposes the web data integration task into two sub-tasks: i) URL learning and ii) input-dependent web extraction. We introduce a novel synthesis paradigm called "Output-constrained Programming By Examples", which uses the finite set of possible outputs for the new inputs to efficiently constrain the search in the synthesis algorithm. We instantiate this paradigm for the two sub-tasks in WebRelate. The first sub-task generates the URLs of the webpages containing the desired data for all rows in the relational table. WebRelate achieves this by learning a string transformation program from a few example URLs. The second sub-task uses examples of the desired data to be extracted from the corresponding webpages and learns a program to extract the data for the other rows. We design expressive domain-specific languages (DSLs) for URL generation and web data extraction, and present efficient synthesis algorithms for learning programs in these DSLs from a few input-output examples. We evaluate WebRelate on 88 real-world web data integration tasks taken from online help forums and the Excel product team, and show that WebRelate can learn the desired programs within a few seconds using only 1 example for the majority of the tasks.
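To make the output-constrained idea concrete for the URL learning sub-task, the following is a minimal Python sketch, not WebRelate's actual DSL or synthesis algorithm. It assumes a toy program space of the form prefix + transform(cell) + suffix. The transformation names, the helper functions (`candidate_programs`, `run`, `learn`), and the `example.com` URLs are all hypothetical. One example URL typically leaves several programs consistent; the finite set of possible output URLs for the other rows is then used to discard programs whose outputs fall outside that set.

```python
# Hypothetical per-cell string transformations; the paper's DSL is richer.
TRANSFORMS = {
    "identity": lambda s: s,
    "lower": str.lower,
    "upper": str.upper,
    "dash": lambda s: s.replace(" ", "-"),
    "lower_dash": lambda s: s.lower().replace(" ", "-"),
}


def candidate_programs(row, url):
    """Enumerate (prefix, column, transform, suffix) templates that rebuild url from row."""
    programs = []
    for col, cell in enumerate(row):
        for name, fn in TRANSFORMS.items():
            piece = fn(cell)
            start = url.find(piece) if piece else -1
            if start >= 0:
                # Everything around the transformed cell is treated as a constant string.
                programs.append((url[:start], col, name, url[start + len(piece):]))
    return programs


def run(program, row):
    """Apply a learned template to a new input row."""
    prefix, col, name, suffix = program
    return prefix + TRANSFORMS[name](row[col]) + suffix


def learn(example_row, example_url, other_rows, possible_outputs):
    """Keep programs whose output on each other row lies in that row's finite output set."""
    consistent = []
    for program in candidate_programs(example_row, example_url):
        if all(run(program, r) in outs for r, outs in zip(other_rows, possible_outputs)):
            consistent.append(program)
    return consistent
```

For instance, with the single example `("paris",)` mapped to `https://example.com/weather/paris`, four of the five toy transformations reproduce the example. Supplying a candidate URL set for the row `("New York",)` that contains `https://example.com/weather/new-york` leaves only the `lower_dash` template, which can then be applied to the remaining rows with `run`.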


Supplemental Material

webrelate.webm

