Abstract
Data integration between web sources and relational data is a key challenge faced by data scientists and spreadsheet users. There are two main challenges in programmatically joining web data with relational data. First, most websites do not expose a direct interface for obtaining tabular data, so the user must devise logic to reach a different webpage for each input row in the relational table. Second, after reaching the desired webpage, the user must write complex scripts to extract the relevant data, and the extraction is often conditioned on the input data. Many data scientists and end-users come from diverse backgrounds, so writing such complex, regular-expression-based scripts for data integration tasks is often beyond their programming expertise.
We present WebRelate, a system that allows users to join semi-structured web data with relational data in spreadsheets using input-output examples. WebRelate decomposes the web data integration task into two sub-tasks: i) URL learning and ii) input-dependent web extraction. We introduce a novel synthesis paradigm called "Output-constrained Programming By Examples", which uses the finite set of possible outputs for the new inputs to efficiently constrain the search in the synthesis algorithm. We instantiate this paradigm for the two sub-tasks in WebRelate. The first sub-task generates the URLs of the webpages that contain the desired data for all rows in the relational table. WebRelate achieves this by learning a string transformation program from a few example URLs. The second sub-task takes examples of the data to be extracted from the corresponding webpages and learns a program that extracts the data for the other rows. We design expressive domain-specific languages (DSLs) for URL generation and web data extraction, and present efficient synthesis algorithms for learning programs in these DSLs from a few input-output examples. We evaluate WebRelate on 88 real-world web data integration tasks taken from online help forums and the Excel product team, and show that WebRelate can learn the desired programs within a few seconds using only one example for the majority of the tasks.
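To make the idea of output-constrained synthesis for URL learning concrete, the following is a minimal sketch in Python. It is not WebRelate's DSL or synthesis algorithm: the tiny program space (constant prefix, one transformed column, constant suffix), the `TRANSFORMS` table, the function names, and the `example.com` URLs are all assumptions made for illustration. It only shows how a finite set of admissible outputs can prune programs that a single example leaves ambiguous.

```python
# Toy illustration only: the program space and all names here are hypothetical.

# Hypothetical string transformations applied to a cell before it enters a URL.
TRANSFORMS = {
    "id": lambda s: s,
    "lower": lambda s: s.lower(),
    "dash": lambda s: s.lower().replace(" ", "-"),
    "plus": lambda s: s.replace(" ", "+"),
}


def candidate_programs(row, url):
    """Yield programs (prefix, column, transform, suffix) that rebuild `url` from `row`."""
    for col, value in row.items():
        for name, fn in TRANSFORMS.items():
            piece = fn(value)
            idx = url.find(piece) if piece else -1
            if idx >= 0:
                yield (url[:idx], col, name, url[idx + len(piece):])


def run(program, row):
    """Execute a program on a table row and return the generated URL."""
    prefix, col, name, suffix = program
    return prefix + TRANSFORMS[name](row[col]) + suffix


def learn_url_programs(examples, unlabeled_rows, valid_urls):
    """Keep programs that match every example AND whose outputs on the
    unlabeled rows all fall inside the finite set `valid_urls`."""
    first_row, first_url = examples[0]
    learned = []
    for prog in candidate_programs(first_row, first_url):
        matches_examples = all(run(prog, r) == u for r, u in examples)
        outputs_allowed = all(run(prog, r) in valid_urls for r in unlabeled_rows)
        if matches_examples and outputs_allowed:
            learned.append(prog)
    return learned


# One example URL is ambiguous: all four transforms map "seattle" to itself.
examples = [({"city": "seattle"}, "https://example.com/weather/seattle")]
unlabeled_rows = [{"city": "New York"}]
# Assumed finite set of admissible outputs (e.g. URLs known to exist).
valid_urls = {
    "https://example.com/weather/seattle",
    "https://example.com/weather/new-york",
}

learned = learn_url_programs(examples, unlabeled_rows, valid_urls)
generated = [run(learned[0], r) for r in unlabeled_rows]
```

In this sketch the single example admits four candidate programs, and only the one using the `dash` transform produces an admissible URL for the unlabeled row, so the output constraint resolves the ambiguity without a second example.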
Index Terms
WebRelate: integrating web data with spreadsheets using examples