Abstract
With more and more web scripting languages on offer, programmers have access to increasing language support for web scraping tasks. However, in our experience collaborating with data scientists, we learned that two issues still plague long-running scraping scripts: i) When a network or website goes down mid-scrape, recovery sometimes requires restarting from the beginning, which users find frustratingly slow. ii) Websites do not offer atomic snapshots of their databases; they update their content so frequently that output data is cluttered with slight variations of the same information, e.g., a tweet from profile 1 that is retweeted on profile 2 and scraped from both profiles, once with 52 responses and later with 53 responses.
We introduce the skip block, a language construct that addresses both of these disparate problems. Programmers write lightweight annotations to indicate when the current object can be considered equivalent to a previously scraped object and direct the program to skip over the scraping actions in the block. The construct is hierarchical, so programs can skip over long or short script segments, allowing adaptive reuse of prior work. After network and server failures, skip blocks accelerate failure recovery by 7.9x on average. Even scripts that do not encounter failures benefit; because sites display redundant objects, skipping over them accelerates scraping by up to 2.1x. For longitudinal scraping tasks that aim to fetch only new objects, the second run exhibits an average speedup of 5.2x. Our small user study reveals that programmers can quickly produce skip block annotations.
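The following Python sketch illustrates the idea described above; it is not the authors' implementation or syntax. The names `SkipBlockLog`, `run`, and `scrape`, the dictionary-based page data, and the `fail_on` parameter are all hypothetical. A block's key plays the role of the annotation: two objects with the same key are treated as equivalent, and a key is recorded only after its block finishes, so a block interrupted by a failure is retried on restart while its completed inner blocks are skipped.

```python
from typing import Callable, Hashable, Set


class SkipBlockLog:
    """Remembers the keys of objects whose block has already completed."""

    def __init__(self) -> None:
        self._committed: Set[Hashable] = set()

    def run(self, key: Hashable, body: Callable[[], None]) -> bool:
        """Run body unless an equivalent object was already handled.

        Returns True if body ran, False if it was skipped.
        """
        if key in self._committed:
            return False
        body()  # if this raises, the key is never recorded
        self._committed.add(key)
        return True


def scrape(profiles, log, output, fail_on=None):
    """Hypothetical two-level scrape: each profile contains tweets."""
    for profile in profiles:
        def scrape_profile(profile=profile):
            for tweet in profile["tweets"]:
                def scrape_tweet(tweet=tweet):
                    if tweet["id"] == fail_on:
                        raise ConnectionError("simulated network failure")
                    output.append((tweet["id"], tweet["text"]))
                # The key ignores the response count, so the same tweet
                # seen with 52 and later 53 responses counts as one object.
                log.run(("tweet", tweet["id"]), scrape_tweet)
        # Outer block: a fully scraped profile is skipped as a whole.
        log.run(("profile", profile["name"]), scrape_profile)


# Made-up data: tweet t1 appears on both profiles with different counts.
profiles = [
    {"name": "profile1", "tweets": [
        {"id": "t1", "text": "hello", "responses": 52},
        {"id": "t2", "text": "world", "responses": 3},
    ]},
    {"name": "profile2", "tweets": [
        {"id": "t1", "text": "hello", "responses": 53},
        {"id": "t3", "text": "again", "responses": 0},
    ]},
]

log, output = SkipBlockLog(), []
try:
    scrape(profiles, log, output, fail_on="t3")  # outage while scraping t3
except ConnectionError:
    pass
scrape(profiles, log, output)  # restart: completed blocks are skipped
print(output)
```

In this sketch the restart skips all of profile 1 through the outer block and skips the duplicate tweet on profile 2 through the inner block, so only the tweet that failed is scraped again. Keeping the log in memory is a simplification; recovery across process restarts or longitudinal runs would need the log stored durably.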
Index Terms
Skip blocks: reusing execution history to accelerate web scripts