research-article
Open Access

Skip blocks: reusing execution history to accelerate web scripts

Published: 12 October 2017

Abstract

With more and more web scripting languages on offer, programmers have access to increasing language support for web scraping tasks. However, in our experience collaborating with data scientists, we learned that two issues still plague long-running scraping scripts: i) When a network or website goes down mid-scrape, recovery sometimes requires restarting from the beginning, which users find frustratingly slow. ii) Websites do not offer atomic snapshots of their databases; they update their content so frequently that output data is cluttered with slight variations of the same information. For example, a tweet from profile 1 that is retweeted on profile 2 may be scraped from both profiles, once with 52 responses and later with 53 responses.

We introduce the skip block, a language construct that addresses both of these disparate problems. Programmers write lightweight annotations to indicate when the current object can be considered equivalent to a previously scraped object and direct the program to skip over the scraping actions in the block. The construct is hierarchical, so programs can skip over long or short script segments, allowing adaptive reuse of prior work. After network and server failures, skip blocks accelerate failure recovery by 7.9x on average. Even scripts that do not encounter failures benefit; because sites display redundant objects, skipping over them accelerates scraping by up to 2.1x. For longitudinal scraping tasks that aim to fetch only new objects, the second run exhibits an average speedup of 5.2x. Our small user study reveals that programmers can quickly produce skip block annotations.
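
The idea can be illustrated with a minimal Python sketch. It is not the paper's language or implementation: `SkipStore`, `skip_block`, and `scrape` are hypothetical names, the "pages" are plain dictionaries rather than live websites, and the record of committed objects lives in memory, whereas a real scraper would presumably persist it so that it survives a crash. The key passed to each block plays the role of the annotation: it names the attributes that make two objects equivalent and leaves out volatile ones such as the response count.

```python
from typing import Callable, Hashable, List, Set


class SkipStore:
    """Remembers which objects were fully scraped (hypothetical helper)."""

    def __init__(self) -> None:
        self.committed: Set[Hashable] = set()

    def skip_block(self, key: Hashable, body: Callable[[], None]) -> bool:
        """Run body unless an equivalent object was committed earlier.

        Returns True if body ran, False if it was skipped.
        """
        if key in self.committed:
            return False
        body()  # if body raises, the key is never committed
        self.committed.add(key)
        return True


def scrape(profiles: List[dict], store: SkipStore, output: List[dict]) -> None:
    """Scrape tweets from each profile, skipping work already done."""
    for profile in profiles:

        def scrape_profile(profile: dict = profile) -> None:
            for tweet in profile["tweets"]:
                # Inner block: the key ignores the volatile response count,
                # so a retweet seen on another profile counts as a duplicate.
                key = ("tweet", tweet["author"], tweet["text"])
                store.skip_block(key, lambda t=tweet: output.append(t))

        # Outer block: a profile that was completed earlier is skipped whole.
        store.skip_block(("profile", profile["name"]), scrape_profile)
```

Because the blocks nest, a run that fails halfway through a profile leaves that profile uncommitted but keeps its already-scraped tweets committed. A restart with the same store re-enters the profile and skips only those tweets, and a later longitudinal run skips every profile and tweet it has already seen.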

