BUbiNG: Massive Crawling for the Masses

Published: 01 June 2018

Abstract

Although web crawlers have been around for twenty years, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems, and, at the same time, scales linearly with the amount of resources available. This article aims to fill this gap by describing BUbiNG, our next-generation web crawler built upon the authors’ experience with UbiCrawler [9] and on the last ten years of research on the topic. BUbiNG is an open-source, fully distributed crawler written in Java; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second while respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG distributes jobs using modern high-speed protocols to achieve very high throughput.

References

  1. Internet Archive website. 1996. Homepage. Retrieved May 16, 2018 from http://archive.org/web/web.php.
  2. Heritrix Web Site. 2003. Homepage. Retrieved May 16, 2018 from https://webarchive.jira.com/wiki/display/Heritrix/.
  3. The ClueWeb09 Dataset. 2009. Homepage. Retrieved May 16, 2018 from http://lemurproject.org/clueweb09/.
  4. ISO 28500:2009, Information and documentation -- WARC file format. Retrieved May 16, 2018 from https://www.iso.org/standard/44717.html.
  5. Dimitris Achlioptas, Aaron Clauset, David Kempe, and Cristopher Moore. 2009. On the bias of traceroute sampling: Or, power-law degree distributions in regular graphs. Journal ACM 56, 4 (2009), 21:1--21:28.
  6. Sarker Tanzir Ahmed, Clint Sparkman, Hsin-Tsang Lee, and Dmitri Loguinov. 2015. Around the web in six weeks: Documenting a large-scale crawl. In Proceedings of the 2015 IEEE Conference on Computer Communications (INFOCOM). IEEE, 1598--1606.
  7. Tim Berners-Lee, Roy Thomas Fielding, and Larry Masinter. 2005. Uniform Resource Identifier (URI): Generic Syntax. Retrieved May 16, 2018 from http://www.ietf.org/rfc/rfc3986.txt.
  8. Burton H. Bloom. 1970. Space-time trade-offs in hash coding with allowable errors. Comm. ACM 13, 7 (1970), 422--426.
  9. Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. 2004. UbiCrawler: A scalable fully distributed web crawler. Software: Practice & Experience 34, 8 (2004), 711--726.
  10. Paolo Boldi, Andrea Marino, Massimo Santini, and Sebastiano Vigna. 2014. BUbiNG: Massive crawling for the masses. In WWW’14 Companion. 227--228.
  11. Paolo Boldi and Sebastiano Vigna. 2013. In-core computation of geometric centralities with hyperball: A hundred billion nodes and beyond. In Proc. of 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW’13). IEEE.
  12. Paolo Boldi and Sebastiano Vigna. 2014. Axioms for centrality. Internet Math. 10, 3--4 (2014), 222--262.
  13. Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30, 1 (1998), 107--117.
  14. Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web. In Selected Papers from the 6th International Conference on World Wide Web. Elsevier Science Publishers Ltd., Essex, UK, 1157--1166.
  15. M. Burner. 1997. Crawling towards eternity: Building an archive of the world wide web. Web Techniques 2, 5 (1997).
  16. Jamie Callan. 2012. The Lemur Project and its ClueWeb12 Dataset. Invited talk at the SIGIR 2012 Workshop on Open-Source Information Retrieval. (2012).
  17. Soumen Chakrabarti. 2003. Mining the Web -- Discovering Knowledge from Hypertext Data. Morgan Kaufmann. I--XVIII, 1--345 pages.
  18. Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In STOC. 380--388.
  19. Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. Power-law distributions in empirical data. SIAM Rev. 51, 4 (2009), 661--703.
  20. Jenny Edwards, Kevin McCurley, and John Tomlin. 2001. An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th International Conference on World Wide Web (WWW’01). ACM, New York, NY, 106--113.
  21. D. Eichmann. 1994. The RBSE spider: Balancing effective search against web load. In Proceedings of the 1st World Wide Web Conference.
  22. Peter Elias. 1974. Efficient storage and retrieval by content and address of static files. J. Assoc. Comput. Mach. 21, 2 (1974), 246--260.
  23. Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. 2003. A large-scale study of the evolution of web pages. In Proceedings of the 12th Conference on World Wide Web. ACM Press.
  24. R. Fielding. 1994. Maintaining distributed hypertext infostructures: Welcome to MOMspider. In Proceedings of the 1st International Conference on the World Wide Web.
  25. Allan Heydon and Marc Najork. 1999. Mercator: A scalable, extensible web crawler. World Wide Web 2, 4 (April 1999), 219--229.
  26. Rohit Khare, Doug Cutting, Kragen Sitaker, and Adam Rifkin. 2004. Nutch: A flexible and scalable open-source web search engine. CommerceNet Labs Technical Report 04-04.
  27. Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov. 2009. IRLbot: Scaling to 6 billion pages and beyond. ACM Trans. Web 3, 3, Article 8 (July 2009), 34 pages.
  28. Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In WWW’07: Proceedings of the 16th International Conference on World Wide Web. ACM, New York, NY, 141--150.
  29. Oliver A. McBryan. 1994. GENVL and WWWW: Tools for taming the web. In Proceedings of the 1st World Wide Web Conference. 79--90.
  30. Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. 2015. The graph structure in the web -- Analyzed on different aggregation levels. The Journal of Web Science 1, 1 (2015), 33--47.
  31. Maged M. Michael and Michael L. Scott. 1996. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing (PODC’96). ACM, 267--275.
  32. Seyed M. Mirtaheri, Mustafa Emre Dincturk, Salman Hooshmand, Gregor V. Bochmann, Guy-Vincent Jourdan, and Iosif-Viorel Onut. 2013. A brief history of web crawlers. In CASCON.
  33. Gordon Mohr, Michele Kimpton, Micheal Stack, and Igor Ranitovic. 2004. Introduction to Heritrix, an archival quality web crawler. In Proceedings of the 4th International Web Archiving Workshop (IWAW’04).
  34. Marc Najork and Allan Heydon. 2001. High-Performance Web Crawling. Technical Report 173. Compaq Systems Research Center.
  35. Marc Najork and Allan Heydon. 2002. High-performance web crawling. In Handbook of Massive Data Sets, James Abello, Panos M. Pardalos, and Mauricio G. C. Resende (Eds.). Kluwer Academic Publishers, 25--45.
  36. Christopher Olston and Marc Najork. 2010. Web crawling. Foundations and Trends in Information Retrieval 4, 3 (2010), 175--246.
  37. Brian Pinkerton. 1994. Finding what people want: Experiences with the webcrawler. In Proceedings of the 2nd International World Wide Web (Online & CDROM review: the international journal of), Anonymous (Ed.), Vol. 18(6). Learned Information, Medford, NJ.
  38. Vladislav Shkapenyuk and Torsten Suel. 2002. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering. 357--368.
  39. Sebastiano Vigna. 2013. Fibonacci binning. CoRR abs/1312.3749 (2013).

        • Published in

          ACM Transactions on the Web  Volume 12, Issue 2
          May 2018
          174 pages
          ISSN: 1559-1131
          EISSN: 1559-114X
          DOI: 10.1145/3176641

          Copyright © 2018 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 June 2018
          • Accepted: 1 November 2017
          • Revised: 1 October 2017
          • Received: 1 January 2016
          Qualifiers

          • research-article
          • Research
          • Refereed
