Abstract
Although web crawlers have been around for twenty years, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems, and at the same time scales linearly with the amount of resources available. This article aims to fill this gap by describing BUbiNG, our next-generation web crawler, built upon the authors’ experience with UbiCrawler [9] and on the last ten years of research on the topic. BUbiNG is an open-source, fully distributed crawler written in Java; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second while respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG distributes jobs using modern high-speed protocols to achieve very high throughput.
- Internet Archive website. 1996. Homepage. Retrieved May 16, 2018 from http://archive.org/web/web.php.
- Heritrix Web Site. 2003. Homepage. Retrieved May 16, 2018 from https://webarchive.jira.com/wiki/display/Heritrix/.
- The ClueWeb09 Dataset. 2009. Homepage. Retrieved May 16, 2018 from http://lemurproject.org/clueweb09/.
- ISO 28500:2009, Information and documentation - WARC file format. Retrieved May 16, 2018 from https://www.iso.org/standard/44717.html.
- Dimitris Achlioptas, Aaron Clauset, David Kempe, and Cristopher Moore. 2009. On the bias of traceroute sampling: Or, power-law degree distributions in regular graphs. J. ACM 56, 4 (2009), 21:1--21:28.
- Sarker Tanzir Ahmed, Clint Sparkman, Hsin-Tsang Lee, and Dmitri Loguinov. 2015. Around the web in six weeks: Documenting a large-scale crawl. In Proceedings of the 2015 IEEE Conference on Computer Communications (INFOCOM). IEEE, 1598--1606.
- Tim Berners-Lee, Roy Thomas Fielding, and Larry Masinter. 2005. Uniform Resource Identifier (URI): Generic Syntax. Retrieved May 16, 2018 from http://www.ietf.org/rfc/rfc3986.txt.
- Burton H. Bloom. 1970. Space-time trade-offs in hash coding with allowable errors. Comm. ACM 13, 7 (1970), 422--426.
- Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. 2004. UbiCrawler: A scalable fully distributed web crawler. Software: Practice & Experience 34, 8 (2004), 711--726.
- Paolo Boldi, Andrea Marino, Massimo Santini, and Sebastiano Vigna. 2014. BUbiNG: Massive crawling for the masses. In WWW’14 Companion. 227--228.
- Paolo Boldi and Sebastiano Vigna. 2013. In-core computation of geometric centralities with hyperball: A hundred billion nodes and beyond. In Proc. of 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW’13). IEEE.
- Paolo Boldi and Sebastiano Vigna. 2014. Axioms for centrality. Internet Math. 10, 3--4 (2014), 222--262.
- Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30, 1 (1998), 107--117.
- Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web. In Selected Papers from the 6th International Conference on World Wide Web. Elsevier Science Publishers Ltd., Essex, UK, 1157--1166.
- M. Burner. 1997. Crawling towards eternity: Building an archive of the world wide web. Web Techniques 2, 5 (1997).
- Jamie Callan. 2012. The Lemur Project and its ClueWeb12 Dataset. Invited talk at the SIGIR 2012 Workshop on Open-Source Information Retrieval. (2012).
- Soumen Chakrabarti. 2003. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann. I--XVIII, 1--345 pages.
- Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In STOC. 380--388.
- Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. Power-law distributions in empirical data. SIAM Rev. 51, 4 (2009), 661--703.
- Jenny Edwards, Kevin McCurley, and John Tomlin. 2001. An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th International Conference on World Wide Web (WWW’01). ACM, New York, NY, 106--113.
- D. Eichmann. 1994. The RBSE spider: Balancing effective search against web load. In Proceedings of the 1st World Wide Web Conference.
- Peter Elias. 1974. Efficient storage and retrieval by content and address of static files. J. Assoc. Comput. Mach. 21, 2 (1974), 246--260.
- Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. 2003. A large-scale study of the evolution of web pages. In Proceedings of the 12th Conference on World Wide Web. ACM Press.
- R. Fielding. 1994. Maintaining distributed hypertext infostructures: Welcome to MOMspider. In Proceedings of the 1st International Conference on the World Wide Web.
- Allan Heydon and Marc Najork. 1999. Mercator: A scalable, extensible web crawler. World Wide Web 2, 4 (April 1999), 219--229.
- Rohit Khare, Doug Cutting, Kragen Sitaker, and Adam Rifkin. 2004. Nutch: A flexible and scalable open-source web search engine. CommerceNet Labs Technical Report 04-04.
- Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov. 2009. IRLbot: Scaling to 6 billion pages and beyond. ACM Trans. Web 3, 3, Article 8 (July 2009), 34 pages.
- Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In WWW’07: Proceedings of the 16th International Conference on World Wide Web. ACM, New York, NY, 141--150.
- Oliver A. McBryan. 1994. GENVL and WWWW: Tools for taming the web. In Proceedings of the 1st World Wide Web Conference. 79--90.
- Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. 2015. The graph structure in the web - Analyzed on different aggregation levels. The Journal of Web Science 1, 1 (2015), 33--47.
- Maged M. Michael and Michael L. Scott. 1996. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing (PODC’96). ACM, 267--275.
- Seyed M. Mirtaheri, Mustafa Emre Dincturk, Salman Hooshmand, Gregor V. Bochmann, Guy-Vincent Jourdan, and Iosif-Viorel Onut. 2013. A brief history of web crawlers. In CASCON.
- Gordon Mohr, Michele Kimpton, Micheal Stack, and Igor Ranitovic. 2004. Introduction to Heritrix, an archival quality web crawler. In Proceedings of the 4th International Web Archiving Workshop (IWAW’04).
- Marc Najork and Allan Heydon. 2001. High-Performance Web Crawling. Technical Report 173. Compaq Systems Research Center.
- Marc Najork and Allan Heydon. 2002. High-performance web crawling. In Handbook of Massive Data Sets, James Abello, Panos M. Pardalos, and Mauricio G. C. Resende (Eds.). Kluwer Academic Publishers, 25--45.
- Christopher Olston and Marc Najork. 2010. Web crawling. Foundations and Trends in Information Retrieval 4, 3 (2010), 175--246.
- Brian Pinkerton. 1994. Finding what people want: Experiences with the webcrawler. In Proceedings of the 2nd International World Wide Web (Online & CDROM review: the international journal of), Anonymous (Ed.), Vol. 18(6). Learned Information, Medford, NJ.
- Vladislav Shkapenyuk and Torsten Suel. 2002. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering. 357--368.
- Sebastiano Vigna. 2013. Fibonacci binning. CoRR abs/1312.3749 (2013).