IRLbot: Scaling to 6 Billion Pages and Beyond

ABSTRACT
This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation, and it models the crawler's performance. We show that, because of the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In a recent experiment that lasted 41 days, IRLbot, running on a single server, successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
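The headline rates follow directly from the reported totals. A minimal back-of-envelope sketch in Python (assuming the 41-day duration is rounded and that the quoted rate is in megabits per second, so the derived values are approximate):

```python
# Back-of-envelope check of the crawl statistics quoted in the abstract.
# Assumptions: the 41-day duration is rounded, and the 319 Mb/s figure
# means megabits per second, so derived values differ slightly from the
# reported ones.

SECONDS_PER_DAY = 86_400

days = 41          # reported crawl duration (rounded)
pages = 6.3e9      # valid HTML pages downloaded
links = 394e9      # hyperlinks parsed out of those pages
rate_bps = 319e6   # average download rate in bits per second

seconds = days * SECONDS_PER_DAY
print(f"pages/s:    {pages / seconds:,.0f}")   # ~1,778 (reported: 1,789)
print(f"links/page: {links / pages:.1f}")      # ~62.5 links per page on average
print(f"KB/page:    {rate_bps / 8 / (pages / seconds) / 1e3:.1f}")  # ~22 KB per page, incl. overhead
```

The small gap between the derived 1,778 pages/s and the reported 1,789 pages/s is consistent with the crawl lasting slightly less than a full 41 days.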