Abstract
This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.
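As a rough sanity check on the figures quoted in the abstract, the short Python sketch below recomputes the crawl rate from the page count and duration, and derives two quantities the abstract does not state: the implied average page size and the average number of links per page. It assumes the download rate is in megabits per second and that the crawl lasted exactly 41 days; the function names are illustrative only and are not taken from the article.

```python
SECONDS_PER_DAY = 86_400


def pages_per_second(pages: float, days: float) -> float:
    """Average crawl rate, assuming the crawl ran for exactly `days` days."""
    return pages / (days * SECONDS_PER_DAY)


def implied_bytes_per_page(megabits_per_second: float, rate_pages: float) -> float:
    """Average bytes per page, assuming the rate is in megabits (10^6 bits) per second."""
    return megabits_per_second * 1e6 / 8 / rate_pages


def links_per_page(total_links: float, pages: float) -> float:
    """Average number of links parsed out of each crawled page."""
    return total_links / pages


# Figures quoted in the abstract.
rate = pages_per_second(6.3e9, 41)         # compare with the stated 1,789 pages/s
size = implied_bytes_per_page(319, 1789)   # derived, not stated in the abstract
links = links_per_page(394e9, 6.3e9)       # derived, not stated in the abstract

print(f"crawl rate:     {rate:,.0f} pages/s")
print(f"page size:      {size / 1000:.1f} kB")
print(f"links per page: {links:.1f}")
```

With the duration rounded to 41 days, the recomputed rate lands within about one percent of the stated 1,789 pages/s, and the small gap is consistent with rounding in the reported duration and page count.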
IRLbot: Scaling to 6 billion pages and beyond