research-article

IRLbot: Scaling to 6 billion pages and beyond

Published: 03 July 2009

Abstract

This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.
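
The headline figures in the abstract can be cross-checked with a little arithmetic. The sketch below is not from the article: the function name `crawl_summary` and its parameter names are hypothetical, the inputs are the rounded values quoted above, and the download rate is assumed to mean megabits per second (10^6 bits).

```python
SECONDS_PER_DAY = 86_400


def crawl_summary(pages, days, rate_megabits, pages_per_sec, links):
    """Derive rough averages from the rounded totals quoted in the abstract."""
    seconds = days * SECONDS_PER_DAY
    return {
        # Page rate implied by total pages and total duration.
        "pages_per_sec_from_totals": pages / seconds,
        # Assumption: rate is in megabits/s and all of it is attributed
        # to the counted HTML pages (8 bits per byte).
        "avg_bytes_per_page": rate_megabits * 1e6 / 8 / pages_per_sec,
        # Parsed links divided by crawled pages.
        "avg_links_per_page": links / pages,
    }


summary = crawl_summary(
    pages=6.3e9,         # valid HTML pages
    days=41,             # crawl duration
    rate_megabits=319,   # average download rate
    pages_per_sec=1789,  # reported page rate
    links=394e9,         # links parsed out
)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

With these rounded inputs the implied rate comes out near 1,778 pages/s, close to the reported 1,789 pages/s; the gap is consistent with rounding in "6.3 billion" and "41 days". The same inputs suggest roughly 22 kB per page and about 62.5 links per page. The per-page size is only an upper-end estimate, since the 7.6 billion connection requests include traffic beyond the 6.3 billion valid HTML pages.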

            • Published in

              ACM Transactions on the Web  Volume 3, Issue 3
              June 2009
              179 pages
              ISSN:1559-1131
              EISSN:1559-114X
              DOI:10.1145/1541822

              Copyright © 2009 ACM

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 3 July 2009
              • Accepted: 1 March 2009
              • Revised: 1 February 2009
              • Received: 1 March 2008

              Qualifiers

              • research-article
              • Research
              • Refereed
