
The Web as a graph: How far we are

Published: 01 February 2007

Abstract

In this article we present an experimental study of the properties of webgraphs. We study a large crawl from 2001 of 200M pages and about 1.4 billion edges, made available by the WebBase project at Stanford, as well as several synthetic graphs generated according to various recently proposed models. We investigate several topological properties of such graphs, including the number of bipartite cores and strongly connected components, the distribution of degrees and PageRank values, and some correlations; we present a comparison study of the models against these measures. Our findings are that (i) the WebBase sample differs slightly from the (older) samples studied in the literature, and (ii) although these models do not capture all of its properties, they do exhibit some peculiar behaviors not found, for example, in the models from classical random graph theory. Moreover, we developed a software library able to generate and measure massive graphs in secondary memory; this library is publicly available under the GPL licence. We discuss its implementation and some computational issues related to secondary memory graph algorithms.
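To make the measures named in the abstract concrete, the following is a minimal in-memory sketch of three of them: the in-degree distribution, the strongly connected components, and PageRank. It is not the authors' library, which works in secondary memory; all function names, the edge-list representation, the choice of Kosaraju's algorithm, and the damping factor of 0.85 are assumptions made for illustration only.

```python
from collections import Counter


def in_degree_distribution(edges, n):
    """Map each in-degree value to the number of nodes that have it."""
    indeg = [0] * n
    for _, v in edges:
        indeg[v] += 1
    return dict(Counter(indeg))


def strongly_connected_components(edges, n):
    """Kosaraju's algorithm with iterative DFS; returns a list of node sets."""
    fwd = [[] for _ in range(n)]
    rev = [[] for _ in range(n)]
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # First pass: record nodes in order of DFS completion on the forward graph.
    order, seen = [], [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(fwd[s]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if not seen[nxt]:
                    seen[nxt] = True
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Second pass: sweep the reversed graph in decreasing completion order.
    comps, assigned = [], [False] * n
    for s in reversed(order):
        if assigned[s]:
            continue
        assigned[s] = True
        comp, stack = set(), [s]
        while stack:
            node = stack.pop()
            comp.add(node)
            for nxt in rev[node]:
                if not assigned[nxt]:
                    assigned[nxt] = True
                    stack.append(nxt)
        comps.append(comp)
    return comps


def pagerank(edges, n, damping=0.85, iters=50):
    """Power iteration; rank held by dangling nodes is spread uniformly."""
    out = [[] for _ in range(n)]
    for u, v in edges:
        out[u].append(v)
    rank = [1.0 / n] * n
    for _ in range(iters):
        dangling = sum(rank[u] for u in range(n) if not out[u])
        new = [(1 - damping) / n + damping * dangling / n] * n
        for u in range(n):
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    new[v] += share
        rank = new
    return rank


if __name__ == "__main__":
    # Toy graph: a 3-cycle (0 -> 1 -> 2 -> 0) plus a sink node 3.
    toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (0, 3)]
    print(in_degree_distribution(toy_edges, 4))
    print(strongly_connected_components(toy_edges, 4))
    print(pagerank(toy_edges, 4))
```

A graph of the size studied in the article (about 1.4 billion edges) would not fit these adjacency lists in main memory on typical hardware of the time, which is the motivation the abstract gives for secondary memory algorithms.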



      • Published in

        ACM Transactions on Internet Technology, Volume 7, Issue 1
        February 2007
        184 pages
        ISSN: 1533-5399
        EISSN: 1557-6051
        DOI: 10.1145/1189740

        Copyright © 2007 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States
