Abstract
In this article we present an experimental study of the properties of webgraphs. We study a large crawl from 2001 of 200M pages and about 1.4 billion edges, made available by the WebBase project at Stanford, as well as several synthetic webgraphs generated according to various recently proposed models. We investigate several topological properties of these graphs, including the number of bipartite cores and strongly connected components, the distributions of degrees and PageRank values, and some correlations between them; we then compare the models against these measures. Our findings are that (i) the WebBase sample differs slightly from the (older) samples studied in the literature, and (ii) although these models do not capture all of its properties, they do exhibit some peculiar behaviors not found, for example, in the models from classical random graph theory. Moreover, we developed a software library able to generate and measure massive graphs in secondary memory; this library is publicly available under the GPL licence. We discuss its implementation and some computational issues related to secondary memory graph algorithms.
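To make the measures named in the abstract concrete, here is a minimal in-memory sketch in Python of three of them: the in-degree distribution, PageRank, and the number of strongly connected components. It is not the authors' library, which works in secondary memory on far larger graphs. The function names, the toy edge list, the damping factor of 0.85, and the fixed iteration count are all assumptions made for illustration.

```python
from collections import Counter


def in_degree_distribution(edges, n):
    """Map each in-degree value to the number of nodes that have it."""
    indeg = [0] * n
    for _, v in edges:
        indeg[v] += 1
    return Counter(indeg)


def pagerank(edges, n, damping=0.85, iters=50):
    """Power iteration; rank held by dangling nodes is spread uniformly."""
    out = [[] for _ in range(n)]
    for u, v in edges:
        out[u].append(v)
    rank = [1.0 / n] * n
    for _ in range(iters):
        dangling = sum(rank[u] for u in range(n) if not out[u])
        new = [(1.0 - damping) / n + damping * dangling / n] * n
        for u in range(n):
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    new[v] += share
        rank = new
    return rank


def count_sccs(edges, n):
    """Count strongly connected components (Kosaraju, iterative DFS)."""
    fwd = [[] for _ in range(n)]
    rev = [[] for _ in range(n)]
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    # First pass: record nodes in order of DFS completion.
    seen = [False] * n
    order = []
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            u, i = stack.pop()
            if i < len(fwd[u]):
                stack.append((u, i + 1))
                v = fwd[u][i]
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, 0))
            else:
                order.append(u)
    # Second pass: explore the reversed graph in reverse completion order.
    seen = [False] * n
    count = 0
    for s in reversed(order):
        if seen[s]:
            continue
        count += 1
        seen[s] = True
        stack = [s]
        while stack:
            u = stack.pop()
            for v in rev[u]:
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
    return count


# Hypothetical toy graph: a 3-cycle (0, 1, 2) with a tail 2 -> 3 -> 4.
EDGES = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
N = 5
```

On a crawl with hundreds of millions of pages, the adjacency lists built here would not fit in main memory, which is the computational issue the abstract raises for secondary memory graph algorithms.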
The Web as a graph: How far we are