W-tree: A Compact External Memory Representation for Webgraphs

Published: 08 February 2016

Abstract

World Wide Web applications must use, constantly update, and maintain large webgraphs to carry out tasks such as calculating the web impact factor, finding hubs and authorities, performing link analysis with webometrics tools, and ranking webpages in web search engines. Such webgraphs require a large amount of main memory and frequently do not fit in it completely, even when compressed, so applications must resort to external memory. This article presents a new compact representation for webgraphs, called the w-tree, which is designed specifically for external memory. It supports basic queries (e.g., full read, random read, and batch random read), set-oriented queries (e.g., superset, subset, equality, overlap, range, inlink, and co-inlink), and some advanced queries, such as edge reciprocal and hub and authority. A new layout tree designed specifically for webgraphs is also proposed; it reduces the overall storage cost and allows the random read query to run asymptotically faster in the worst case. To validate the advantages of the w-tree, a series of experiments compares an implementation of it with a compact main memory representation. The results show that the w-tree is competitive in compression time, compression rate, and query time, and that set-oriented queries may execute several orders of magnitude faster than on its competitors. The results provide empirical evidence that a compact external memory representation for webgraphs is feasible in real applications, contradicting assumptions previously made by several researchers.
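The abstract names the supported queries without defining them. As a rough illustration only, the sketch below shows one plausible reading of several of them over a plain in-memory adjacency map. It is not the w-tree and not the authors' algorithms; the function names, the `Graph` type, and the exact query semantics (for example, that a superset query returns pages whose out-link set contains the query set) are assumptions made for this example.

```python
from typing import Dict, List, Set

# Hypothetical toy model: page id -> set of pages it links to (out-links).
Graph = Dict[int, Set[int]]


def random_read(g: Graph, page: int) -> Set[int]:
    """Return the out-links of a single page."""
    return g.get(page, set())


def superset_query(g: Graph, q: Set[int]) -> List[int]:
    """Pages whose out-link set contains every page in q (assumed semantics)."""
    return sorted(p for p, outs in g.items() if outs >= q)


def subset_query(g: Graph, q: Set[int]) -> List[int]:
    """Pages whose out-link set is contained in q (assumed semantics)."""
    return sorted(p for p, outs in g.items() if outs <= q)


def equality_query(g: Graph, q: Set[int]) -> List[int]:
    """Pages whose out-link set is exactly q (assumed semantics)."""
    return sorted(p for p, outs in g.items() if outs == q)


def overlap_query(g: Graph, q: Set[int]) -> List[int]:
    """Pages whose out-link set shares at least one page with q."""
    return sorted(p for p, outs in g.items() if outs & q)


def inlink_query(g: Graph, target: int) -> List[int]:
    """Pages that link to target."""
    return sorted(p for p, outs in g.items() if target in outs)


def co_inlink_query(g: Graph, a: int, b: int) -> List[int]:
    """Pages that link to both a and b (assumed semantics)."""
    return superset_query(g, {a, b})


def edge_reciprocal(g: Graph, u: int, v: int) -> bool:
    """True if u links to v and v links back to u."""
    return v in g.get(u, set()) and u in g.get(v, set())


# Toy webgraph with four pages.
web: Graph = {1: {2, 3}, 2: {1, 3}, 3: {1}, 4: {2, 3}}

print(superset_query(web, {2, 3}))
print(inlink_query(web, 3))
print(edge_reciprocal(web, 4, 2))
```

Each set-oriented function here scans every adjacency list, which is the cost a dedicated index would aim to avoid; the sketch only pins down what the queries could mean, not how they might be answered efficiently.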
