Abstract
World Wide Web applications need to use, constantly update, and maintain large webgraphs to execute several tasks, such as calculating the web impact factor, finding hubs and authorities, performing link analysis with webometrics tools, and ranking webpages in web search engines. Such webgraphs require a large amount of main memory and frequently do not fit in it entirely, even when compressed. Applications therefore require the use of external memory. This article presents a new compact representation for webgraphs, called the w-tree, which is designed specifically for external memory. It supports the execution of basic queries (e.g., full read, random read, and batch random read), set-oriented queries (e.g., superset, subset, equality, overlap, range, inlink, and co-inlink), and some advanced queries, such as edge reciprocal and hub and authority. Furthermore, a new layout tree designed specifically for webgraphs is also proposed, which reduces the overall storage cost and allows the random read query to be performed with an asymptotically faster worst-case runtime. To validate the advantages of the w-tree, a series of experiments is performed that compares an implementation of the w-tree with a compact main memory representation. The results show that the w-tree is competitive in compression time, compression rate, and query time, and that it may execute set-oriented queries several orders of magnitude faster than its competitors. The results provide empirical evidence that it is feasible to use a compact external memory representation for webgraphs in real applications, contradicting previous assumptions made by several researchers.
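To make the query names above concrete, the following is a minimal sketch of what some of them could mean on a plain in-memory adjacency list. It is not the w-tree and not the authors' implementation: the function names, the toy graph `G`, and the exact query semantics (for instance, that a superset query returns the pages whose outlink set contains a given set) are assumptions made for illustration only.

```python
from typing import Dict, List, Set

# Toy webgraph (hypothetical data): page id -> set of pages it links to.
Graph = Dict[int, Set[int]]
G: Graph = {0: {1, 2}, 1: {0, 2}, 2: {3}, 3: set()}


def random_read(g: Graph, page: int) -> Set[int]:
    """Outlinks of a single page (empty set if the page is unknown)."""
    return g.get(page, set())


def superset_query(g: Graph, targets: Set[int]) -> List[int]:
    """Pages whose outlink set contains every page in `targets` (assumed semantics)."""
    return sorted(p for p, out in g.items() if out >= targets)


def subset_query(g: Graph, targets: Set[int]) -> List[int]:
    """Pages whose outlink set is contained in `targets` (assumed semantics)."""
    return sorted(p for p, out in g.items() if out <= targets)


def equality_query(g: Graph, targets: Set[int]) -> List[int]:
    """Pages whose outlink set is exactly `targets`."""
    return sorted(p for p, out in g.items() if out == targets)


def overlap_query(g: Graph, targets: Set[int]) -> List[int]:
    """Pages that share at least one outlink with `targets`."""
    return sorted(p for p, out in g.items() if out & targets)


def inlink(g: Graph, page: int) -> List[int]:
    """Pages that link to `page`."""
    return sorted(p for p, out in g.items() if page in out)


def co_inlink(g: Graph, a: int, b: int) -> List[int]:
    """Pages that link to both `a` and `b`."""
    return superset_query(g, {a, b})


def edge_reciprocal(g: Graph, u: int, v: int) -> bool:
    """True if both u -> v and v -> u exist."""
    return v in random_read(g, u) and u in random_read(g, v)
```

Each set-oriented query in this sketch scans every adjacency list, which is the naive cost that a dedicated index structure would aim to avoid.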
W-tree: A Compact External Memory Representation for Webgraphs