research-article

Extraction and classification of dense implicit communities in the Web graph

Published: 30 April 2009

Abstract

The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information, and services, and there is growing interest in tools for understanding collective behavior and emerging phenomena in the WWW. In this article we focus on the problem of searching for and classifying communities in the Web. Loosely speaking, a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense subgraph of the Web graph (where Web pages are nodes and hyperlinks are arcs of the Web graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm to Web graphs built from three publicly available large crawls of the Web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the Web graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for communities of thirty nodes or more (even at low density), and it is still about 80% for communities of twenty nodes with density over 50% of the arcs present. At the lower extreme, the algorithm catches 35% of dense communities made of ten nodes. We also develop some sufficient conditions for the detection of a community under some local graph models and not-too-restrictive hypotheses. We complete our Community Watch system by clustering the communities found in the Web graph into homogeneous groups by topic and labeling each group with representative keywords.
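To make the evaluation setup in the abstract concrete, the following is a minimal sketch of planting an artificial community in a graph and measuring its density. It is not the authors' algorithm or experimental code. The abstract does not define density precisely, so the sketch assumes it is the fraction of possible directed arcs among the community's nodes that are present; the function names `subgraph_density` and `plant_community` are hypothetical.

```python
import random
from itertools import permutations


def subgraph_density(arcs, nodes):
    """Fraction of possible directed arcs among `nodes` that are present.

    Assumption (not from the paper): density = arcs inside the node set
    divided by n * (n - 1), ignoring self-loops.
    """
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    inside = sum(1 for (u, v) in arcs if u != v and u in nodes and v in nodes)
    return inside / (n * (n - 1))


def plant_community(arcs, nodes, density, rng):
    """Return a copy of `arcs` with randomly chosen arcs added among `nodes`.

    The number of added arcs is the requested density times the number of
    possible ordered pairs; `arcs` must be a set of (source, target) pairs.
    """
    candidates = list(permutations(sorted(nodes), 2))
    k = round(density * len(candidates))
    planted = set(arcs)
    planted.update(rng.sample(candidates, k))
    return planted


# Toy "Web graph" with three pages, plus a planted community of twenty nodes
# in which half of the possible arcs are present.
rng = random.Random(0)
web = {(0, 1), (1, 2), (2, 0)}
community = range(100, 120)
planted = plant_community(web, community, 0.5, rng)
print(subgraph_density(planted, community))
```

A detection experiment in the spirit of the abstract would run a dense-subgraph finder on `planted` and check whether the node set `community` is recovered; that step is omitted here because the abstract does not describe the algorithm itself.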

Published in

ACM Transactions on the Web, Volume 3, Issue 2
April 2009
98 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/1513876

Copyright © 2009 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 30 April 2009
• Accepted: 1 February 2009
• Revised: 1 October 2008
• Received: 1 February 2008

Qualifiers

• research-article
• Research
• Refereed
