skip to main content
research-article

Link analysis for Web spam detection

Published:03 March 2008Publication History
Skip Abstract Section

Abstract

We propose link-based techniques for automatic detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The use of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means that, in practice, many algorithms are infeasible.

We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. These statistical features are used to build Web spam classifiers which only consider the link structure of the Web, regardless of page contents. We then present a study of the performance of each of the classifiers alone, as well as their combined performance, by testing them over a large collection of Web link spam. After tenfold cross-validation, our best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, but are orthogonal to content-based methods.

References

  1. Alon, N., Matias, Y., and Szegedy, M. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1, 137--147. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Angelova, R. and Weikum, G. 2006. Graph-based text classification: learn from your neighbors. Proceedings of the International ACM SIGIR Conference, 485--492. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Baeza-Yates, R., Boldi, P., and Castillo, C. 2006. Generalizing pagerank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR. ACM Press. WA, 308--315. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Baeza-Yates, R., Castillo, C., and López, V. 2005. Pagerank increase under different collusion topologies. In 1st International Workshop on Adversarial Information Retrieval on the Web.Google ScholarGoogle Scholar
  5. Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. 2006a. Link-based characterization and detection of Web Spam. In 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb). Seattle, WA.Google ScholarGoogle Scholar
  7. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. 2006b. Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD). ACM Press.Google ScholarGoogle Scholar
  8. Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. 2005. Spamrank: Fully automatic link spam detection. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web. Chiba, Japan.Google ScholarGoogle Scholar
  9. Boldi, P., Codenotti, B., Santini, M., and Vigna, S. 2004. Ubicrawler: A scalable fully distributed web crawler. Softw. Pract. Exp. 34, 8, 711--726. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Breiman, L. 1996. Bagging predictors. Machine Learn. 24, 2, 123--140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Broder, A. and Mitzenmacher, M. 2003. Network applications of Bloom filters: A survey. Internet Math. 1, 4, 485--509.Google ScholarGoogle ScholarCross RefCross Ref
  12. Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. 2006a. A reference collection for web spam. SIGIR Forum 40, 2, 11--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. 2006b. A reference collection for web spam. SIGIR Forum 40, 2, 11--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. 2007. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR). ACM Press, 423--430. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Cohen, E. 1997. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci. 55, 3, 441--453. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Costa, L., Rodrigues, F. A., Travieso, G., and Villas. 2005. Characterization of complex networks: A survey of measurements. URL: http://arxiv.org/abs/cond-mat/0505185.Google ScholarGoogle Scholar
  17. da Costa-Carvalho, A. L., Chirita, P.-A., de Moura, E. S., Calado, P., and Nejdl, W. 2006. Site level noise removal for search engines. In Proceedings of the 15th International Conference on World Wide Web (WWW'06), ACM Press, 73--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Davison, B. D. 2000a. Recognizing nepotistic links on the Web. In Artificial Intelligence for Web Search. AAAI Press, TX, 23--28.Google ScholarGoogle Scholar
  19. Davison, B. D. 2000b. Topical locality in the web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 272--279. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Demetrescu, C., Finocchi, I., and Ribichini, A. 2006. Trading off space for passes in graph streaming problems. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Drost, I. and Scheffer, T. 2005. Thwarting the nigritude ultramarine: learning to identify link spam. In Proceedings of the 16th European Conference on Machine Learning (ECML). Lecture Notes in Artificial Intelligence, vol. 3720. Springer, 233--243. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Durand, M. and Flajolet, P. 2003. Loglog counting of large cardinalities (extended abstract). In Proceedings of 11th Annual European Symposium on Algorithms. Lecture Notes in Computer Science, vol. 2832. Springer, 605--617.Google ScholarGoogle Scholar
  23. Eiron, N., Curley, K. S., and Tomlin, J. A. 2004. Ranking the web frontier. In Proceedings of the 13th International Conference on World Wide Web. ACM Press, 309--318. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Feigenbaum, J., Kannan, S., Gregor, M. A., Suri, S., and Zhang, J. 2004. On graph problems in a semi-streaming model. In 31st International Colloquium on Automata, Languages and Programming.Google ScholarGoogle Scholar
  25. Fetterly, D., Manasse, M., and Najork, M. 2004. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th Workshop on the Web and Databases (WebDB). Paris, France. 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Flajolet, P. and Martin, N. G. 1985. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31, 2, 182--209. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Gibson, D., Kumar, R., and Tomkins, A. 2005. Discovering large dense subgraphs in massive graphs. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05). 721--732. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Gomes, L. H., Almeida, R. B., Bettencourt, L. M. A., Almeida, V., and Almeida, J. M. 2005. Comparative graph theoretical characterization of networks of spam and legitimate email. URL: http://www.ceas.cc/papers-2005/131.pdf.Google ScholarGoogle Scholar
  29. Gori, M. and Witten, I. 2005. The bubble of web visibility. Comm. ACM 48, 3, 115--117. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Gulli, A. and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Poster Proceedings of the 14th International Conference on World Wide Web. ACM Press, 902--903. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Gupta, S., Anderson, R. M., and May, R. M. 1989. Networks of sexual contacts: implications for the pattern of spread of hiv. AIDS 3, 12, 807--817.Google ScholarGoogle ScholarCross RefCross Ref
  32. Gyöngyi, Z., Berkhin, P., Garcia-Molina, H., and Pedersen, J. 2006. Link spam detection based on mass estimation. In Proceedings of the 32nd International Conference on Very Large Data Bases. ACM, 439--450. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Gyöngyi, Z. and Garcia-Molina, H. 2005. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web.Google ScholarGoogle Scholar
  34. Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. 2004. Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann, 576--587. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Haveliwala, T. 1999. Efficient computation of pagerank. Tech. rep., Stanford University.Google ScholarGoogle Scholar
  36. Henzinger, M. R., Raghavan, P., and Rajagopalan, S. 1999. Computing on data streams. In Dimacs Series in Discrete Mathematics and Theoretical Computer Science, 107--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Leskovec, J., Kleinberg, J., and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceeding of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD'05). ACM Press, 177--187. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Lipton, R. J. and Naughton, J. F. 1989. Estimating the size of generalized transitive closures. In Proceedings of the 15th International Conference on Very Large Data Bases (VLDB'89). Morgan Kaufmann Publishers Inc., 165--171. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Lu, Q. and Getoor, L. 2003. Link-based classification. In Proceedings of the International Conference on Machine Learning.Google ScholarGoogle Scholar
  40. Mitzenmacher, M. and Upfal, E. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Morris, R. 1978. Counting large numbers of events in small registers. Comm. ACM 21, 10, 840--842. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. 2006. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web Conference. Edinburgh, Scotland. 83--92. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Page, L., Brin, S., Motwani, R., and Winograd, T. 1998. The PageRank citation ranking: bringing order to the Web. Tech. rep., Stanford Digital Library Technologies Project.Google ScholarGoogle Scholar
  44. Palmer, C. R., Gibbons, P. B., and Faloutsos, C. 2002. ANF: A fast and scalable tool for data mining in massive graphs. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press. 81--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Perkins, A. 2001. The classification of search engine spam. http://www.silverdisc.co.uk/articles/spam-classification/.Google ScholarGoogle Scholar
  46. Qi, X. and Davison, B. D. 2006. Knowing a web page by the company it keeps. In Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM). Arlington, VA. 228--237. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Shen, G., Gao, B., Liu, T.-Y., Feng, G., Song, S., and Li, H. 2006. Detecting link spam using temporal information. In Proceedings of the International Conference on Data Mining (ICDM). Hong Kong. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Vitter, J. S. 2001. External memory algorithms and data structures. ACM Comput. Surv. 33, 2, 209--271. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Witten, I. H. and Frank, E. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Wu, B. and Davison, B. D. 2005. Identifying link farm spam pages. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web (WWW'05), ACM Press, 820--829. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Zhang, H., Goel, A., Govindan, R., Mason, K., and Van Roy, B. 2004. Making eigenvector-based reputation systems robust to collusion. In Proceedings of the 3rd Workshop on Web Graphs (WAW). Lecture Notes in Computer Science, vol. 3243. Springer, 92--104.Google ScholarGoogle ScholarCross RefCross Ref
  52. Zhang, T., Popescul, A., and Dom, B. 2006. Linear prediction models with graph regularization for web-page categorization. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06). ACM Press, 821--826. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Link analysis for Web spam detection

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on the Web
      ACM Transactions on the Web  Volume 2, Issue 1
      February 2008
      280 pages
      ISSN:1559-1131
      EISSN:1559-114X
      DOI:10.1145/1326561
      Issue’s Table of Contents

      Copyright © 2008 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 March 2008
      • Accepted: 1 October 2007
      • Revised: 1 September 2007
      • Received: 1 March 2007
      Published in tweb Volume 2, Issue 1

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader
    About Cookies On This Site

    We use cookies to ensure that we give you the best experience on our website.

    Learn more

    Got it!