ABSTRACT
Web spam can significantly degrade the quality of search engine results. Early web spamming techniques mainly manipulated page content. Because link information is now widely used in web search, link-based spamming has developed as well. Many techniques have been proposed to detect link spam, and these approaches are essentially built on link-based web ranking methods.
In contrast, we cast link spam detection as a machine learning problem of classification on directed graphs. We develop discrete analysis on directed graphs and use it to construct a discrete analogue of classical regularization theory. A classification algorithm for directed graphs is then derived from this discrete regularization. We have applied the approach to real-world link spam detection problems and obtained encouraging results.
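To make the idea of regularized classification on a directed graph concrete, here is a minimal sketch. It is not necessarily the algorithm derived in the paper: it assumes a teleporting random walk and a symmetrized transition operator, one common way to set up regularization on directed graphs. The function names, the parameters `eta` and `alpha`, the toy graph, and the label convention (+1 normal, -1 spam) are all hypothetical choices made for illustration.

```python
import math

def transition_matrix(n, edges, eta=0.99):
    """Row-stochastic matrix of a teleporting random walk (assumed model)."""
    out = [[] for _ in range(n)]
    for u, v in edges:
        out[u].append(v)
    P = []
    for u in range(n):
        if not out[u]:
            # Dangling node: jump uniformly at random.
            P.append([1.0 / n] * n)
            continue
        row = [(1.0 - eta) / n] * n
        for v in out[u]:
            row[v] += eta / len(out[u])
        P.append(row)
    return P

def stationary_distribution(P, iters=2000):
    """Approximate pi with pi = pi P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def classify(n, edges, labels, alpha=0.5, iters=200):
    """Spread the known labels over the graph; the sign of each score is the prediction."""
    P = transition_matrix(n, edges)
    s = [math.sqrt(p) for p in stationary_distribution(P)]
    # Symmetrized operator: theta = (Pi^1/2 P Pi^-1/2 + Pi^-1/2 P^T Pi^1/2) / 2.
    theta = [[0.5 * (s[i] * P[i][j] / s[j] + s[j] * P[j][i] / s[i])
              for j in range(n)] for i in range(n)]
    y = [float(labels.get(i, 0)) for i in range(n)]
    f = y[:]
    # Fixed-point iteration f <- alpha * theta f + y, converging to (I - alpha*theta)^-1 y.
    for _ in range(iters):
        f = [alpha * sum(theta[i][j] * f[j] for j in range(n)) + y[i]
             for i in range(n)]
    return f

# Toy graph: two tightly linked groups joined by a single directed link 2 -> 3.
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1),
         (3, 4), (4, 3), (3, 5), (5, 3), (4, 5), (5, 4),
         (2, 3)]
scores = classify(6, edges, {0: 1, 5: -1})  # node 0 labeled normal, node 5 labeled spam
```

Only two nodes carry labels here; the scores of the remaining nodes come entirely from the link structure, which is what makes the setting transductive. Smaller values of `alpha` keep predictions closer to nearby labeled nodes.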
Index Terms
Transductive link spam detection