Propagating Both Trust and Distrust with Target Differentiation for Combating Link-Based Web Spam

Published: 08 July 2014

Abstract

Semi-automatic anti-spam algorithms propagate either trust through links from a good seed set (e.g., TrustRank) or distrust through inverse links from a bad seed set (e.g., Anti-TrustRank) to the entire Web. These algorithms have proven effective in combating link-based Web spam because they integrate both human judgement and machine intelligence. Nevertheless, there is still much room for improvement. One issue with most existing trust/distrust propagation algorithms is that only trust or only distrust is propagated, and only a good seed set or only a bad seed set is used. According to Wu et al. [2006a], combining trust and distrust propagation can lead to better results, and an effective framework is needed to realize this insight. Another, more serious issue with existing algorithms is that trust or distrust is propagated in nondifferential ways, that is, a page propagates its trust or distrust score uniformly to its neighbors, without considering whether each neighbor should be trusted or distrusted. Such blind propagation schemes are inconsistent with the original intention of trust/distrust propagation. However, it seems impossible to implement differential propagation if only trust or only distrust is propagated. In this article, we take the view that each Web page has both a trustworthy side and an untrustworthy side, and we thus assign two scores to each Web page: T-Rank, scoring the trustworthiness of the page, and D-Rank, scoring the untrustworthiness of the page. We then propose an integrated framework that propagates both trust and distrust. In the framework, the propagation of T-Rank/D-Rank is penalized by the target's current D-Rank/T-Rank. In other words, the propagation of T-Rank/D-Rank is decided by the target's current (generalized) probability of being trustworthy/untrustworthy; thus a page propagates more trust/distrust to a trustworthy/untrustworthy neighbor than to an untrustworthy/trustworthy neighbor. In this way, propagating both trust and distrust with target differentiation is implemented. We use T-Rank scores to realize spam demotion and D-Rank scores to accomplish spam detection. The proposed Trust-DistrustRank (TDR) algorithm reduces to TrustRank and Anti-TrustRank when the penalty factor is set to 1 and 0, respectively. Thus TDR can be seen as a combinatorial generalization of both TrustRank and Anti-TrustRank. TDR not only makes full use of both trust and distrust propagation, but also overcomes the disadvantages of both TrustRank and Anti-TrustRank. Experimental results on benchmark datasets show that TDR outperforms other semi-automatic anti-spam algorithms for both spam demotion and spam detection tasks under various criteria.
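The abstract does not state the update equations, so the following is only a minimal sketch of what propagation with target differentiation might look like, not the article's exact formulation. It assumes a damping factor `alpha`, a penalty factor `beta`, uniform initial scores, and a weight of the form `beta*t / (beta*t + (1-beta)*d)` for the share of incoming trust a target keeps (with the complementary weight for distrust). The function name `tdr`, the fallback used when both scores are zero, and the toy graph are all hypothetical. Under these assumptions the trust update reduces to a TrustRank-style iteration at `beta = 1`, and the distrust update reduces to an Anti-TrustRank-style iteration at `beta = 0`, which matches the limiting behaviour described above.

```python
def tdr(out_links, good_seeds, bad_seeds, beta=0.5, alpha=0.85, iters=100):
    """Illustrative trust/distrust propagation with target differentiation.

    out_links maps each page to the list of pages it links to.
    Returns two dicts: T-Rank-like and D-Rank-like scores per page.
    """
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    in_links = {n: [] for n in nodes}
    for u in nodes:
        for v in out_links.get(u, []):
            in_links[v].append(u)

    # Seed vectors: uniform over the good (trust) and bad (distrust) seeds.
    s_t = {n: 1.0 / len(good_seeds) if n in good_seeds else 0.0 for n in nodes}
    s_d = {n: 1.0 / len(bad_seeds) if n in bad_seeds else 0.0 for n in nodes}

    # Assumption: start from uniform scores so that no page is excluded
    # from receiving trust or distrust before propagation begins.
    t = {n: 1.0 / len(nodes) for n in nodes}
    d = {n: 1.0 / len(nodes) for n in nodes}

    for _ in range(iters):
        new_t, new_d = {}, {}
        for p in nodes:
            # Target differentiation: weight what p receives by p's own
            # current balance of trust versus distrust.
            denom = beta * t[p] + (1.0 - beta) * d[p]
            w_t = beta * t[p] / denom if denom > 0 else beta
            w_d = (1.0 - beta) * d[p] / denom if denom > 0 else 1.0 - beta

            # Trust flows along links: q -> p passes on a share of t[q].
            trust_in = sum(t[q] / len(out_links[q]) for q in in_links[p])
            # Distrust flows along inverse links: p -> q passes d[q] back to p.
            distrust_in = sum(d[q] / len(in_links[q]) for q in out_links.get(p, []))

            new_t[p] = alpha * w_t * trust_in + (1.0 - alpha) * s_t[p]
            new_d[p] = alpha * w_d * distrust_in + (1.0 - alpha) * s_d[p]
        t, d = new_t, new_d
    return t, d


# Toy graph (hypothetical): "g" is a good seed, "x" is a bad seed,
# and "s" exchanges links with "x".
graph = {"g": ["a"], "a": ["b", "s"], "s": ["x"], "x": ["s"]}
t_one, d_one = tdr(graph, {"g"}, {"x"}, beta=1.0)    # trust side only
t_half, d_half = tdr(graph, {"g"}, {"x"}, beta=0.5)  # both sides interact
```

In this toy graph, page `s` receives the same share of trust from `a` as page `b` does when `beta = 1`, but with `beta = 0.5` the distrust that `s` accumulates from linking to `x` shrinks the trust it is allowed to keep.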

References

  1. L. Becchetti, C. Castillo, D. Donato, R. Baeza-Yates, and S. Leonardi. 2008. Link analysis for web spam detection. ACM Trans. Web 2, 1, 2.
  2. L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. 2006. Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD'06). ACM Press, New York.
  3. A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. 2005. Spamrank -- Fully automatic link spam detection. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'05). 25--38.
  4. P. Boldi. 2005. Totalrank: Ranking without damping. In Proceedings of the 14th International Conference on World Wide Web Special Interest Tracks and Posters (WWW'05). ACM Press, New York, 898--899.
  5. S. Brin and L. Page. 1998. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30, 1--7, 107--117.
  6. J. Callan, M. Hoy, C. Yoo, and L. Zhao. 2009. The clueweb09 data set. http://boston.lti.cs.cmu.edu/Data/clueweb09/.
  7. C. Castillo and B. D. Davison. 2011. Adversarial web search. Foundat. Trends Inf. Retr. 4, 5, 377--486.
  8. J. Caverlee and L. Liu. 2007. Countering web spam with credibility-based link analysis. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing (PODC'07). ACM Press, New York, 157--166.
  9. Q. Chen, S.-N. Yu, and S. Cheng. 2008. Link variable trustrank for fighting web spam. In Proceedings of the International Conference on Computer Science and Software Engineering (CSSE'08). Vol. 4, IEEE Computer Society, 1004--1007.
  10. G. Cormack, M. Smucker, and C. Clarke. 2011. Efficient and effective spam filtering and re-ranking for large web datasets. Inf. Retr. 14, 1--25.
  11. B. D. Davison. 2000. Recognizing nepotistic links on the web. In Proceedings of the Workshop on Artificial Intelligence for Web Search (AAAI'00). 23--28.
  12. A. Deif. 1982. Advanced Matrix Theory for Scientists and Engineers. Routledge.
  13. Z. Gyongyi and H. Garcia-Molina. 2005a. Link spam alliances. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05). VLDB Endowment, 517--528.
  14. Z. Gyongyi and H. Garcia-Molina. 2005b. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'05). 39--47.
  15. Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. 2004. Combating web spam with trustrank. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB'04). Vol. 30, VLDB Endowment, 576--587.
  16. M. R. Henzinger, R. Motwani, and C. Silverstein. 2002. Challenges in web search engines. SIGIR Forum 36, 2, 11--22.
  17. P. Heymann, G. Koutrika, and H. Garcia-Molina. 2007. Fighting spam on social web sites: A survey of approaches and future challenges. IEEE Internet Comput. 11, 6, 36--45.
  18. Q. Jiang, L. Zhang, Y. Zhu, and Y. Zhang. 2008. Larger is better: Seed selection in link-based anti-spamming algorithms. In Proceeding of the 17th International Conference on World Wide Web (WWW'08). ACM Press, New York, 1065--1066.
  19. J. M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46, 604--632.
  20. V. Krishnan and R. Raj. 2006. Web spam detection with anti-trust rank. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'06). ACM Press, New York, 37--40.
  21. R. Lempel and S. Moran. 2001. Salsa: The stochastic approach for link-structure analysis. ACM Trans. Inf. Syst. 19, 2, 131--160.
  22. X. Liu, Y. Wang, S. Zhu, and H. Lin. 2013. Combating web spam through trust-distrust propagation with confidence. Pattern Recogn. Lett. 34, 13, 1462--1469.
  23. P. Metaxas. 2009. Using propagation of distrust to find untrustworthy web neighborhoods. In Proceedings of the 4th International Conference on Internet and Web Applications and Services (ICIW'09). IEEE Computer Society, 516--521.
  24. L. Nie, B. Wu, and B. Davison. 2007. Winnowing wheat from the chaff: Propagating trust to sift spam from the web. In Proceedings of the 30th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'07). Vol. 23, 869--870.
  25. C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. 1999. Analysis of a very large web search engine query log. SIGIR Forum 33, 6--12.
  26. N. Spirin and J. Han. 2012. Survey on web spam detection: Principles and algorithms. ACM SIGKDD Explor. Newslett. 13, 2, 50--64.
  27. I. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. Cunningham. 1999. Weka: Practical machine learning tools and techniques with Java implementations. http://researchcommons.waikato.ac.nz/bitstream/handle/10289/1040/uow-cs-wp-1999-11.pdf?sequence=1&isAllowed=y.
  28. B. Wu and K. Chellapilla. 2007. Extracting link spam using biased random walks from spam seed sets. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'07). ACM Press, New York, 37--44.
  29. B. Wu, V. Goel, and B. D. Davison. 2006a. Propagating trust and distrust to demote web spam. In Proceedings of the Workshop on Models of Trust for the Web (MTW'06).
  30. B. Wu, V. Goel, and B. D. Davison. 2006b. Topical trustrank: Using topicality to combat web spam. In Proceedings of the 15th International Conference on World Wide Web (WWW'06). ACM Press, New York, 63--72.
  31. Yahoo!. 2007. Yahoo! research: Web spam collections. http://barcelona.research.yahoo.net/webspam/datasets/. Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/.
  32. L. Zhang, Y. Zhang, Y. Zhang, and X. Li. 2006. Exploring both content and link quality for anti-spamming. In Proceedings of the 6th IEEE International Conference on Computer and Information Technology (CIT'06). IEEE Computer Society, 37.
  33. X. Zhang, B. Han, and W. Liang. 2009b. Automatic seed set expansion for trust propagation based anti-spamming algorithms. In Proceeding of the 11th International Workshop on Web Information and Data Management (WIDM'09). ACM Press, New York, 31--38.
  34. X. Zhang, Y. Wang, N. Mou, and W. Liang. 2011. Propagating both trust and distrust with target differentiation for combating web spam. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI'11).
  35. Y. Zhang, Q. Jiang, L. Zhang, and Y. Zhu. 2009b. Exploiting bidirectional links: Making spamming detection easier. In Proceeding of the 18th ACM Conference on Information and Knowledge Management (CIKM'09). ACM Press, New York, 1839--1842.
  36. L. Zhao, Q. Jiang, and Y. Zhang. 2008. From good to bad ones: Making spam detection easier. In Proceedings of the 8th IEEE International Conference on Computer and Information Technology Workshops (CIT'08). IEEE Computer Society, 129--134.
  37. B. Zhou and J. Pei. 2009. Link spam target detection using page farms. ACM Trans. Knowl. Discov. Data 3, 13:1--13:38.
