research-article

Identifying Web Spam with the Wisdom of the Crowds

Published: 01 March 2012

Abstract

Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam-detection techniques are usually designed for specific, known types of Web spam and are incapable of dealing with newly appearing spam types efficiently. With user-behavior analyses from Web access logs, a spam page-detection algorithm is proposed based on a learning scheme. The main contributions are the following. (1) User-visiting patterns of spam pages are studied, and a number of user-behavior features are proposed for separating Web spam pages from ordinary pages. (2) A novel spam-detection framework is proposed that can detect various kinds of Web spam, including newly appearing ones, with the help of the user-behavior analysis. Experiments on large-scale practical Web access log data show the effectiveness of the proposed features and the detection framework.

Published in

ACM Transactions on the Web, Volume 6, Issue 1
March 2012
109 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/2109205

Copyright © 2012 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 March 2012
• Accepted: 1 June 2011
• Revised: 1 March 2011
• Received: 1 November 2009

Qualifiers

• research-article
• Research
• Refereed
