Abstract
Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam-detection techniques are usually designed for specific, known types of Web spam and cannot deal efficiently with newly appearing spam types. Based on an analysis of user behavior in Web access logs, a learning-based spam page-detection algorithm is proposed. The main contributions are the following. (1) User-visiting patterns of spam pages are studied, and a number of user-behavior features are proposed for separating Web spam pages from ordinary pages. (2) A novel spam-detection framework is proposed that can detect various kinds of Web spam, including newly appearing ones, with the help of user-behavior analysis. Experiments on large-scale, real-world Web access log data show the effectiveness of the proposed features and the detection framework.
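The abstract does not name the individual user-behavior features, so the following is only a minimal sketch of what deriving such features from an access log might look like. Everything in it is an assumption made for illustration: the log format (pairs of referrer URL and visited URL), the function name `behavior_features`, the `SEARCH_HOSTS` list, and the two example features (the share of a page's visits that arrive from a search engine, and how often visitors click onward from the page).

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical list of search-engine hosts; not taken from the paper.
SEARCH_HOSTS = {"www.google.com", "www.bing.com", "www.baidu.com"}


def behavior_features(log):
    """Compute illustrative per-page features from an access log.

    `log` is an iterable of (referrer_url, visited_url) pairs; this
    format is an assumption, not the paper's actual log schema.
    """
    visits = defaultdict(int)       # total visits per page
    from_search = defaultdict(int)  # visits arriving from a search engine
    onward = defaultdict(int)       # times a page acted as the referrer

    for referrer, url in log:
        visits[url] += 1
        if urlparse(referrer).netloc in SEARCH_HOSTS:
            from_search[url] += 1
        onward[referrer] += 1

    return {
        url: {
            # Fraction of this page's visits that came from search results.
            "search_share": from_search[url] / n,
            # Onward clicks from this page, per visit to this page.
            "onward_rate": onward[url] / n,
        }
        for url, n in visits.items()
    }


if __name__ == "__main__":
    sample_log = [
        ("https://www.google.com/search?q=x", "http://spam.example/a"),
        ("https://www.google.com/search?q=y", "http://spam.example/a"),
        ("http://news.example/", "http://news.example/story"),
        ("http://news.example/story", "http://news.example/more"),
    ]
    for page, feats in behavior_features(sample_log).items():
        print(page, feats)
```

Feature vectors of this kind could then be passed to any standard classifier. Which features and which learning scheme the paper actually uses is not stated in the abstract.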
Identifying Web Spam with the Wisdom of the Crowds