Abstract
An ads-portal domain is a Web domain that shows only advertisements, presented as ad listings and served by a third-party advertisement syndication service. We develop a machine-learning-based classifier that identifies ads-portal domains with 96% accuracy, and we use it to measure the prevalence of ads-portal domains on the Internet. Surprisingly, 28.3% of two-level *.com Web domains and 25% of two-level *.net Web domains are ads-portal domains. Moreover, 41% of the *.com and 39.8% of the *.net ads-portal domains are typos of well-known domains, a practice known as typo-squatting. We also apply the classifier to DNS trace files to estimate how often Internet users visit ads-portal domains. About 5% of the two-level *.com, *.net, *.org, *.biz, and *.info Web domains in the traces are ads-portal domains, and about 50% of these accessed ads-portal domains are typos. These numbers show that ads-portal domains, including typo-squatting ones, are prevalent on the Internet and succeed in attracting many visits. Our classifier is a step towards better categorization of Web documents. It can also help search engines' ranking algorithms, help identify Web spam that redirects to ads-portal domains, and be used to discourage access to typo-squatting ads-portal domains.
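To make the notion of a "typo of a well-known domain" concrete, the following is a minimal sketch that flags a two-level domain as a typo candidate when its name is within a small edit distance of a well-known name under the same top-level domain. This is an illustration under stated assumptions, not the method used in the paper: the abstract does not specify how typos are detected, and the function names, the Levenshtein metric, the distance threshold of 1, and the sample inputs are all hypothetical choices. Taking the last two labels as the "two-level" domain is also a simplification.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute), row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def two_level_domain(hostname: str) -> str:
    """Keep only the last two labels, e.g. www.example.com -> example.com.

    Assumption: a simplification that ignores multi-label public suffixes.
    """
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_typo_candidate(domain: str,
                      well_known: Iterable[str],
                      max_distance: int = 1) -> Optional[str]:
    """Return the well-known domain that `domain` looks like a typo of, if any.

    Hypothetical rule: same TLD, different name, edit distance <= max_distance.
    """
    name, _, tld = two_level_domain(domain).partition(".")
    for target in well_known:
        t_name, _, t_tld = two_level_domain(target).partition(".")
        if (tld == t_tld and name != t_name
                and edit_distance(name, t_name) <= max_distance):
            return target
    return None


if __name__ == "__main__":
    # Illustrative inputs only; the list of well-known domains is an assumption.
    print(is_typo_candidate("googkle.com", ["google.com", "yahoo.com"]))
```

A check like this only produces candidates: deciding whether a candidate is actually an ads-portal domain would still need a content classifier such as the one the abstract describes.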
Index Terms
Ads-portal domains: Identification and measurements