
Ads-portal domains: Identification and measurements

Published: 29 April 2010

Abstract

An ads-portal domain is a Web domain that shows only advertisements, served by a third-party advertisement syndication service, in the form of an ads listing. We develop a machine-learning-based classifier that identifies ads-portal domains with 96% accuracy. We use this classifier to measure the prevalence of ads-portal domains on the Internet. Surprisingly, 28.3% of the (two-level) *.com web domains and 25% of the *.net web domains are ads-portal domains. Also, 41% of the *.com and 39.8% of the *.net ads-portal domains are typos of well-known domains, also known as typo-squatting domains. In addition, we use the classifier along with DNS trace files to estimate how often Internet users visit ads-portal domains. It turns out that ∼5% of the two-level *.com, *.net, *.org, *.biz, and *.info web domains in the traces are ads-portal domains, and ∼50% of these accessed ads-portal domains are typos. These numbers show that ads-portal domains and typo-squatting ads-portal domains are prevalent on the Internet and successful in attracting many visits. Our classifier represents a step towards better categorization of web documents. It can also help search engines' ranking algorithms, help identify web spam that redirects to ads-portal domains, and be used to discourage access to typo-squatting ads-portal domains.
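The abstract does not say how a domain is judged to be a typo of a well-known domain. As a purely illustrative sketch, one common approach is to reduce each hostname to its two-level form and flag names within a small edit distance of a known name under the same top-level domain. The function names, the distance threshold of 1, and the sample inputs below are all assumptions made for illustration, not the authors' method.

```python
def two_level_domain(hostname: str) -> str:
    """Reduce a hostname to its last two labels, e.g. 'www.example.com' -> 'example.com'."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_typo_candidate(domain: str, well_known: list[str], max_distance: int = 1) -> bool:
    """Hypothetical check: same TLD, different name, name within max_distance edits.

    The threshold and the comparison rule are assumptions, not taken from the paper.
    """
    name, _, tld = two_level_domain(domain).partition(".")
    for target in well_known:
        t_name, _, t_tld = two_level_domain(target).partition(".")
        if tld == t_tld and name != t_name and edit_distance(name, t_name) <= max_distance:
            return True
    return False
```

For example, `is_typo_candidate("googkle.com", ["google.com"])` flags the domain because a single inserted character separates the two names. A check like this only produces candidates; deciding whether a candidate is actually an ads-portal domain would still require a content classifier such as the one the abstract describes.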

