research-article

CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites

Published: 01 September 2011

Abstract

Phishing is a plague in cyberspace. Typically, phish detection methods either use human-verified URL blacklists or exploit Web page features via machine learning techniques. However, the former is weak against new phish, and the latter suffers from a scarcity of effective features and a high false positive rate (FP). To alleviate those problems, we propose a layered anti-phishing solution that aims at (1) exploiting the expressiveness of a rich set of features with machine learning to achieve a high true positive rate (TP) on novel phish, and (2) limiting the FP to a low level via filtering algorithms.

Specifically, we propose CANTINA+, the most comprehensive feature-based approach in the literature, including eight novel features. It exploits the HTML Document Object Model (DOM), search engines, and third-party services with machine learning techniques to detect phish. Moreover, we design two filters to help reduce FP and achieve runtime speedup. The first is a near-duplicate phish detector that uses hashing to catch highly similar phish. The second is a login form filter, which directly classifies Web pages with no identified login form as legitimate.
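The layered flow described above can be sketched roughly as follows. This is a minimal illustration, not the CANTINA+ implementation: the abstract only says that the near-duplicate detector "uses hashing" and that pages without an identified login form are classified as legitimate. The whitespace normalization, the choice of SHA-1, the rule "a password input inside a form", the order in which the filters run, and the names `page_fingerprint`, `has_login_form`, `classify`, and `ml_classifier` are all assumptions made for the example.

```python
import hashlib
import re
from html.parser import HTMLParser


def page_fingerprint(html: str) -> str:
    """Hash a page after collapsing whitespace and lowercasing.

    The normalization and the use of SHA-1 are assumptions for this sketch.
    """
    normalized = re.sub(r"\s+", " ", html).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class _LoginFormFinder(HTMLParser):
    """Sets has_login_form when a password input appears inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.has_login_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0:
            # Attribute values can be None (e.g. <input type>), hence the "or".
            if (dict(attrs).get("type") or "").lower() == "password":
                self.has_login_form = True

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def has_login_form(html: str) -> bool:
    """Hypothetical login form test: a password field nested in a form."""
    finder = _LoginFormFinder()
    finder.feed(html)
    return finder.has_login_form


def classify(html, known_phish_hashes, ml_classifier):
    """Apply the two filters, then fall back to the learned classifier.

    ml_classifier is a placeholder callable returning True for phish.
    """
    # Filter 1: a page whose hash matches a known phish is flagged at once.
    if page_fingerprint(html) in known_phish_hashes:
        return "phish"
    # Filter 2: a page with no identified login form is treated as legitimate.
    if not has_login_form(html):
        return "legitimate"
    # Only the remaining pages reach the feature-based classifier.
    return "phish" if ml_classifier(html) else "legitimate"
```

Both filters return before any feature extraction takes place, which is one plausible reading of how they could yield the runtime speedup mentioned above; the login form filter additionally removes a class of pages on which the classifier could otherwise produce false positives.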

We extensively evaluated CANTINA+ with two methods on a diverse spectrum of corpora with 8118 phish and 4883 legitimate Web pages. In the randomized evaluation, CANTINA+ achieved over 92% TP on unique testing phish and over 99% TP on near-duplicate testing phish, and about 0.4% FP with 10% training phish. In the time-based evaluation, CANTINA+ also achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 1.4% FP under 20% training phish with a two-week sliding window. Capable of achieving 0.4% FP and over 92% TP, our CANTINA+ has been demonstrated to be a competitive anti-phishing solution.
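For readers less familiar with the two metrics, TP and FP here are rates: the fraction of phish that are flagged and the fraction of legitimate pages that are wrongly flagged. The sketch below shows the standard computation on toy labels; the function name `tp_fp_rates` and the sample data are illustrative only and do not come from the paper's corpora.

```python
def tp_fp_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate).

    y_true and y_pred are equal-length sequences in which a truthy value
    means "phish" and a falsy value means "legitimate".
    """
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for truth, pred in pairs if truth and pred)
    false_positives = sum(1 for truth, pred in pairs if not truth and pred)
    n_phish = sum(1 for truth in y_true if truth)
    n_legit = len(y_true) - n_phish
    # Guard against empty classes so the toy function never divides by zero.
    tp_rate = true_positives / n_phish if n_phish else 0.0
    fp_rate = false_positives / n_legit if n_legit else 0.0
    return tp_rate, fp_rate


# Toy example: 4 phish (3 caught) and 5 legitimate pages (1 wrongly flagged).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0, 0]
print(tp_fp_rates(labels, predictions))
```

Because the two rates use different denominators, a low FP on the legitimate pages says nothing by itself about TP on the phish, which is why the abstract reports both.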



      • Published in

        ACM Transactions on Information and System Security, Volume 14, Issue 2
        September 2011
        199 pages
        ISSN:1094-9224
        EISSN:1557-7406
        DOI:10.1145/2019599

        Copyright © 2011 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 September 2011
        • Accepted: 1 May 2011
        • Revised: 1 December 2010
        • Received: 1 May 2010