Abstract
Phishing is a plague in cyberspace. Typically, phish detection methods either use human-verified URL blacklists or exploit Web page features via machine learning techniques. However, the former is weak at catching new phish, and the latter suffers from a scarcity of effective features and a high false positive rate (FP). To alleviate these problems, we propose a layered anti-phishing solution that aims at (1) exploiting the expressiveness of a rich set of features with machine learning to achieve a high true positive rate (TP) on novel phish, and (2) limiting the FP to a low level via filtering algorithms.
Specifically, we propose CANTINA+, the most comprehensive feature-based approach in the literature, which includes eight novel features and exploits the HTML Document Object Model (DOM), search engines, and third-party services with machine learning techniques to detect phish. Moreover, we design two filters to help reduce FP and achieve runtime speedup. The first is a near-duplicate phish detector that uses hashing to catch highly similar phish. The second is a login form filter, which directly classifies Web pages with no identified login form as legitimate.
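The two filters can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the whitespace-stripping normalization, the choice of SHA-1 over the page's HTML, the rule that a login form is a password input inside a form, and the names `page_fingerprint`, `has_login_form`, and `layered_classify` are all hypothetical choices made for this example. The `classifier` argument stands in for the feature-based machine learning model.

```python
import hashlib
import re
from html.parser import HTMLParser


def page_fingerprint(html: str) -> str:
    """Hash the page after stripping all whitespace (assumed normalization)."""
    normalized = re.sub(r"\s+", "", html)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class _LoginFormFinder(HTMLParser):
    """Flags a page that has a password input inside a <form> element."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0:
            # Attribute values can be None for bare attributes.
            if (dict(attrs).get("type") or "").lower() == "password":
                self.found = True

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def has_login_form(html: str) -> bool:
    finder = _LoginFormFinder()
    finder.feed(html)
    return finder.found


def layered_classify(html, known_phish_hashes, classifier):
    """Apply the two filters, then fall back to a hypothetical classifier."""
    # Filter 1: a fingerprint match with a known phish short-circuits to "phish".
    if page_fingerprint(html) in known_phish_hashes:
        return "phish"
    # Filter 2: pages without a login form are labeled legitimate outright.
    if not has_login_form(html):
        return "legitimate"
    # Otherwise defer to the feature-based model (passed in as a callable).
    return classifier(html)
```

Both filters return before the classifier runs, which is where the runtime speedup in this sketch would come from: feature extraction is only needed for pages that pass both checks.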
We extensively evaluated CANTINA+ with two methods on a diverse spectrum of corpora with 8118 phish and 4883 legitimate Web pages. In the randomized evaluation, CANTINA+ achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 0.4% FP with 10% training phish. In the time-based evaluation, CANTINA+ also achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 1.4% FP under 20% training phish with a two-week sliding window. Capable of achieving 0.4% FP and over 92% TP, CANTINA+ has been demonstrated to be a competitive anti-phishing solution.
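To make the two reported metrics concrete, the sketch below computes them under the usual definitions: TP is the fraction of phish that are flagged as phish, and FP is the fraction of legitimate pages that are wrongly flagged. The function name `tp_fp_rates` and the toy labels are assumptions for illustration only and are not taken from the evaluation data.

```python
def tp_fp_rates(y_true, y_pred):
    """Return (TP rate, FP rate); 1 marks phish, 0 marks legitimate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_phish = sum(1 for t in y_true if t == 1)
    n_legit = len(y_true) - n_phish
    if n_phish == 0 or n_legit == 0:
        raise ValueError("need at least one phish and one legitimate page")
    return tp / n_phish, fp / n_legit


# Toy labels for illustration only; these are not the paper's data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
tp_rate, fp_rate = tp_fp_rates(y_true, y_pred)
```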
CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites