ABSTRACT
Unsolicited e-mail (spam) is a severe problem: it intrudes on privacy, enables online fraud, spreads viruses, and wastes the time of users who must read unwanted messages. Two families of solutions have been adopted to address it: Collaborative Filtering (CF) and Content-Based Filtering (CBF). We propose a new hybrid CBF-CF approach called Symbiotic Data Mining (SDM), which aggregates distinct local filters in order to improve filtering at a personalized level, using collaboration while preserving privacy. We apply SDM to spam e-mail detection and compare it with a local CBF filter (i.e., Naive Bayes). Several experiments were conducted using a novel corpus based on the well-known Enron datasets mixed with recent spam. The results show that the symbiotic strategy is competitive in performance with CBF and is more robust to contamination attacks.
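To make the idea of aggregating local filters concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how SDM combines filters, so the weighted average used here is purely an assumption, as are the names LocalNaiveBayes, symbiotic_score and weights and the toy training messages. Each user trains a small Naive Bayes filter on private mail, and only the filters' spam probabilities are combined, so raw messages never leave their owner.

```python
import math
from collections import Counter


class LocalNaiveBayes:
    """Tiny multinomial Naive Bayes over word tokens (hypothetical local CBF filter)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Laplace-smoothed class prior, then smoothed per-word likelihoods.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            denom = sum(self.word_counts[label].values()) + len(vocab) + 1
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            log_score[label] = score
        # Turn the two log scores into P(spam | text).
        return 1.0 / (1.0 + math.exp(log_score["ham"] - log_score["spam"]))


def symbiotic_score(text, filters, weights):
    """Assumed aggregation rule: weighted average of the local spam probabilities."""
    total = sum(weights)
    return sum(w * f.spam_probability(text) for f, w in zip(filters, weights)) / total


# Two users train on their own (toy) mailboxes; the data stays local.
alice, bob = LocalNaiveBayes(), LocalNaiveBayes()
alice.train("cheap pills buy now", "spam")
alice.train("meeting agenda for monday", "ham")
bob.train("win money now", "spam")
bob.train("lunch on friday", "ham")

# Alice weights her own filter more heavily than Bob's (the personalization step).
spam_like = symbiotic_score("buy cheap pills now", [alice, bob], [0.7, 0.3])
ham_like = symbiotic_score("meeting agenda monday", [alice, bob], [0.7, 0.3])
print(spam_like > 0.5, ham_like < 0.5)
```

In this sketch the weights are what make the result personal: each user could choose how much to trust each shared filter, and down-weighting a filter trained on contaminated mail would limit its influence on the combined score.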
Symbiotic Data Mining for Personalized Spam Filtering

Paulo Cortez
