DOI: 10.1109/WI-IAT.2009.30

Symbiotic Data Mining for Personalized Spam Filtering

ABSTRACT

Unsolicited e-mail (spam) is a severe problem due to intrusion of privacy, online fraud, viruses, and time spent reading unwanted messages. To address this issue, Collaborative Filtering (CF) and Content-Based Filtering (CBF) solutions have been adopted. We propose a new CBF-CF hybrid approach called Symbiotic Data Mining (SDM), which aggregates distinct local filters to improve filtering at a personalized level, using collaboration while preserving privacy. We apply SDM to spam e-mail detection and compare it with a local CBF filter (i.e., Naive Bayes). Several experiments were conducted using a novel corpus based on the well-known Enron datasets mixed with recent spam. The results show that the symbiotic strategy is competitive in performance with CBF and also more robust to contamination attacks.

