ABSTRACT
We describe experiments with a Naive Bayes text classifier in the context of anti-spam E-mail filtering, using two different statistical event models: a multi-variate Bernoulli model and a multinomial model. For feature selection in the multinomial event model, we introduce a family of feature ranking functions that take word frequency information into account. We present evaluation results on two publicly available corpora of legitimate and spam E-mails. We find that the multinomial model is less biased towards one class and achieves slightly higher accuracy than the multi-variate Bernoulli model.
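The difference between the two event models can be illustrated with a minimal sketch. It is not the authors' implementation: the toy messages, the function names, and the use of Laplace (add-one) smoothing are all assumptions made for illustration. The Bernoulli model scores a message by which vocabulary words are present or absent, while the multinomial model scores it by how often each word occurs.

```python
import math
from collections import Counter

def train(docs, labels, vocab):
    """Estimate priors and per-class word parameters for both event models."""
    model = {"vocab": sorted(vocab), "prior": {}, "bern": {}, "multi": {}}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        model["prior"][c] = len(class_docs) / len(docs)
        # Bernoulli: fraction of class documents containing the word.
        # Laplace smoothing is an assumption of this sketch.
        doc_freq = Counter(w for d in class_docs for w in set(d))
        model["bern"][c] = {
            w: (1 + doc_freq[w]) / (2 + len(class_docs)) for w in model["vocab"]
        }
        # Multinomial: share of the class's tokens accounted for by the word.
        tok_freq = Counter(w for d in class_docs for w in d if w in vocab)
        total = sum(tok_freq.values())
        model["multi"][c] = {
            w: (1 + tok_freq[w]) / (len(vocab) + total) for w in model["vocab"]
        }
    return model

def log_posterior_bernoulli(model, doc, c):
    """Unnormalized log posterior; every vocabulary word contributes,
    whether it is present in the message or absent from it."""
    present = set(doc)
    score = math.log(model["prior"][c])
    for w in model["vocab"]:
        p = model["bern"][c][w]
        score += math.log(p) if w in present else math.log(1.0 - p)
    return score

def log_posterior_multinomial(model, doc, c):
    """Unnormalized log posterior; only occurring words contribute,
    each weighted by its count in the message."""
    score = math.log(model["prior"][c])
    for w, n in Counter(doc).items():
        if w in model["multi"][c]:
            score += n * math.log(model["multi"][c][w])
    return score

def classify(model, doc, scorer):
    """Return the class with the highest score under the given event model."""
    return max(model["prior"], key=lambda c: scorer(model, doc, c))

# Hypothetical toy corpus, already tokenized.
docs = [
    ["free", "money", "free"],
    ["cheap", "free", "offer"],
    ["meeting", "tomorrow"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]
vocab = {w for d in docs for w in d}
model = train(docs, labels, vocab)

message = ["free", "free", "offer"]
print(classify(model, message, log_posterior_bernoulli))
print(classify(model, message, log_posterior_multinomial))
```

In this sketch, repeating a word leaves the Bernoulli score unchanged but shifts the multinomial score, which is the word frequency information the abstract refers to.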


