DOI: 10.3115/1067807.1067848

A comparison of event models for Naive Bayes anti-spam e-mail filtering

ABSTRACT

We describe experiments with a Naive Bayes text classifier in the context of anti-spam E-mail filtering, using two different statistical event models: a multi-variate Bernoulli model and a multinomial model. We introduce a family of feature ranking functions for feature selection in the multinomial event model that take word frequency information into account. We present evaluation results on two publicly available corpora of legitimate and spam E-mails. We find that the multinomial model is less biased towards one class and achieves slightly higher accuracy than the multi-variate Bernoulli model.
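
The difference between the two event models named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy messages, the Laplace smoothing, and all function names (`train`, `score_bernoulli`, `score_multinomial`, `classify`) are assumptions made for illustration only. The multi-variate Bernoulli score looks only at whether each vocabulary word is present or absent, while the multinomial score adds one term per word occurrence, so repeated words change the result.

```python
import math
from collections import Counter

def train(docs, labels, vocab):
    """Estimate Laplace-smoothed parameters for both event models (illustrative)."""
    model = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n_docs = len(class_docs)
        # Bernoulli: smoothed fraction of class documents containing each word.
        doc_freq = Counter(w for d in class_docs for w in set(d) if w in vocab)
        p_bern = {w: (1 + doc_freq[w]) / (2 + n_docs) for w in vocab}
        # Multinomial: smoothed share of class word occurrences for each word.
        term_freq = Counter(w for d in class_docs for w in d if w in vocab)
        total = sum(term_freq.values())
        p_mult = {w: (1 + term_freq[w]) / (len(vocab) + total) for w in vocab}
        model[c] = {
            "prior": math.log(n_docs / len(docs)),
            "bernoulli": p_bern,
            "multinomial": p_mult,
        }
    return model

def score_bernoulli(doc, prior, p_bern):
    """Log score using presence/absence of every vocabulary word."""
    present = set(doc)
    return prior + sum(
        math.log(p if w in present else 1.0 - p) for w, p in p_bern.items()
    )

def score_multinomial(doc, prior, p_mult):
    """Log score with one term per word occurrence (frequency-sensitive)."""
    return prior + sum(math.log(p_mult[w]) for w in doc if w in p_mult)

def classify(doc, model, event_model="multinomial"):
    """Return the class with the highest log score under the chosen event model."""
    scorer = score_multinomial if event_model == "multinomial" else score_bernoulli
    return max(
        model,
        key=lambda c: scorer(doc, model[c]["prior"], model[c][event_model]),
    )

# Hypothetical toy data, not taken from the corpora used in the paper.
docs = [
    ["cheap", "pills", "cheap"],
    ["meeting", "tomorrow"],
    ["cheap", "offer"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "legit", "spam", "legit"]
vocab = {"cheap", "pills", "offer", "meeting", "tomorrow", "project", "notes"}
model = train(docs, labels, vocab)

print(classify(["cheap", "cheap", "offer"], model, "multinomial"))
print(classify(["meeting", "notes"], model, "bernoulli"))
```

In this sketch, repeating a word in a message leaves the Bernoulli score unchanged but lowers or raises the multinomial score depending on the class, which is the word frequency information the abstract refers to.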

Published in

EACL '03: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
April 2003, 394 pages
ISBN: 1333567890

Publisher: Association for Computational Linguistics, United States

Publication History

Published: 12 April 2003

Qualifiers: Article

Acceptance Rates

Overall Acceptance Rate: 128 of 431 submissions, 30%
