research-article

A Weak-Region Enhanced Bayesian Classification for Spam Content-Based Filtering

Published: 02 April 2023

Abstract

This article proposes an improved Bayesian scheme that targets the region in which a Bayesian classifier may fail to identify labels correctly, and improves classification performance by handling those errors. The Bayesian method, as a probabilistic classifier, uses Bayes’ theorem to calculate the probability that an instance belongs to each class, and assigns the instance the class label with the maximum probability. In a spam detection problem, the prediction of the Bayesian classifier can be considered weak when the probabilities obtained for the spam and non-spam classes are close to each other. We therefore define a threshold that separates weak predictions from strong predictions. A hybrid strategy using a two-layer Bayesian approach is presented: basic Bayesian (BBayes) and corrected weak region Bayesian (CWRBayes), which handle strong and weak predictions, respectively. BBayes and CWRBayes share the same classification mechanism but use different feature selection mechanisms. The proposed methods are implemented and evaluated on two datasets of spam e-mails, and the results show that the proposed method performs better than the naïve Bayesian baseline and some other Bayesian variants.
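
The two-layer idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature selection mechanisms, the threshold value, or the data, so the vocabularies, the `threshold` of 0.2, the toy e-mails, and the names `NaiveBayes` and `two_layer_predict` are all assumptions made for illustration. Both layers use the same multinomial naive Bayes classifier with Laplace smoothing and differ only in the vocabulary they are allowed to see.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing over a fixed vocabulary."""

    def __init__(self, vocabulary):
        # The vocabulary stands in for a feature selection step (assumption).
        self.vocabulary = set(vocabulary)

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            docs_c = [d for d, y in zip(documents, labels) if y == c]
            self.log_prior[c] = math.log(len(docs_c) / len(documents))
            counts = Counter(w for d in docs_c for w in d if w in self.vocabulary)
            total = sum(counts.values()) + len(self.vocabulary)
            self.log_likelihood[c] = {
                w: math.log((counts[w] + 1) / total) for w in self.vocabulary
            }
        return self

    def posterior(self, document):
        # Log-space scores, then normalised into class probabilities.
        scores = {
            c: self.log_prior[c]
            + sum(self.log_likelihood[c][w] for w in document if w in self.vocabulary)
            for c in self.classes
        }
        top = max(scores.values())
        unnormalised = {c: math.exp(s - top) for c, s in scores.items()}
        norm = sum(unnormalised.values())
        return {c: v / norm for c, v in unnormalised.items()}


def two_layer_predict(base, corrector, document, threshold):
    """Return (label, region); weak predictions are deferred to the second layer."""
    probs = base.posterior(document)
    ranked = sorted(probs.values(), reverse=True)
    if ranked[0] - ranked[1] >= threshold:
        return max(probs, key=probs.get), "strong"
    probs = corrector.posterior(document)
    return max(probs, key=probs.get), "weak"


# Hypothetical toy data, already tokenised.
train_docs = [
    ["win", "cash", "now"],
    ["cheap", "cash", "offer"],
    ["meeting", "agenda", "now"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]

# Same classifier, different (hypothetical) feature sets for each layer.
base = NaiveBayes({"cash", "meeting"}).fit(train_docs, train_labels)
corrector = NaiveBayes(
    {"win", "cheap", "offer", "agenda", "project", "notes", "now"}
).fit(train_docs, train_labels)

print(two_layer_predict(base, corrector, ["cash", "cash"], threshold=0.2))
print(two_layer_predict(base, corrector, ["offer", "now"], threshold=0.2))
```

In this toy setting the first e-mail is decided by the first layer because its class probabilities are far apart, while the second contains none of the first layer's features, yields equal probabilities for both classes, and is therefore passed to the second layer.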


    Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3
      March 2023
      570 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3579816

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 2 April 2023
      • Online AM: 11 July 2022
      • Accepted: 3 January 2022
      • Revised: 15 November 2021
      • Received: 26 March 2021