
SpotSpam: Intention Analysis–driven SMS Spam Detection Using BERT Embeddings

Published: 19 September 2022

Abstract

Short Message Service (SMS) is one of the most widely used mobile applications for global communication, for both personal and business purposes. Its widespread use for customer interaction, business updates, and reminders has made “Text Marketing” a billion-dollar industry. Along with valid SMS, a tsunami of spam messages also appears, serving various purposes for the sender, and the majority of them are fraudulent. Filtering spam SMS accurately is a crucial and challenging task that will benefit people both mentally and economically. The challenges in filtering spam SMS include the small number of characters per message, texts in informal language, the lack of public SMS spam corpora, and so on. Focusing solely on the textual features of the SMS is a major handicap of the existing methods, because such methods cannot adapt dynamically to the increasing number of new keywords and jargon. In this article, we develop an intention-based approach to SMS spam filtering that efficiently handles dynamic keywords by focusing on the semantics of the words. We capture both semantic and textual features of the short-text messages based on 13 pre-defined intention labels. Moreover, the contextual embeddings of the texts are generated using various pre-trained NLP (Natural Language Processing) models. Finally, intention scores are computed for the pre-defined labels, and several supervised learning classifiers are employed to filter each message as spam or ham. Our approaches are evaluated on the SMS Spam Collection [24] benchmark dataset, and extensive experimentation shows interesting results. Our model achieves an accuracy of 98.07%, with precision and recall of approximately 0.97, which is better than a few of the existing state-of-the-art alternatives. Though the accuracy of our approach is not the best among existing approaches, the model is highly stable due to its emphasis on extracting the contextual features from the text through intention labels.
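The pipeline outlined in the abstract (embed the message, score it against intention labels, classify on the scores) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: a bag-of-words vector stands in for the pre-trained contextual embeddings, the three labels and their keyword descriptions are hypothetical (the abstract does not list the 13 labels), cosine similarity is one plausible way to compute an intention score, and a nearest-centroid rule stands in for the supervised classifiers.

```python
import math
from collections import Counter

# Hypothetical intention labels with keyword descriptions. The article uses
# 13 pre-defined labels that are not listed in the abstract.
INTENT_LABELS = {
    "prize": "win free prize cash reward claim",
    "urgency": "urgent now immediately expires today",
    "personal": "hi see you tonight home dinner love",
}

def embed(text):
    """Toy stand-in for a contextual embedding: a sparse bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def intention_scores(message):
    """One score per intention label, in the order of INTENT_LABELS."""
    vec = embed(message)
    return [cosine(vec, embed(desc)) for desc in INTENT_LABELS.values()]

def fit_centroids(examples):
    """Average the intention-score vectors of each class (nearest-centroid model)."""
    sums, counts = {}, Counter()
    for message, label in examples:
        scores = intention_scores(message)
        acc = sums.setdefault(label, [0.0] * len(scores))
        for i, score in enumerate(scores):
            acc[i] += score
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(message, centroids):
    """Assign the class whose centroid is closest to the message's score vector."""
    scores = intention_scores(message)
    return min(centroids, key=lambda label: math.dist(scores, centroids[label]))

# Tiny made-up training set, for illustration only.
TRAIN = [
    ("claim your free cash prize now", "spam"),
    ("urgent win reward today", "spam"),
    ("see you at home tonight", "ham"),
    ("hi dinner tonight love you", "ham"),
]
CENTROIDS = fit_centroids(TRAIN)

print(classify("free prize claim today", CENTROIDS))
print(classify("see you at dinner", CENTROIDS))
```

Because the classifier only sees the intention scores, swapping the toy `embed` function for a real sentence encoder would leave the rest of the sketch unchanged, which mirrors the separation between embedding and classification that the abstract describes.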

REFERENCES

  [1] Abayomi-Alli Olusola, Misra Sanjay, Abayomi-Alli Adebayo, and Odusami Modupe. 2019. A review of soft techniques for SMS spam classification: Methods, approaches and applications. Eng. Applic. Artif. Intell. 86 (2019), 197–212.
  [2] Abdulhamid Shafi’I Muhammad, Latiff Muhammad Shafie Abd, Chiroma Haruna, Osho Oluwafemi, Abdul-Salaam Gaddafi, Abubakar Adamu I., and Herawan Tutut. 2017. A review on mobile SMS spam filtering techniques. IEEE Access 5 (2017), 15650–15666.
  [3] Aggarwal Charu C. and Zhai ChengXiang. 2012. Mining Text Data. Springer.
  [4] Tiago A. Almeida, José María G. Hidalgo, and Akebo Yamakami. 2011. Contributions to the study of SMS spam filtering: New collection and results. In Proceedings of the 11th ACM Symposium on Document Engineering (DocEng’11). Association for Computing Machinery, New York, NY, USA, 259–262.
  [5] Blanzieri Enrico and Bryl Anton. 2009. A survey of learning-based techniques of email spam filtering. Artif. Intell. Rev. 29 (2009), 63–92.
  [6] Gordon V. Cormack, José María Gómez Hidalgo, and Enrique Puertas Sánz. 2007. Spam filtering for short messages. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM’07). Association for Computing Machinery, New York, NY, USA, 313–320.
  [7] Cormack Gordon V., Hidalgo José María Gómez, and Sanz Enrique Puertas. 2007. Feature engineering for mobile (SMS) spam filtering. In SIGIR.
  [8] Dada Emmanuel Gbenga, Bassi Joseph Stephen, Chiroma Haruna, Abdulhamid Shafií Muhammad, Adetunmbi Adebayo Olusola, and Ajibuwa Opeyemi Emmanuel. 2019. Machine learning for email spam filtering: Review, approaches and open research problems. Heliyon 5, 6 (2019), e01802.
  [9] Delany Sarah Jane, Buckley Mark, and Greene Derek. 2012. SMS spam filtering: Methods and data. Exp. Syst. Applic. 39, 10 (2012), 9899–9908.
  [10] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  [11] IBM Cloud Education. 2020. Text mining. Retrieved from https://www.ibm.com/cloud/learn/text-mining.
  [12] Hugging Face. 2020. BERT. Retrieved from https://huggingface.co/transformers/model_doc/bert.html.
  [13] Hugging Face. 2020. DistilBERT. Retrieved from https://huggingface.co/transformers/model_doc/distilbert.html.
  [14] Hugging Face. 2020. RoBERTa. Retrieved from https://huggingface.co/transformers/model_doc/roberta.html.
  [15] Hugging Face. 2020. Summary of the tokenizers. Retrieved from https://huggingface.co/transformers/tokenizer_summary.html.
  [16] José María Gómez Hidalgo, Guillermo Cajigas Bringas, Enrique Puertas Sánz, and Francisco Carrero García. 2006. Content based SMS spam filtering. In Proceedings of the 2006 ACM Symposium on Document Engineering (DocEng’06). Association for Computing Machinery, New York, NY, USA, 107–114.
  [17] Han Jiawei, Kamber Micheline, and Pei Jian. 2011. Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann Publishers, Elsevier.
  [18] Soprano Design Limited. 2019. The power of mobile messaging. Retrieved from https://info.sopranodesign.com/the-power-of-mobile-communications.
  [19] Narayan Akshay and Saxena P. 2013. The curse of 140 characters: Evaluating the efficacy of SMS spam detection on Android. In SPSM’13.
  [20] EETelecom News. 2021. Malicious COVID-19 vaccine SMS that compromises Android phones spreading: Cyber agency. Retrieved from https://telecom.economictimes.indiatimes.com/news/malicious-covid-19-vaccine-sms-that-compromises-android-phones-spreading-cyber-agency/82522677.
  [21] Popovac Milivoje, Karanovic Mirjana, Sladojevic Srdjan, Arsenovic Marko, and Anderla Andras. 2018. Convolutional neural network based SMS spam detection. In TELFOR. 1–4.
  [22] Raj H., Wei-hong Yao, Banbhrani Santosh Kumar, and Dino Soomro Pir. 2018. LSTM based short message service (SMS) modeling for spam classification. In ICMLT.
  [23] Rajaraman Anand, Leskovec Jure, and Ullman Jeffrey D. 2014. Mining of Massive Datasets. Cambridge University Press.
  [24] UCI Machine Learning Repository. 2012. SMS spam collection dataset. Retrieved from https://archive.ics.uci.edu/ml/datasets/sms+spam+collection.
  [25] Roy Pradeep Kumar, Singh Jyoti Prakash, and Banerjee Snehasish. 2020. Deep learning to filter SMS spam. Fut. Gen. Comput. Syst. 102 (2020), 524–533.
  [26] Sinhmar Abhinav, Malhotra Vinamra, Yadav R. K., and Kumar Manoj. 2022. Spam detection using genetic algorithm optimized LSTM model. In Computer Networks and Inventive Communication Technologies, Smys S., Bestak Robert, Palanisamy Ram, and Kotuliak Ivan (Eds.). Springer Singapore, 59–72.
  [27] Sousa Gustavo, Pedronette Daniel, Papa João, and Guilherme Ivan. 2021. SMS spam detection through skip-gram embeddings and shallow networks. 4193–4201.
  [28] Uysal Alper Kursat, Gunal Serkan, Ergin Semih, and Gunal Efnan Sora. 2012. A novel framework for SMS spam filtering. In INISTA. 1–4.
  [29] Vaswani Ashish, Shazeer Noam M., Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Lukasz, and Polosukhin Illia. 2017. Attention is all you need. ArXiv: abs/1706.03762 (2017).
  [30] Wang C., Zhang Y., Chen X., Liu Zhiyu, Shi L., Chen G., Qiu F., Ying C., and Lu W. 2011. A behavior-based SMS antispam system. IBM J. Res. Devel. 54 (1 2011), 1–16.
  [31] Xia Tian and Chen Xuemin. 2021. A weighted feature enhanced hidden Markov model for spam SMS filtering. Neurocomputing 444 (2021), 48–58.
  [32] Yadav Kuldeep, Kumaraguru Ponnurangam, Goyal Atul, Gupta Ashish, and Naik Vinayak. 2011. SMSAssassin: Crowdsourcing driven mobile-based system for SMS spam filtering. Association for Computing Machinery, New York, NY.
  [33] Zhang Xipeng, Xiong Gang, Hu Yuexiang, Zhu Fenghua, Dong Xisong, and Nyberg Timo R. 2016. A method of SMS spam filtering based on AdaBoost algorithm. In WCICA. 2328–2332.

    • Published in

      ACM Transactions on the Web, Volume 16, Issue 3
      August 2022
      155 pages
      ISSN: 1559-1131
      EISSN: 1559-114X
      DOI: 10.1145/3555790

      ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 19 September 2022
      • Online AM: 24 May 2022
      • Accepted: 12 May 2022
      • Revised: 12 March 2022
      • Received: 7 October 2021

      Qualifiers

      • research-article
      • Refereed
