Data Augmentation for Smishing Detection: A Theory-based Prompt Engineering Approach

Smishing, which refers to social engineering attacks delivered through mobile devices such as smartphones, poses significant threats, yet limited data hinder the development of effective countermeasures. To tackle this, we propose a novel prompt engineering method for data augmentation in smishing detection. Distinguished by its utilization of insights from social science on smishing mechanisms, our approach offers a promising avenue for improving machine learning models in combating smishing attacks.


OVERVIEW
Smishing, a portmanteau of "SMS" (Short Message Service) and "phishing," is one of the dominant forms of social engineering attack, using mobile text messages to deceive individuals and illegally obtain information [6]. A smishing message generally comprises text, URLs, self-answering links (SALs), and contact information, such as phone numbers or email addresses, that help convince targets that the message is authentic and, consequently, lead them to provide the desired information [2,3].
The primary objectives of smishing are to pilfer user credentials, deploy malicious software on mobile devices, or execute other harmful actions [8]. The harm inflicted on victims of smishing takes various forms, including the disclosure of private information to malicious actors and the manipulation of human emotions such as curiosity, fear, and empathy [5,6].
Unlike other types of social engineering attacks, smishing has unique characteristics that render its targets more vulnerable. Specifically, smishing messages are delivered through mobile devices, where small screens, limited user awareness, and frequent credential input heighten the vulnerability of targets [4]. Therefore, it is crucial to provide potential victims with preventive mechanisms that can enhance their resilience to smishing.
In recent literature, machine learning with labeled smishing data has emerged as a promising mechanism for preventing the harm that smishing can cause [1]. This approach computationally identifies the semantic and syntactic characteristics of smishing, enabling messages with malicious intent to be differentiated from those with benign intent [1]. However, prior studies of smishing detection have faced a significant challenge in developing machine learning models: the shortage of training data [2,3].
In particular, prior studies in smishing detection have encountered difficulties in distinguishing smishing messages from spam messages [2]. Spam messages represent another type of social engineering attack that, unlike smishing, is merely irritating and does not inherently carry malicious intent [4]. Given that smishing and spam messages exhibit similar linguistic structures, it is exceedingly challenging to distinguish the two with limited data [2].
In this study, to tackle the issue of limited data in smishing detection, we propose a novel data augmentation approach. Specifically, we leverage large language models (LLMs) and the prompt learning approach for data generation. While there have been a few studies with a similar concept of augmenting data through prompt learning and LLMs [7], our approach stands out because it employs prompt designs based on psychological theories, i.e., the principles of persuasion [5,9], which explain the mechanism behind smishing [5]. We expect that these theories will help us craft better prompts than ad-hoc designs, thereby enabling us to harness the full potential of LLMs for data augmentation.

PROPOSED METHOD
The overview of our data augmentation process is illustrated in Figure 1. First, we plan to use the dataset published by Mishra and Soni [4], who collected three types of messages transmitted via mobile phones: smishing, spam, and ham messages. Among these, we will exclude ham messages (i.e., normal messages) because our primary objective is to develop a data augmentation approach that helps distinguish smishing from spam messages.
Figure 1. Methodological Framework
As mentioned above, we employ a prompt-based approach, utilizing an LLM, for data augmentation, which has the following input-output structure:
Input: $\{R, [], T_p, [], E_1, [], E_2, \ldots, [], E_n\}$; Output: $\{S_1, [], S_2, \ldots, [], S_m\}$,
where $R$ denotes a prompt that assigns a role to an LLM (as either smishing generator or phishing generator), $T_p$ refers to a prompt that explains a specific persuasion type to be employed by an LLM for generating new samples, $E_i$ is a prompt that provides an LLM with an example of smishing (or phishing) messages, and $S_j$ is a newly generated sample. Note that the ad-hoc data augmentation in Figure 1 excludes $T_p$ from the input.
A key to generating proper new samples in our proposed approach is $T_p$, i.e., the theory-based prompts. As a kernel theory, we employ the principles of persuasion [9]. In general, social engineering attacks, such as smishing, capitalize on social psychological triggers to promote the process of persuasion. Specifically, Ferreira and Teles [5] have suggested the following five components that influence the effectiveness of persuasion in a real-world context.
(1) Authority: This refers to individuals' inclination to be easily persuaded by the authority of an expert.
(2) Social proof: Individuals tend to believe what the majority of people do or seem to believe.
(3) Liking, similarity, and deception: People tend to follow others whom they know, like, or are familiar with.
(4) Distraction: Individuals have a tendency to focus on what they can gain or lose during strong emotional states.
(5) Reciprocation and commitment, integrity: People often hold the belief that when someone makes a commitment, it will be honored, and they feel compelled to reciprocate the commitment with a favor.
We plan to annotate existing data into one of these components and utilize the labeled data to generate prompts (i.e., $T_p$) as shown in Figure 1. Through this approach, we contend that we can systematically capture the nuances inherent in the crafting and delivery of smishing messages. That is, by incorporating the multiple theoretical components utilized in the creation of real-world smishing messages into our prompts, we expect that our newly generated samples will provide a more comprehensive representation of smishing messages.
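To make the input structure concrete, the following is a minimal sketch of how the role prompt, the theory-based prompt, and the example messages might be assembled into one LLM input, assuming the annotated data are available as simple records. All names here (`LabeledMessage`, `select_examples`, `build_input`, `THEORY_PROMPTS`, `SEPARATOR`) are hypothetical and not part of the proposed method; the separator string stands in for the "[]" token in the input structure, and the theory prompt texts are paraphrases of the five components listed in this section. The call to an LLM is left out on purpose.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a literal "[]" token separates the prompt components.
SEPARATOR = " [] "

# Assumption: theory-based prompts paraphrase the five persuasion components.
THEORY_PROMPTS = {
    "authority": "Authority: people are easily persuaded by the authority of an expert.",
    "social_proof": "Social proof: people tend to believe what the majority does or seems to believe.",
    "liking": "Liking, similarity, and deception: people follow others they know, like, or find familiar.",
    "distraction": "Distraction: in strong emotional states, people focus on what they can gain or lose.",
    "reciprocation": "Reciprocation and commitment: people expect commitments to be honored and feel compelled to return a favor.",
}


@dataclass
class LabeledMessage:
    text: str       # message body from the source dataset
    label: str      # message type, e.g. "smishing" or "spam" (ham excluded upstream)
    principle: str  # annotated persuasion component, a key of THEORY_PROMPTS


def select_examples(messages: List[LabeledMessage], label: str,
                    principle: str, k: int = 3) -> List[str]:
    """Return up to k example texts that share a label and an annotated principle."""
    matches = [m.text for m in messages
               if m.label == label and m.principle == principle]
    return matches[:k]


def build_input(role_prompt: str, examples: List[str],
                theory_prompt: Optional[str] = None) -> str:
    """Join role prompt, optional theory prompt, and examples with the separator.

    Passing theory_prompt=None corresponds to the ad-hoc variant, which
    leaves the theory-based prompt out of the input.
    """
    parts = [role_prompt]
    if theory_prompt is not None:
        parts.append(theory_prompt)
    parts.extend(examples)
    return SEPARATOR.join(parts)


if __name__ == "__main__":
    # Placeholder records; real text would come from the annotated dataset.
    data = [
        LabeledMessage("<example message 1>", "smishing", "authority"),
        LabeledMessage("<example message 2>", "spam", "authority"),
        LabeledMessage("<example message 3>", "smishing", "authority"),
    ]
    shots = select_examples(data, "smishing", "authority", k=2)
    print(build_input("<role prompt>", shots, THEORY_PROMPTS["authority"]))
    print(build_input("<role prompt>", shots))  # ad-hoc variant
```

Keeping the theory prompt as an optional argument lets the same function produce both the theory-based and the ad-hoc inputs, so the two augmentation conditions differ only in that one component.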