FS3: Few-Shot and Self-Supervised Framework for Efficient Intrusion Detection in Internet of Things Networks

Securing the Internet of Things is critical for its successful deployment in various industries. While Machine Learning techniques have shown promise for intrusion detection in the Internet of Things, existing methods require large amounts of labeled training data; moreover, they encounter challenges with the presence of extreme class imbalance, i.e., some classes are underrepresented in the datasets used. Supervised methods rely on extensive labeled data, which can be costly and time-consuming to obtain. Class imbalance in datasets further exacerbates the challenge by skewing the model's learning process toward the majority classes, leading to poor detection of attacks belonging to minority classes. This issue is particularly pronounced in Internet of Things environments due to diverse devices and the varying frequency of intrusions targeting them. To overcome these challenges, we present a Few-Shot and Self-Supervised framework, called FS3, for detecting intrusions in IoT networks. FS3 works in three phases. The first phase employs self-supervised learning to learn latent patterns and robust representations from unlabeled data. The second phase introduces Few-shot learning with contrastive training. Few-shot learning enables the model to learn from a few labeled examples, thereby eliminating the dependency on a large amount of labeled data. Contrastive training addresses the class imbalance issue by improving the discriminative power of the model. The third phase introduces a novel K-Nearest neighbor algorithm that sub-samples the majority class instances to further reduce imbalance and improve overall performance. Experimental results based on three publicly available benchmark datasets demonstrate the efficacy of FS3 in addressing the challenges posed by the limited availability of labeled data as well as class imbalance in datasets.
Our proposed framework FS3, utilizing only 20% of the labeled data, outperforms fully supervised state-of-the-art models by up to 42.39% and 43.95% with respect to the metrics precision and F1 score, respectively.


INTRODUCTION
The Internet of Things (IoT) has become an integral part of our lives, with billions of devices connected through wired and wireless networks. For example, the number of active IoT devices surpassed 10 billion in 2021. According to a report by Cisco [6], this number is projected to increase to over 500 billion by 2030. The proliferation of IoT devices has revolutionized various domains, ranging from healthcare to transportation, by enabling seamless connectivity and intelligent automation [11]. For instance, cell phones, thermostats, and doorbell cameras have already made a significant impact on various aspects of society, encompassing industry and everyday life. However, the rapid expansion of IoT networks has significantly increased the attack surface, and hence these networks are increasingly becoming attractive targets for malicious actors. As a prime example, a series of distributed denial of service (DDoS) attacks took place in the United States in 2016, exploiting the vulnerabilities of IoT devices through the Mirai malware [7].
Intrusion detection systems (IDSes) play a pivotal role in safeguarding IoT networks against unauthorized access, data breaches, and other security threats. Traditional IDSes often rely on signature-based approaches, which struggle to keep up with the evolving threat landscape. Furthermore, IoT devices commonly employ specialized protocols and display unique traffic patterns, and these inherent characteristics shape the intricacies of IoT network traffic. As a result, the utilization of Machine Learning (ML) has become prevalent in securing IoT devices as well as in optimizing various other aspects such as coordinating wireless devices for efficient spectrum usage [24]. ML models have demonstrated promising results in detecting anomalous patterns in network traffic and identifying potential intrusions [54]. However, the effectiveness of these models heavily relies on the availability of large amounts of labeled training data, which can be difficult to obtain in real-world IoT environments, especially for new attack vectors: annotating large datasets with accurate labels is a resource-intensive and often expensive task, especially in domains like cybersecurity, where expert knowledge is required. Additionally, datasets for IoT networks are often characterized by class imbalance, since the occurrence of intrusions targeting different device types and functionalities varies significantly. Specifically, within any given time window, we often observe two contrasting scenarios: either a significantly higher number of intrusion attempts or a significantly smaller number of such events. This inherent variability in attack occurrences is a fundamental characteristic of IoT network traffic. Moreover, when there is a sudden surge in attacks and the system was trained using balanced data, detection performance can degrade. Our intent in focusing on class-imbalanced datasets is precisely to reflect these real-world fluctuations in IoT intrusion detection. Notably, the ratio of the number of samples in the minority classes to that of the majority classes can vary from 1:100 to 1:1000, and beyond. Consequently, the minority classes representing these intrusions are often underrepresented in the training data, leading to biased model learning and suboptimal detection performance for these critical threats. The issue of imbalanced datasets is often overlooked or addressed through oversampling and undersampling techniques, which can lead to overfitting or underfitting problems. Overfitting arises from the inclusion of exact replicas of original samples, while underfitting occurs due to the inadequate data samples left by undersampling. To avoid the overfitting problem that arises from simple oversampling, adding synthetic data generated for the minority classes has been proposed [12]. Furthermore, traditional loss functions, such as cross-entropy loss, do not properly attend to minority class instances during model training, since they perform averaged-gradient updates. To deal with this challenge, methods employing specialized loss functions such as focal loss, which allow dynamically scaled gradient updates, have been used [13]. This results in down-weighting of easy instances, thereby compelling the model to concentrate on difficult misclassified examples. Nonetheless, all of these approaches are fully supervised and require a huge amount of labeled training data. Such skewed class distributions bias the model during training, ultimately affecting its performance adversely. Our objective is to reduce the influence of class imbalance in deep learning classifiers while also tackling the scarcity of large labeled training datasets in real-world scenarios.
We address these issues in this paper and propose FS3, a novel Few-Shot and Self-Supervised framework for intrusion detection in IoT networks. An overview of our proposed framework FS3 is presented in Figure 1. FS3 overcomes the limitations of state-of-the-art approaches by leveraging self-supervised learning (SSL), few-shot learning (FSL) with contrastive training, and a novel sub-sampled K-Nearest Neighbor (KNN) algorithm. In the first phase, FS3 employs SSL to extract latent patterns and robust representations from unlabeled data. By capitalizing on the inherent structure of the data, FS3 reduces the dependence on extensive labeling, alleviating the burden of manual annotation. Specifically, we leverage attentive interpretable tabular learning [3] and pre-train a tabular multilayer perceptron (TabMLP) [58] as the backbone encoder that learns robust embeddings of the categorical as well as continuous features using a masking objective. The second phase introduces FSL with contrastive training. By learning from a small number of labeled examples (e.g., 5-10 instances per class), the model becomes adaptable to dynamic IoT environments, where acquiring extensive labeled data for all possible intrusion scenarios is impractical. Our proposed approach effectively uses IoT-specific features, making our method well-suited for the detection of intrusions that are characteristic of IoT environments. Specifically, we fine-tune the encoder contrastively using the triplet loss function [61] to enhance the model's discriminative power, particularly for the minority classes. Essentially, this loss function constrains the model to learn feature representations that promote proximity among samples belonging to the same class within the feature space, while ensuring greater separation between samples belonging to different classes.
Furthermore, FS3 introduces a novel sub-sampled KNN algorithm in the third phase. This algorithm selectively sub-samples instances from the majority classes considering the distribution of the training data, reducing the class imbalance and further enhancing the performance of the model. By intelligently weighting the instances, FS3 achieves a more balanced representation of the classes, leading to improved intrusion detection capabilities across the entire spectrum of intrusion types. To optimize the process of similarity search and enhance the speed of inference, we employ Facebook AI Similarity Search (FAISS) [20] for storing the training samples (i.e., 20% of the labeled data). One of the primary motivations behind using only 20% of the training data is to reduce the labeling cost, which is a significant concern in many real-world scenarios. By effectively utilizing a smaller portion of the data, our approach addresses this practical constraint and can make machine-learning solutions more accessible to organizations with limited labeling resources. In this way, using only 20% of the training data emulates settings in which practitioners have access to only limited labeled training data. Another important consideration is the reduction in training time.
To evaluate the effectiveness of FS3, we conducted extensive experiments using three publicly available datasets from the IoT domain, namely, WUSTL-EHMS [17], WUSTL-IIoT [63], and BoT-IoT [23]. These datasets represent diverse IoT scenarios and encompass samples for a wide range of attack classes. We compare the performance of FS3 with several state-of-the-art IDSes presented in the literature: CNN-BiLSTM [45], PB-DID [59], and DBN-IDS [5]. Furthermore, we trained a range of strong baseline models that employ the traditional cross-entropy loss function, the dice loss function, and random oversampling techniques. We evaluated both binary and multi-class classification intrusion detection tasks. The experimental results demonstrate the superior performance of FS3, with significant improvements over the state-of-the-art as well as baseline models in a wide range of experimental setups. Notably, our proposed framework FS3 outperforms fully supervised approaches by achieving up to 42.39% and 43.95% improvements with respect to precision and F1 score, respectively, while utilizing only 20% of the labeled data. It is also important to emphasize that our fine-tuning phase (i.e., FSL) only used 5 and 10 labeled training examples per class. These results highlight the remarkable efficacy of FS3 in addressing the challenges posed by limited labeled data availability and data imbalance in datasets, enabling more robust and accurate intrusion detection in IoT networks. All the relevant code is available at: github.com/MultifacetedIntrusionDetection/ID-FS3.
Contributions of this paper can be summarized as follows:
• We propose a novel few-shot and self-supervised framework FS3 for intrusion detection in IoT networks, a highly critical yet underexplored area.
• FS3 effectively leverages unlabeled data and enhances the discriminative capacity of the model in scenarios wherein limited labeled data is available.
• We evaluate FS3 on three diverse publicly available datasets for both multi-class classification and binary classification tasks and show that it outperforms fully supervised state-of-the-art models with respect to precision and F1 score by a large margin.
The remainder of the paper is organized as follows. In Section 2, we provide an overview of the learning paradigms employed in this paper. Section 3 introduces our proposed framework. The experimental setup is presented in Section 4 and the results of our performance evaluation are discussed in Section 5. Section 6 discusses the related work and Section 7 concludes the paper.

PRELIMINARIES
Our proposed framework FS3 employs a number of learning paradigms. In the following, we provide a brief overview of these paradigms.

Self-Supervised Learning
Self-supervised or unsupervised learning is a setting in machine learning that involves training models on unlabeled data without explicit guidance or supervision from human-labeled examples. In the context of deep learning, this learning approach focuses on discovering patterns, structures, and representations within the data without relying on predefined labels. Pretraining models in the fields of Natural Language Processing (NLP) and Computer Vision [19, 35, 36, 47] train deep learning models on large-scale datasets with the objective of learning general-purpose representations of the input data. These pre-trained models serve as a foundation for various downstream tasks and can significantly improve performance, especially when labeled data is limited. In this work, we use TabNet [3] to perform unsupervised learning using unlabeled IoT network traffic data. The model consists of an encoder-decoder structure, where the encoder learns to capture important features from the input data, and the decoder predicts the masked or target variable based on the learned features. TabNet leverages sparse instance-wise feature selection. The model employs a sequential multi-step architecture, where each step contributes to a portion of the decision-making process based on the selected features. It incorporates nonlinear processing of the selected features and the concept of ensembling through higher dimensions and multiple steps. These design choices contribute to TabNet's ability to effectively learn from unlabeled data, capture complex relationships, and improve overall performance in various tasks. In TabNet, the same D-dimensional features f ∈ R^(B×D) are passed to each decision step, where B represents the batch size. This ensures that the model operates consistently on the input features across all decision steps within a given batch. An encoder is used to do the multi-step processing. TabNet incorporates a learnable mask, denoted as M ∈ R^(B×D), to enable a soft selection of salient
features in each step. By employing a sparse selection of the most important features, TabNet ensures that the learning capacity of each decision step is not wasted on irrelevant features. This approach enhances the model's parameter efficiency, allowing it to focus on the most relevant information and optimize its performance. Finally, a decoder is used to reconstruct the tabular features from the encoded representations produced by the encoder. The decoder component is responsible for transforming the encoded representations back into the original tabular feature space, enabling the reconstruction of the input data. We choose to utilize TabNet due to its self-supervised learning capabilities, which involve masking a portion of the elements in the dataset. This approach allows the model to gain an understanding of network traffic flow without relying on explicit labels. In the second phase of our framework, the model is further trained contrastively, which enhances its discriminative power. Finally, we introduce a sub-sampled KNN approach to make predictions and effectively detect intrusions in IoT networks. By leveraging these features of TabNet, we aim to improve the performance of our intrusion detection system.
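The masked-reconstruction idea behind this pretraining can be illustrated with a small sketch in plain numpy. It is a deliberate simplification of TabNet's actual pretraining: the zero-filling corruption, the per-row mask count, and the MSE-over-masked-cells loss are illustrative assumptions, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_masked_batch(X, mask_ratio=0.2):
    """Create a self-supervised reconstruction task: hide a random
    subset of feature cells per row and keep the originals as targets."""
    n, d = X.shape
    k = max(1, int(round(mask_ratio * d)))        # cells hidden per row
    mask = np.zeros((n, d), dtype=bool)
    for i in range(n):
        mask[i, rng.choice(d, size=k, replace=False)] = True
    X_corrupted = np.where(mask, 0.0, X)          # the encoder only sees this
    return X_corrupted, mask, X                   # targets are the originals

def reconstruction_loss(X_pred, X_true, mask):
    """MSE computed only over the masked (hidden) cells."""
    return ((X_pred - X_true) ** 2)[mask].mean()

X = rng.normal(size=(4, 6))                       # toy batch of tabular rows
X_corr, mask, target = make_masked_batch(X)
# a perfect decoder would reach zero loss on the hidden cells
print(reconstruction_loss(target, target, mask))  # 0.0
```

The pretraining signal comes entirely from the data itself: no intrusion labels are needed to define the loss.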

Contrastive Training
Contrastive Training is a machine learning technique used for training models in a way that enhances their ability to discriminate between different instances or samples [53, 57]. It is commonly applied in tasks such as representation learning, where the objective is to learn a meaningful and compact representation of data. For example, the contrastive loss function is designed to minimize the distance between embeddings of positive pairs while maximizing the distance between embeddings of negative pairs. The objective is to update the weights of our embedding model (i.e., TabNet) in a way that satisfies this condition after each pass through the network. We use contrastive learning in such a way that it can tackle data imbalance effectively in classification tasks. In particular, enabling the model to focus on the distinctive features of minority class instances improves their representation learning. In our work, instead of using the traditional contrastive loss function, we use the triplet loss function.

FAISS and KNN
The FAISS library is designed to facilitate efficient similarity searches and clustering of dense vectors. It offers a wide range of comparison operations, including L2 distance, dot product, and cosine similarity. By incorporating indexing structures like hierarchical navigable small worlds (HNSW) [31] and navigating spreading-out graphs (NSG) [15], FAISS enables highly effective searching even in collections of billions of vectors. It is primarily implemented in C++ and relies on BLAS as its main dependency. Additionally, FAISS supports GPU acceleration for faster inference, allowing for both single and multi-GPU indexing using CUDA. Its Python interface ensures compatibility with all major deep learning frameworks. We utilize FAISS to index our training dataset (i.e., 20% of the total data) in the third phase of our proposed framework. Then, we employ our custom sub-sampled KNN classifier to make predictions; it is specifically tailored to our task and performs better than the classical KNN algorithm.
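To make the retrieval semantics concrete, the sketch below reproduces in plain numpy the (distances, indices) result that an exact flat L2 index returns for a batch of queries. It illustrates what the search computes, not how FAISS implements it; function and variable names are ours.

```python
import numpy as np

def flat_l2_search(index_vectors, queries, k):
    """Exact k-nearest-neighbour search by squared L2 distance,
    mirroring the (distances, indices) output of a flat index."""
    # pairwise squared distances: |q|^2 - 2 q.x + |x|^2
    d2 = (np.sum(queries ** 2, axis=1, keepdims=True)
          - 2.0 * queries @ index_vectors.T
          + np.sum(index_vectors ** 2, axis=1))
    idx = np.argsort(d2, axis=1)[:, :k]           # k closest stored vectors
    dist = np.take_along_axis(d2, idx, axis=1)
    return dist, idx

rng = np.random.default_rng(1)
stored = rng.normal(size=(100, 8))   # e.g. embeddings of the labeled samples
dist, idx = flat_l2_search(stored, stored[:3], k=1)
print(idx.ravel())                   # each stored vector is its own nearest neighbour
```

Structures such as HNSW trade this exactness for sub-linear search time, which is what makes billion-scale collections feasible.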

FS3: INTRUSION DETECTION FRAMEWORK

Task Formulation
We represent the input data as a two-dimensional matrix in which each row corresponds to a network traffic sample and each column to one of its categorical or continuous features.

Phase 1: Self-Supervised Learning
We use TabNet [3] as the backbone model for self-supervised learning. Specifically, we use the masking objective to mask 20% of the features in the input data. The remaining features are embedded into a high-dimensional vector space and are used to reconstruct the masked features. Figure 1 (left part) illustrates the SSL process. We take the embeddings of the categorical features, along with the numerical features, as inputs. These inputs are then fed through a series of dense layers, which form the Multi-Layer Perceptron (MLP). Each dense layer consists of multiple neurons and applies a non-linear activation function to transform the input data. The output of each layer is passed as the input to the next layer until the final layer, which produces the desired output. This phase produces a pre-trained encoder trained on the unlabeled dataset. In our work, we incorporate two dense layers with a dropout rate of 0.1%. The two dense layers have 200 and 100 neurons, respectively. The encoder utilizes an embedding dimension of 100 for two of the datasets (WUSTL-IIoT and BoT-IoT), while for the third dataset (WUSTL-EHMS) it utilizes an embedding dimension of 70.
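The head described above, two dense layers of 200 and 100 units with dropout, can be sketched as follows. The ReLU activation, inverted-dropout formulation, and weight initialization are illustrative assumptions; only the layer widths and the WUSTL-EHMS input dimension of 70 come from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

def dense(x, w, b):
    """Linear layer followed by ReLU (assumed activation)."""
    return np.maximum(x @ w + b, 0.0)

def mlp_head(x, params, drop=0.1, train=False):
    """Two dense layers (200 -> 100 units) with optional inverted
    dropout, matching the head dimensions described in Phase 1."""
    for w, b in params:
        x = dense(x, w, b)
        if train:                          # dropout is active only in training
            keep = rng.random(x.shape) >= drop
            x = x * keep / (1.0 - drop)
    return x

d_in = 70                                  # e.g. WUSTL-EHMS embedding dimension
params = [(rng.normal(scale=0.05, size=(d_in, 200)), np.zeros(200)),
          (rng.normal(scale=0.05, size=(200, 100)), np.zeros(100))]
out = mlp_head(rng.normal(size=(4, d_in)), params)
print(out.shape)                           # (4, 100)
```

The 100-dimensional output of this head is the embedding used in the later contrastive and retrieval phases.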

Phase 2: Few-Shot Learning and Contrastive Training
We leverage Few-shot learning (FSL) with contrastive training to further train the pre-trained model (from Phase 1) by utilizing only a few labeled samples per class. In our framework FS3, we randomly select N labeled samples (i.e., N-shot) for each of the C classes. To train the model, we employ contrastive training that focuses on learning discriminative features for each class. Specifically, we use the triplet loss function [18, 41, 52] to train the encoder contrastively. Unlike the traditional contrastive loss function, the triplet loss function incorporates triplets within each training sample (see Figure 2). A triplet consists of an anchor data point a, a positive data point p (belonging to the same class as the anchor, i.e., y_p = y_a), and a negative data point n (belonging to a different class than the anchor, i.e., y_n ≠ y_a). By utilizing triplets, the objective is to minimize the distance between the anchor and the positive point while simultaneously maximizing the distance between the anchor and the negative point during each gradient update. We can define the triplet loss as

L(a, p, n) = max(d(a, p) − d(a, n) + m, 0),

where y_a, y_p, and y_n are the labels of the anchor, positive, and negative samples, d(a, p) is the distance between the anchor and the positive sample in the embedding space, d(a, n) is the distance between the anchor and the negative sample, and the margin m is a hyperparameter that controls the minimum separation between the anchor-positive and anchor-negative distances. In our experiments, we use the Euclidean distance to calculate d.
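A minimal numeric sketch of this objective follows, using Euclidean distance as in our experiments; the margin value and toy points are illustrative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """L = max(d(a, p) - d(a, n) + m, 0) with Euclidean distances."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])     # same-class neighbour: close to the anchor
n = np.array([3.0, 0.0])     # different class: far from the anchor
print(triplet_loss(a, p, n, margin=1.0))   # 0.0 -> triplet already satisfied
```

Note that once the negative is more than a margin farther than the positive, the triplet contributes zero loss, so gradients concentrate on the violating triplets.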
Figure 2 illustrates how training data is prepared for contrastive training of the model using FSL and provides an overview of the contrastive training using triplet loss. Specifically, the total number of training samples used to train the model is N × C, where N takes the values 5 and 10 in our experiments. We also use a miner function to identify hard pairs from the training samples [51]. We use the Multi-Similarity Miner (MSM) [33], which selects both the hardest positive and hardest negative samples within each similarity margin for each anchor. The loss function is then computed based on these selected samples. This approach effectively addresses the data imbalance between the samples from majority and minority classes, improving the overall discriminative power of the model. We conduct the contrastive training procedure five times for both the 5-Shot and 10-Shot scenarios. We report the average results of the five runs in our experiments to show the robustness of the method.
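The idea of hard-pair mining can be sketched as follows. This is a simplified stand-in for the Multi-Similarity Miner, not its actual algorithm: it simply takes, for each anchor, the furthest same-class sample and the closest different-class sample.

```python
import numpy as np

def hardest_pairs(emb, labels):
    """For each anchor, pick the furthest same-class sample (hardest
    positive) and the closest different-class sample (hardest negative).
    A simplified stand-in for multi-similarity mining."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    pos = np.where(same, d, -np.inf)
    np.fill_diagonal(pos, -np.inf)        # an anchor is not its own positive
    neg = np.where(same, np.inf, d)
    return pos.argmax(axis=1), neg.argmin(axis=1)

emb = np.array([[0.0], [0.2], [5.0], [5.1]])   # toy 1-D embeddings
labels = np.array([0, 0, 1, 1])
p_idx, n_idx = hardest_pairs(emb, labels)
print(p_idx, n_idx)   # positives: [1 0 3 2], negatives: [2 2 1 1]
```

Mining keeps the loss focused on the most informative triplets, which matters particularly for minority-class anchors with few positives.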

Phase 3: Nearest Neighbor Classification
Phase 3 of FS3 does not involve any training or fine-tuning. We employ the FAISS library to create an index for efficient and scalable retrieval of similar network traffic data. This indexed data is then utilized for making predictions. This phase only uses 20% of the training data, instead of the entire dataset. We feed the samples into the fine-tuned encoder (from Phase 2) to generate embedding vectors of the training samples. To further reduce the class imbalance issue, we introduce a sub-sampled KNN algorithm. Specifically, we define the weight of each class as follows:

W_i = max(c, 1 − sqrt(t / p_i))  if p_i > t,  and  W_i = 1 otherwise.  (2)

In Eq. 2, W_i is the weight assigned to the i-th class, t represents a hyperparameter that controls the sub-sampling process, p_i denotes the fraction of samples belonging to the i-th class, and c is a constant that defines the minimum weight assigned to each class. In our experiments, the value for c is set to 0.1. In particular, we use t to sub-sample only the majority classes: the weight for a class i is only reduced if p_i > t, so minority classes rarer than the threshold are not affected, whereas Eq. 2 ensures that as the relative number of samples of a class increases, so does the probability of reducing the weight of its instances. To illustrate, consider a dataset where t = 10^-5 and the total number of samples in the training data is 860011. If the i-th class is a minority class with 152 samples, then p_i = 152/860011 and W_i = 0.7621. If instead the i-th class is a majority class with 56379 samples, then p_i = 56379/860011 and W_i = 0.9876. In this way, a minority class can receive a weight comparable to that of a majority class, reducing the impact of imbalance in the dataset. Our proposed sub-sampled KNN algorithm provides a balanced, Goldilocks-like weighting for each class, in contrast to classical KNN, which is biased toward majority classes, and to the weighted-KNN variant that weights votes by the inverse of class size, effectively giving equal weight to each class. For our experiments using the two datasets WUSTL-IIoT and BoT-IoT, we set the value of t to 10^-5, while in the experiment using the dataset WUSTL-EHMS, we use t = 10^-4. To perform the final classification using sub-sampled KNN, we repeat the experiment five times for each shot, utilizing the five fine-tuned encoders from Phase 2. In all experiments, we use the same 20% labeled data.
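The weighting scheme can be reproduced as follows. The clamped form W_i = max(c, 1 − sqrt(t/p_i)) for p_i > t is reconstructed to match the worked numbers in the text (0.7621 for the 152-sample class and 0.9876 for the 56379-sample class) and should be read as a sketch of the rule rather than the exact implementation.

```python
import math

def class_weight(p_i, t=1e-5, c=0.1):
    """Sub-sampling weight for a class with sample fraction p_i:
    classes rarer than the threshold t keep full weight; larger
    classes are down-weighted, never below the floor c.
    Reconstructed to match the worked example in the text."""
    if p_i <= t:
        return 1.0
    return max(c, 1.0 - math.sqrt(t / p_i))

total = 860011
print(round(class_weight(152 / total), 4))     # 0.7621  (minority example)
print(round(class_weight(56379 / total), 4))   # 0.9876  (majority example)
```

During the KNN vote, these weights discount neighbours from heavily represented classes, so a minority-class neighbour counts nearly as much as a majority-class one.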

EXPERIMENTAL SETUP

Evaluation Criteria
We utilized the following quantitative metrics to assess the performance of various ML classifiers: (i) Precision (Pre), (ii) Recall (Rec), and (iii) F1-Score [22, 34]. In our experiments, we calculated the macro average for all three metrics, which is the recommended method for evaluating models on imbalanced datasets. Pre represents the algorithm's ability to predict different types of intrusions accurately. Rec indicates the proportion of actual intrusions correctly detected by the algorithm. The F1-Score is the harmonic mean of Pre and Rec, i.e., the reciprocal of the arithmetic mean of their reciprocals.

Datasets used in the Experiments
WUSTL-EHMS. The WUSTL-EHMS dataset [17] originates from a real-time Enhanced Healthcare Monitoring System (EHMS) that captures network flow metrics and patients' biometrics. It encompasses four components: medical sensors, gateways, networks, and control with visualization. Patient data is collected by sensors and transmitted through gateways, switches, and routers to the server. However, there is a risk of data interception before it reaches the server, particularly due to man-in-the-middle attacks.

CNN-BiLSTM. For the performance evaluation of CNN-BiLSTM [45], we utilized the open-source code provided at the CNN-BiLSTM repository [8].

PB-DID. To leverage the benefits of shared features between datasets, we employ PB-DID [59] by utilizing an auxiliary dataset that shares similarities with the main dataset. Our approach consists of two experiments. In the first experiment, we train PB-DID on the Bot-IoT and WUSTL-IIoT datasets, which have four common features. We merge the samples of the Normal, DoS, and Reconnaissance class types from the WUSTL-IIoT dataset into the Bot-IoT dataset. The merged dataset is used for training the model, while the original Bot-IoT testing set is used for evaluation. In the second experiment, we train PB-DID on the WUSTL-IIoT and WUSTL-EHMS datasets, which share ten common features. Similar to the first experiment, we merged the attack class samples from both datasets and the normal class samples from both datasets. Performance evaluation of PB-DID in this experiment is conducted using the provided open-source code [9].

DBN-IDS. We also compare our work with the recently published DBN-IDS [5]. We utilize the proposed architecture of the DBN model for all three datasets. The model consists of five stacked RBMs with (49, 128), (128, 256), (256, 128), (128, 128), and (128, 64) visible/hidden nodes per RBM, respectively. The output from the last RBM is connected to a fully connected layer with 5 nodes for multi-class classification and 2 nodes for binary classification, using the Softmax function. To address the data imbalance issue, we employ a combination of SMOTE and undersampling techniques. For the performance evaluation of DBN-IDS in this experiment, we utilize the provided open-source code [10].
CTGANSamp: models were trained on training datasets balanced using synthetic samples. To address data imbalance in the datasets, we also utilize synthetic instances generated using CTGAN [12, 55]. CTGAN employs a GAN-based model to generate synthetic tabular data based on the original tabular data. For the Bot-IoT training dataset, we add synthetic data using CTGAN, resulting in a total of 69215 samples for the Normal attack type and 71238 samples for the Theft attack type. However, we choose not to introduce additional synthetic samples for the DDoS, DoS, and Reconnaissance attack classes due to their already sufficiently large sample sizes. In the WUSTL-IIoT dataset, synthetic samples are added to the Reconnaissance, Command Injection, and Backdoor attack classes, resulting in 54447, 70867, and 60231 training samples, respectively. As Normal and DoS already have a sufficient number of samples, no new samples are added to them. In the case of the WUSTL-EHMS training dataset, synthetic samples were only added to the attack class, increasing the total number of samples in that class to 9472. For our experiments, we trained the models on these balanced datasets using the cross-entropy loss function. We refer to these models as FNN-CTGANSamp and CNN-CTGANSamp.

Focal: models were trained using the focal loss function. To evaluate our method, we utilize the focal loss function [13, 26], originally designed for object detection tasks. The focal loss is especially beneficial when there is a substantial class imbalance between foreground and background classes during training. In our implementation, we adopt the focal loss as a specialized loss function.
The focal loss has two hyperparameters: γ and α. During training on the Bot-IoT dataset, we set γ and α as 2 and 1 for FNN, and 5 and 5 for CNN, respectively. When training on the WUSTL-IIoT dataset, we set γ and α as 2.5 and 0.15 for FNN, and 2.0 and 0.3 for CNN, respectively. Finally, for training on the WUSTL-EHMS dataset, we set γ and α as 2 and 2 for FNN, and 2 and 0.2 for CNN, respectively.
In our experiments, we denote the models trained using the focal loss function as FNN-Focal and CNN-Focal, representing the FNN and CNN architectures, respectively.
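The down-weighting behaviour of the focal loss can be sketched per sample as follows; the probabilities used are illustrative.

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: FL = -alpha * (1 - p_t)^gamma * log(p_t),
    where p_t is the model's probability for the true class.  Easy,
    confident examples (p_t near 1) are down-weighted relative to
    plain cross-entropy -log(p_t)."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

easy, hard = 0.95, 0.10
# an easy example contributes far less than under cross-entropy,
# while a hard example keeps most of its loss
print(focal_loss(easy), -math.log(easy))
print(focal_loss(hard), -math.log(hard))
```

With γ = 2, the easy example's contribution shrinks by a factor of (1 − 0.95)² = 0.0025, which is what lets minority-class (typically harder) samples dominate the gradient.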

Baseline Models

ORG: models were trained using the original datasets (i.e., without balancing). For training these classifiers, we utilized the original training samples from the datasets and employed a traditional loss function, specifically the cross-entropy loss function.

RND: models were trained on datasets balanced using random oversampling. To address the class imbalance in the training datasets, we employed random oversampling. This technique balances the dataset by randomly duplicating samples from the minority classes. During the oversampling process, a representative sample from each subject was selected independently to maintain the integrity of the population [43]. After applying random oversampling to balance the training datasets, the number of training samples for each class type in Bot-IoT, WUSTL-IIoT, and WUSTL-EHMS became approximately 1233052, 797261, and 10275, respectively. In our experiments, we denote these models as FNN-RND and CNN-RND, respectively. Both models were trained using the cross-entropy loss function.
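Random oversampling as used for the RND baselines can be sketched as follows; the toy data and function name are illustrative.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)

def random_oversample(X, y):
    """Balance a dataset by duplicating minority-class rows at random
    until every class matches the size of the largest class."""
    counts = Counter(y.tolist())
    target = max(counts.values())
    X_out, y_out = [X], [y]
    for cls, n in counts.items():
        if n < target:
            idx = np.flatnonzero(y == cls)
            extra = rng.choice(idx, size=target - n, replace=True)
            X_out.append(X[extra]); y_out.append(y[extra])
    return np.concatenate(X_out), np.concatenate(y_out)

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 0, 0, 1])             # 4:1 imbalance
Xb, yb = random_oversample(X, y)
print(Counter(yb.tolist()))               # both classes now have 4 samples
```

Because the duplicated rows are exact replicas, this method carries the overfitting risk discussed in the introduction, which is what motivates the CTGAN-based alternative.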
Dice: models were trained using the dice loss function. To tackle data imbalance, we also employ the dice loss [44, 48], widely used in image segmentation tasks. This loss is based on the dice coefficient, which measures the similarity between predicted and ground truth segmentation masks. The dice coefficient is computed as twice the intersection of the ground truth and predicted values divided by the sum of the ground truth and predicted values, and it serves as the foundation for calculating the dice loss. In our study, we applied the dice loss function to all three original training datasets for training deep learning (DL) models in both multi-class and binary classification tasks. The dice loss is used as one of the methods for performance comparison, with a smoothing parameter of 1 × 10^-7 incorporated in the calculation. In our experiments, we refer to the models trained using the dice loss function as FNN-Dice and CNN-Dice, corresponding to the FNN and CNN architectures, respectively.

While all the baseline and state-of-the-art models are trained in a fully supervised manner, our proposed FS3 takes a different approach by utilizing only 20% of the labeled data for the final classification task. This reduction in labeled data usage distinguishes our method from the others and highlights its potential for achieving comparable or even superior performance with a significantly smaller labeled dataset.
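The dice loss used by the FNN-Dice and CNN-Dice baselines can be sketched as follows; the soft (real-valued) formulation and the toy vectors are illustrative, while the 1e-7 smoothing term matches the value stated above.

```python
import numpy as np

def dice_loss(pred, target, smooth=1e-7):
    """Soft dice loss: 1 - (2*intersection + s) / (|pred| + |target| + s).
    Close to 0 for a perfect prediction, close to 1 for a fully wrong one."""
    inter = np.sum(pred * target)
    denom = np.sum(pred) + np.sum(target)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

perfect = np.array([1.0, 0.0, 1.0])
print(dice_loss(perfect, perfect))           # ~0.0 for a perfect prediction
print(dice_loss(1.0 - perfect, perfect))     # ~1.0 for a fully wrong one
```

Because the loss depends on the overlap ratio rather than a per-sample average, a small positive class is not swamped by a large negative class, which is why it is a natural baseline for imbalanced data.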

5 RESULTS OF PERFORMANCE EVALUATION
5.1 Quantitative Analysis
Table 4 compares the performance of all competing methods, including our proposed FS3, with respect to precision, recall, and F1 score across all three datasets. For FS3, we present two few-shot configurations: 5-Shot and 10-Shot. We run the Phase 2 experiment five times, randomly selecting 5 or 10 samples from the training sets. In Phase 3, we use the same 20% of labeled data to perform sub-sampled KNN classification five times, and we report the average of the results obtained from these five iterations.

WUSTL-EHMS. FS3 shows a significant performance improvement over all the baseline and state-of-the-art models on the WUSTL-EHMS dataset with respect to all metrics. Specifically, in the AVG 5-Shot configuration, we observe substantial increases of 4% in precision, 33% in recall, and 22% in F1 score compared to FNN-ORG. AVG 10-Shot exhibits significant superiority over CNN-ORG with respect to precision, recall, and F1 score. Furthermore, when compared to specialized loss functions such as Dice and Focal loss, both AVG 5-Shot and AVG 10-Shot outperform with respect to all metrics. For instance, AVG 10-Shot achieves improvements of 1% in precision, 33% in recall, and 22% in F1 score compared to FNN-Focal. Similarly, AVG 5-Shot shows improvements of 3% in precision, 33% in recall, and 23% in F1 score compared to CNN-Focal. Under both AVG 5-Shot and AVG 10-Shot, FS3 outperforms all state-of-the-art models with respect to all metrics. Compared to CNN-BiLSTM, AVG 5-Shot exhibits a remarkable improvement of 8% in precision, 36% in recall, and 24% in F1 score. Furthermore, FS3 consistently outperforms PB-DID (F1 score of 0.4664) and DBN-IDS (F1 score of 0.7222) by a substantial margin. FS3 leverages self-supervised learning in Phase 1, allowing the encoder to gain a deep understanding of the network traffic data without relying on class labels. In Phase 2, the encoder is trained
contrastively using a small number of instances, enabling it to learn discriminative representations for the samples. Ultimately, FS3 achieves a better balance between precision and recall via the hyperparameter, resulting in improved overall performance and a notable improvement over the other competing models.

WUSTL-IIoT. Our proposed approach exhibits substantial improvements over all the baseline and state-of-the-art models on the WUSTL-IIoT dataset, particularly with respect to precision and F1 score. Specifically, in the AVG 5-Shot configuration, we observe a significant increase of 69% in precision, 63% in recall, and 53% in F1 score compared to FNN-ORG. We observe a similar trend with both FNN-CTGANSamp and CNN-CTGANSamp. Similarly, AVG 10-Shot demonstrates significant superiority over CNN-ORG in terms of precision, recall, and F1 score. Additionally, when compared to specialized loss functions such as Dice and Focal loss, both AVG 5-Shot and AVG 10-Shot outperform with respect to all metrics. For instance, AVG 10-Shot improves by 93% in precision, 140% in recall, and 123% in F1 score compared to FNN-Focal, while AVG 5-Shot improves by 8% in precision, 2% in recall, and 0.61% in F1 score compared to CNN-Focal. Both AVG 5-Shot and AVG 10-Shot outperform all state-of-the-art models with respect to all metrics. Compared to CNN-BiLSTM, AVG 5-Shot exhibits a notable improvement of 23% in precision, 56% in recall, and 37% in F1 score. Moreover, our proposed FS3 outperforms both PB-DID (F1 of 0.0214) and DBN-IDS (F1 of 0.3851) by a significant margin. Although our proposed approach may not achieve the same level of recall as CNN-RND and FNN-RND (0.7720 and 0.7630, respectively), it outperforms them with respect to precision and F1 score. The improved precision stems from Phase 2 of our approach, where we train the encoder contrastively using a small number of instances. Furthermore, by adjusting the
hyperparameter, we can achieve an optimal trade-off between precision and recall, leading to an overall enhancement in the model's performance. In contrast, random oversampling or undersampling approaches such as CNN-RND, FNN-RND, and DBN-IDS duplicate or remove existing samples, which can cause overfitting or underfitting of the model and ultimately lower the F1 score. Specialized loss functions like Dice and Focal loss have a potential drawback: they tend to prioritize the minority class, which may decrease performance on the majority class. The poor performance of the other models may be attributed to architectures that are not robust and to their reliance on traditional methods to overcome data imbalance. Based on the performance on the WUSTL-IIoT dataset, we conclude that FS3 consistently outperforms all competing models with respect to robust metrics such as precision and F1 score, and that it strikes a better balance between precision and recall.

BoT-IoT. In terms of precision and F1 score, FS3 demonstrates significant improvement over all baseline and state-of-the-art models on the BoT-IoT dataset. Specifically, in the AVG 10-Shot setting, we observe a precision improvement of 24% and an F1 score improvement of 11% compared to FNN-ORG. Similarly, FS3 AVG 10-Shot achieves precision, recall, and F1 scores that are 42%, 17%, and 43% higher, respectively, than CNN-ORG. Notably, AVG 10-Shot and AVG 5-Shot consistently outperform FNN-Dice and CNN-Dice with respect to all metrics. In a similar fashion, FS3 AVG 10-Shot outperforms FNN-Focal with respect to precision and F1 score by 13% and 4%, respectively. The improvement in precision can be attributed to the contrastive training of the encoder in FS3, which utilizes a small number of instances and allows the model to focus on extracting relevant, discriminative features. Additionally, by tuning the hyperparameter
, we can find an optimal balance between precision and recall, resulting in an overall enhancement in the model's performance. The same trend holds for CNN-Focal. Furthermore, our approach consistently outperforms the state-of-the-art models (CNN-BiLSTM, PB-DID, DBN-IDS) across all metrics. Although our proposed approach may not attain the same level of recall as FNN-CTGANSamp (0.8652), FNN-CTGANSamp exhibits poor precision and F1 score. The likely reason is that CTGANSamp adds a significant number of synthetic samples to the original training set, increasing the overhead for the model and raising the risk of overfitting.
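The contrastive training credited throughout this analysis uses a triplet loss (cf. Figure 2). A minimal sketch of the loss itself, with a margin value chosen purely for illustration:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: pull the anchor toward a same-class
    (positive) embedding and push it away from a different-class
    (negative) embedding by at least `margin` in Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

Minimizing this over mini-batches of (anchor, positive, negative) triples is what gives the encoder the discriminative representations discussed above; the actual training loop and margin used in the paper are not specified here.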

5.2 Qualitative Analysis
We draw 200 samples from the WUSTL-IIoT testing set and apply a t-SNE projection to analyze how the different methods perform on this sample. For the qualitative analysis, we specifically chose CNN-RND, CNN-Focal, and FS3 (5-Shot), which achieved F1 scores of 59.42%, 69.74%, and 74.45%, respectively, on the WUSTL-IIoT dataset.
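The projection step can be reproduced with scikit-learn's t-SNE. The 16-dimensional random input here is only a stand-in for the encoder outputs of the 200 sampled records (the actual feature dimensionality is an assumption):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-in for 200 encoded WUSTL-IIoT test samples
X = rng.normal(size=(200, 16))

# project to 2-D for visualization; perplexity must be smaller
# than the number of samples
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

The resulting `emb` array is what gets scattered and color-coded by class in Figure 3.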
In Figure 3(a), the ground truth is visualized, with each data sample in the five attack classes correctly marked using the colors red, green, blue, yellow, and black. To aid visual interpretation, we grouped the selected samples into four distinct groups, regardless of their attack class types. These groups are represented by encircled regions in four different colors: (i) blue, (ii) black, (iii) red, and (iv) green. Within the blue circle of the ground truth, the majority of samples belong to the DoS attack type, while a few are of the Command Injection type. When applying CNN-RND, all the samples within the blue circle are incorrectly classified as Backdoor. Both CNN-Focal and FS3 correctly classify the DoS samples, although CNN-Focal misses a Command Injection sample. The black circle in the ground truth contains a large number of samples of the Backdoor attack type, along with some samples of the Command Injection and Reconnaissance types. With CNN-RND, all the Backdoor samples within the black circle are misclassified as Command Injection or DoS. Similarly, CNN-Focal classifies all of them as either Normal or DoS. In contrast, FS3 predicts labels close to the ground truth, correctly identifying most of the samples as Backdoor, although it misclassifies some samples as Normal and DoS instead of Command Injection and Reconnaissance, respectively. The majority of samples within the red circle in the ground truth (Figure 3(a)) correspond to the Reconnaissance attack type, with some samples being DoS. However, in Figure 3

5.3 Ablation Study
In our implementation, we incorporate various components within our model, each of which plays a distinct role in the overall performance. It is therefore important to quantify the individual contribution of each component toward the overall model performance. This allows us to gain a deeper understanding of how each part influences the model's effectiveness and aids in the evaluation and optimization of our approach. Table 5 presents the ablation study of our proposed approach FS3. We employ different strategies for applying the KNN algorithm after each phase of our proposed method, comparing our proposed sub-sampled KNN with classical KNN (without weights) and with the inverse-of-class-size approach (weighting class types by their inverse frequency). The results obtained in Phase 2, where the encoder is trained contrastively with either a 5-shot or 10-shot approach, show a significant improvement over Phase 1 across all metrics for all the KNN strategies employed in the final classification. This improvement highlights the effectiveness of training the encoder on a few instances with the triplet loss function and its impact on the overall performance of the model. For instance, in Phase 3, utilizing our sub-sampled KNN on WUSTL-EHMS, we observe significant improvements in precision, recall, and F1 score compared to classical KNN in Phase 1.
Specifically, our proposed approach achieves approximately 13% higher precision, 23% higher recall, and 19% higher F1 score on the WUSTL-EHMS dataset. In Phase 3, when utilizing our proposed sub-sampled KNN, we observe significant improvements over classical KNN in precision and F1 score: approximately 38% in precision and 3% in F1 score. Although the recall in Phase 3 using our approach is not as high as that of classical KNN or the inverse-of-class-size strategy in Phase 2, it remains comparable to them. The notable advantage of our approach is the improved F1 score achieved using the proposed sub-sampled KNN. These results highlight the effectiveness of our method in enhancing the overall performance of intrusion detection.
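One plausible reading of the sub-sampled KNN compared above, shown as a sketch (the paper's exact sub-sampling policy is not spelled out here): randomly cap every class in the labeled reference set at the minority-class size, then run a classical majority-vote KNN on the reduced set.

```python
import numpy as np

def subsampled_knn_predict(X_train, y_train, X_query, k=5, rng=None):
    """Classical KNN after randomly sub-sampling every class in the
    reference set down to the size of the smallest class, so the
    majority class cannot dominate the neighbor vote."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y_train, return_counts=True)
    cap = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y_train == c), size=cap, replace=False)
        for c in classes
    ])
    Xr, yr = X_train[keep], y_train[keep]
    preds = []
    for q in X_query:
        nn = np.argsort(np.linalg.norm(Xr - q, axis=1))[:k]
        vals, votes = np.unique(yr[nn], return_counts=True)
        preds.append(vals[np.argmax(votes)])
    return np.array(preds)
```

Compared with inverse-class-size weighting, which reweights votes but keeps all majority samples, this variant removes majority instances outright, which matches the precision gains reported for Phase 3.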

6 RELATED WORKS
6.1 Machine Learning Frameworks
Several ML frameworks have been proposed by researchers to address security issues in IoT [2,4,27,30,56]. Yang et al. [56] developed an intelligent IoT network ML framework using Software Defined Networking (SDN) and Network Function Virtualization (NFV). The authors of [4] developed an ML framework using SDN and NFV to handle various IoT threats. Arachchige et al. [2] proposed PriModChain, a framework for ensuring the privacy of Industrial Internet of Things (IIoT) data. Liu et al. [27] presented a malicious node detection framework for handling a specific type of insider attack in IoT, the conditional packet manipulation attack. Makkar et al. [30] proposed an ML framework for detecting spam in IoT networks. In addition, Dina et al. [12] utilized a DL model to balance data by incorporating synthetic data.

6.2 Using ML for Solving Miscellaneous Problems in IoT
Roy et al. [38] propose a two-layer hierarchical intrusion detection mechanism for IoT networks that uses machine learning. This model can effectively detect intrusions while satisfying the resource constraints of IoT. By deploying multi-layered feedforward neural networks in the fog-cloud infrastructure, the model can utilize the resources in the fog layer to detect network attacks. Liang et al. [25] discuss the vulnerabilities of ML algorithms in detecting intrusions and how these algorithms can be used to launch cyberattacks. Sun et al. [49] model botnet attacks using ML. Amouri et al. [1] present an intrusion detection system for mobile IoT, while Sivananthan et al. [46] combine SDN and ML techniques to manage IoT devices. Zheng et al. [60] discuss the challenges of applying privacy-preserving ML methods developed for cloud computing systems in the context of IoT. Jha et al. [39] propose a technique for detecting unknown system vulnerabilities in IoT. Guerra et al. [16] observe that network traffic data become obsolete over time, as attackers change their tactics and behavior. Wahab et al. [50] employ Principal Component Analysis (PCA) to study changes in the variance of features across intrusion detection data streams in IoT and present an online deep neural network that dynamically adjusts the sizes of its hidden layers to cope with these changes. Khan et al. [21] point out that existing ML models used in cyber-security follow a black-box model and propose a method to address this problem.
Ferrag et al. [14] compared the performance of centralized and federated deep learning using three popular deep learning approaches and three different datasets. Zolanvari et al. [62,63] recognized the significance of machine learning (ML) and big data analytics in securing both IoT and IIoT. They built a real-world testbed to conduct cyber-attacks and developed an IDS that uses ML algorithms to detect backdoor, command injection, and SQL injection attacks. Moustafa [32] proposed a testbed architecture that allows the creation of dynamic testbed networks for IoT, enabling the interaction of edge, fog, and cloud tiers, and tested the architecture by executing real-world scenarios.

(2) Few-Shot Learning and Contrastive Training

Figure 1: Overview of our proposed Few-Shot and Self-Supervised framework FS3, which requires only a small amount of labeled data.

Figure 2: Overview of few-shot learning using contrastive training with triplet loss.
(b) and (c), CNN-RND and CNN-Focal fail to correctly classify the samples within the red circle. On the other hand, FS3 demonstrates more accurate classification for most of these samples, except for a few cases where Reconnaissance attacks are misclassified as DoS attacks, as depicted in Figure 3(d). The samples enclosed within the green circle in the ground truth (Figure 3(a)) consist of attack types such as DoS, Reconnaissance, Normal, and Backdoor. In Figure 3(b), CNN-RND misclassifies all the Reconnaissance samples as Normal. Similarly, in Figure 3(c), CNN-Focal also misclassifies these samples. In contrast, FS3 (Figure 3(d)) correctly classifies all the samples within the green circle.

Table 2: Statistics of the WUSTL-IIoT dataset.

The WUSTL-EHMS dataset comprises 43 features, including 35 network flow features and 8 patient biometric features. Samples are labeled as Normal or Attack: based on the Source MAC address feature, records with attacker MAC addresses are assigned a label of 1 and the rest a label of 0. Like the WUSTL-IIoT dataset, the WUSTL-EHMS dataset lacks separate training and testing datasets; since it also lacks a timestamp feature, it is randomly split into training and testing sets. Table 1 presents the distribution of the dataset after the random split (Train and Test). As observed, a large portion of the samples belongs to the Normal class, indicating the prevalence of normal instances in the dataset.
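The random split described above can be sketched with scikit-learn. The stratification, the 80/20 ratio, and the placeholder data are our assumptions for illustration, not the paper's exact protocol:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))             # placeholder feature matrix
y = (rng.random(1000) < 0.05).astype(int)  # imbalanced Attack label (~5%)

# stratify=y keeps the Normal/Attack ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```

Without a timestamp there is no temporal ordering to respect, so a random (here stratified) split is a reasonable way to form Train and Test.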
[45]e the model's effectiveness. The Bot-IoT dataset encompasses four sub-components within the IoT testbed: simulation, networking platform, feature extraction, and forensics analytics. Alongside Normal traffic data, the dataset includes instances of various attack types: DoS, Distributed Denial of Service (DDoS), Reconnaissance, and Theft. In total, the dataset comprises 15 features. Bot-IoT provides both a training dataset and a testing dataset; the distribution of data items across the classes is illustrated in Table 3. Notably, the Normal class has 296 training samples (0.01% of the total) and the Theft class has 52 training samples (0.002% of the total), whereas DDoS and DoS samples account for 53% and 45% of the training dataset, respectively.

4.3 Competing Methods
4.3.1 State-of-the-art Models. CNN-BiLSTM. The CNN-BiLSTM architecture proposed by Sinha et al. [45] consists of multiple layers, including a 1D-CNN layer, batch normalization, and Bi-LSTM layers. The 1D-CNN layer in CNN-BiLSTM utilizes the ReLU activation function and employs a maximum pool size of five. Batch normalization is applied to expedite the training process. Bi-LSTM layers are incorporated throughout the model in a progressive manner, doubling the kernel size at each iteration: the first Bi-LSTM layer comprises 64 units, followed by a second layer with 128 units and a final Bi-LSTM layer with 128 units. The last layer is a fully connected dense layer with the softmax activation function. To evaluate the performance of CNN-BiLSTM,