An Anomaly Detection Model Based on CNN-VAE for IoT Devices

Anomaly detection on IoT devices is a widely studied task in industry. As deep learning methods have developed, they have become a prevailing solution to this task, especially under the common working condition of lacking anomaly samples. Among these methods, the VAE appears promising for complicated data with changing periods and transmission noise in real production environments. We first reconstruct the 1D time series into 2D tensors, as it is hard to apply ordinary data augmentation methods to a time series dataset with continuous semantics. We then use a CNN-VAE model, an improved reconstruction-based anomaly detection method, to compute the reconstruction error in an unsupervised way. The method is lightweight and can be integrated into an end-to-end framework. Compared with other anomaly detection methods on our dataset, ours achieves the best results.


INTRODUCTION
The purpose of anomaly detection is to distinguish between normal and abnormal samples. Anomaly detection is required in many fields, such as medical imaging [18], video surveillance [17], and industrial equipment production [1]. Predicting system anomalies in time, or detecting tiny anomalies within massive data, is a big challenge for traditional anomaly detection techniques. Traditional statistical methods and machine learning-based methods are fast and robust, but they struggle with high-dimensional or insufficiently correlated data [9]. For the outputs of complex systems, complicated patterns and multiple variations are intertwined and do not obey a fixed distribution, which makes them hard for traditional models to judge. Since deep learning performs well in high-level data processing, information mining, and prediction, anomaly detection systems based on deep learning have been widely used [11,14,15].
However, unlike typical anomaly detection problems, anomaly detection on IoT devices faces unique challenges. In many cases we do not merely need to detect whether the current sample is anomalous, because for many IoT devices it makes little sense to detect an anomaly after it has occurred. Instead, the real demand is to predict whether an anomaly will occur in a future period of time. Due to equipment limitations, it is difficult to obtain sufficient anomaly data. Taking the accelerator power supply samples used in this article as an example, the cost of one power anomaly sample is that the power supply stops working once, which makes collecting abnormal data from industrial IoT devices extremely difficult. Industrial IoT devices run continuously, and their data is continuous and semantic, so it is also difficult to perform data augmentation through simple transformations or a GAN. Many devices generate data very quickly but lack the computing power to process it. Furthermore, many IoT devices are cascaded, so the currently abnormal device may not itself be faulty: the condition of another device may have caused the anomaly. The abnormal data can therefore be extremely complex, making it very challenging to predict from daily running data the precise time at which a fault will happen.
Subject to the above conditions, unsupervised anomaly detection is a promising way to predict when anomalies will occur on IoT devices. Since the training process uses only normal samples, such methods avoid the impact of missing fault labels and the scarcity of anomaly records. Similarly, for IoT datasets with multiple exception types, or with exception types that have never occurred, unsupervised anomaly detection methods also work well.
In this paper, we mainly focus on addressing the lack of anomaly datasets. First, we observe that the data cycle of many IoT devices is uncertain, which prevents reconstruction error from working properly. To handle the intricate temporal variations, we transform the original 1D time series into 2D inputs after analyzing the periods. Besides, noise during transmission may confuse a time series model, since recent outliers strongly influence it; reshaping 1D tensors into 2D image-like tensors improves the noise resistance of the model, because a few outliers can hardly change the final classification of an image model. We then train an unsupervised Variational Autoencoder (VAE) model to reconstruct the time series.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes how we reconstruct the data samples and how the VAE works on our dataset. Section 4 reports the experimental results of different models, and the last section summarizes the method and states future work.
The main contributions of this paper are:
• We propose a method of dataset reconstruction using a sliding window, which turns time series data into 2D data and improves the noise resistance of subsequent models.
• We use an unsupervised CNN-VAE model to learn hidden patterns, addressing the problem of lacking abnormal samples.
• We compare different methods on our dataset, and our method performs best.

RELATED WORK
Machine learning is often applied to less complex IoT device anomaly detection: Bayesian networks, k-means, ELM, SVM, etc. Antonio Cansado et al. used Bayesian networks to detect anomalies in the fields of manufacturing and astronomy [2]. Gerhard Munz used k-means to detect traffic anomalies [8]. Annie George [3] applied PCA and SVM to the KDD99 network IDS dataset. However, machine learning methods not only lack strong learning ability but also rely heavily on feature engineering, making them unsuitable for complex multivariate data series.
Deep learning is also commonly used in anomaly detection for IoT devices. Liaqat et al. proposed a CNN and CuDNN-LSTM method to detect botnets [5]. Bhuvaneswari et al. introduced vector convolution to construct an intrusion detection system for fog-based Internet of Things [10]. In addition to supervised models, unsupervised models are also commonly used for anomaly detection. For example, an ICS intrusion detection strategy based on bidirectional GAN (BiGAN) is proposed in [7].
Similar to GANs, the VAE can reconstruct inputs by learning the data distribution, so it is particularly suitable for anomaly detection. Shuyu Lin et al. proposed a VAE-LSTM method to detect anomalies [6]: they use a VAE to form features and then an LSTM to detect. Ji Hun Park et al. also used an LSTM-VAE to handle nuclear power faults [12]. Tuan-Anh Pham et al. proposed MST-VAE to detect anomalies in multivariate time series [13]. Walaa Gouda et al. used a VAE to detect satellite anomalies [4].

THE FRAMEWORK OF OUR METHOD
The whole framework of this IoT device anomaly detection from time series is summarized in Fig. 1.

3.1 Using a Sliding Window to Transform the 1D Time Series into 2D Samples

As a typical IoT dataset, each sample at one time step from the data acquisition system has its original features (temperature, current, etc.). The regularity of the data differs across cycles, and the period of the data may change at any time, so we first need to determine the period. Based on the periodic variability of time series, we propose a method to two-dimensionalize the original data. Take a time series of length T as an example: the original sample is organized along time, and every time point has its own features. We calculate the period as follows. Let F(·) denote the FFT that maps the original data from the time domain into the frequency domain, and A(·) the calculation of the amplitude values; we select the frequency whose amplitude A(F(x)) is largest and use the corresponding period as the sliding-window size to reconstruct the 2D tensors. For every time point, we calculate its frequencies, select the dominant one as the window size, and store the sizes in a list for later use. For example, if at time t1 the period is p1, then the window size at time t1 + p1 is p1. The training samples are obtained in this way, and prediction also maps the data at each time point accordingly. At the junction of two cycles, the data of the new cycle is not enough to fill a whole cycle, so it is inevitable that some wrong data will be used. When a rapid change in the period is detected, we fall back to a fixed window size τ (this article uses τ = 30).
The original data is a 1D time series, containing only information at each moment. Using time series-based models (e.g., LSTM, Transformer) can capture variations and information along time, but Mariya Toneva et al. have shown that the rarity of abnormal data makes it difficult for a network to learn the true patterns [16]. Transforming the 1D time series into 2D allows us to choose balanced samples for training, to capture feature information and delay information simultaneously, and to use more complicated models on the 2D tensors.
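As a minimal sketch of the period estimation and reshaping described above (the function names `dominant_period` and `to_2d`, the fallback value, and the synthetic series are ours for illustration, not the authors' code):

```python
import numpy as np

def dominant_period(x, fallback=30):
    """Estimate the dominant period of a 1-D series from the FFT amplitude spectrum."""
    amps = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x))
    amps[0] = 0.0                       # ignore the DC component
    f = freqs[np.argmax(amps)]
    return int(round(1.0 / f)) if f > 0 else fallback

def to_2d(x, period):
    """Stack consecutive windows of length `period` into a 2-D tensor."""
    n = len(x) // period
    return x[: n * period].reshape(n, period)

rng = np.random.default_rng(0)
t = np.arange(1000)
x = np.sin(2 * np.pi * t / 50) + 0.05 * rng.standard_normal(1000)  # true period 50
p = dominant_period(x)
windows = to_2d(x, p)
```

Each row of `windows` then plays the role of one line of the 2D image-like sample fed to the model.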

The VAE Model
We want the VAE to learn a common probability distribution. Directly using a CNN model in our experiments indicates that such a distribution exists; that is, our dataset follows a common probability distribution from which the VAE can learn latent patterns.
The VAE mainly consists of two parts: an Encoder and a Decoder. The Encoder produces a latent representation that is passed to the Decoder, which is used to reconstruct the input. Mathematically, let qφ(z|x) and pθ(x|z) represent the Encoder and Decoder, respectively. For a sample xᵢ, the Encoder yields a distribution qφ(z|xᵢ) over the latent variable z, and the VAE restores a z sampled from qφ(z|xᵢ) back to a reconstruction x̂ᵢ of the input sample.
Finally, the target ELBO (Evidence Lower Bound) to be optimized is obtained:

log pθ(x) ≥ E_{qφ(z|x)}[log pθ(x|z)] − KL(qφ(z|x) ‖ p(z)).

After simplifying this bound, the loss for a sample xᵢ can be calculated as:

Lᵢ = −E_{qφ(z|xᵢ)}[log pθ(xᵢ|z)] + KL(qφ(z|xᵢ) ‖ p(z)).

The Encoder outputs two parameters, μ and σ², to learn the distribution of the sample. In actual training, log σ² is fitted instead of σ², because σ² must be non-negative while log σ² does not require additional activation function processing.
Through the reparameterization trick, the VAE can be expressed as in Fig. 4. Since our dataset is continuous, the Encoder should be a Gaussian MLP; however, because our two-dimensional data does not have a fixed length, we use a CNN model as the Encoder and another CNN as the Decoder. The CNN model is equivalent to an MLP to some extent, so the results obtained with the CNN model and with a Gaussian MLP should also be equivalent.
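The reparameterization trick and the per-sample loss can be sketched in a few lines (a minimal numpy illustration with Gaussian output, not the actual CNN-VAE training code; function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I): keeps the sampling step differentiable."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def vae_loss(x, x_hat, mu, logvar):
    """Negative ELBO: squared reconstruction error plus KL(q(z|x) || N(0, I))."""
    recon = np.sum((x - x_hat) ** 2)
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon + kl

mu = np.zeros(4)
logvar = np.zeros(4)          # sigma^2 = 1, matching the prior
z = reparameterize(mu, logvar)
loss = vae_loss(np.ones(8), np.ones(8), mu, logvar)   # perfect reconstruction, zero KL
```

Note that fitting `logvar` rather than the variance itself means no activation function is needed to keep σ² non-negative.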
Therefore, the overall framework of our model is as follows. The original 1D data is analyzed by Fourier analysis to determine the most probable period, and a sliding window is used to reconstruct two-dimensional, variable-length data. Only normal 2D tensors are then fed into a VAE that uses a CNN as the Encoder to learn hidden patterns. A threshold is selected by learning from normal samples; when the reconstruction score of a sample exceeds the threshold, the sample is considered abnormal.
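The final thresholding step can be sketched as follows; the exact rule for deriving the threshold from normal scores is an assumption for illustration (the paper only states that it is selected from normal samples):

```python
import numpy as np

def select_threshold(normal_scores, margin=1.1):
    # assumed rule: slightly above the mean reconstruction score on normal training data
    return margin * float(np.mean(normal_scores))

def detect(scores, threshold):
    """Flag a sample as abnormal when its reconstruction score exceeds the threshold."""
    return [s > threshold for s in scores]

thr = select_threshold([1.0, 1.2, 0.8])   # mean score 1.0
flags = detect([0.9, 2.5], thr)           # only the second sample exceeds the threshold
```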
The metrics we use are Precision and Recall, where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
• TP is the number of samples predicted positive and actually positive (here, positive samples are abnormal samples).
• FP is the number of samples predicted positive but actually negative, i.e., the number of false positives.
• FN is the number of samples predicted negative but actually positive, i.e., the number of missed anomalies.
• TN is the number of samples predicted negative and actually negative, i.e., the number of correctly identified negative samples.
• The number of abnormal samples is too small, so Accuracy is not used as an evaluation metric.
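The definitions above translate directly into code (a straightforward helper, written here for clarity):

```python
def precision_recall(y_true, y_pred):
    """Compute Precision and Recall, treating label 1 (abnormal) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```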

EXPERIMENT

4.1 VAE Model with Sliding Window
First, we define an anomalous sample. We need to predict the anomalies of IoT devices in advance, so we set the label of the sample 10 time nodes before an anomaly to 1 (if the prediction comes too late, the IoT device has no time to react, so the failure prediction would have no practical use). That is, for a time series in which a failure happens at a time step t (take 20220707 11:25:31 as an example), and the period of this part of the series is p (assuming p = 10), we take the data at [t − p − 10 : t − 10] as a sample (from 20220707 11:25:11 to 20220707 11:25:21). Under this definition, we select the abnormal time node and choose some time points around it (decided by the FFT analysis) as a sample.
We used a three-layer CNN to model μ and σ, and a two-layer CNN to decode the hidden patterns.

BiLSTM Model without Sliding Window
Considering that anomaly samples are rare and neural networks tend to forget previous samples, we used the following methods to force the BiLSTM model to learn the anomaly pattern. For the LSTM, we selected the abnormal time node and took 200 time points around it as a sample. We removed normal samples in the middle to balance the proportion of abnormal and normal samples while preserving the temporal order. We also asked the BiLSTM to predict failures 10 timesteps in advance.
Then we modified the loss function to increase the influence of recall, that is, the penalty for failing to predict anomalies, so that missed anomalies are weighted more heavily than false alarms.
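One common way to realize such a recall-oriented loss is a positively weighted binary cross-entropy; the sketch below (including the weight value 5.0) is an assumption for illustration, as the paper does not give the exact formula:

```python
import math

def recall_weighted_bce(y_true, p_pred, pos_weight=5.0, eps=1e-7):
    """Binary cross-entropy where missed anomalies (y = 1) are penalized pos_weight times more."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

With `pos_weight > 1`, underestimating the probability of a true anomaly costs more than a false alarm of the same size, pushing the model toward higher recall.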

TCN Model without Sliding Window
We then trained a TCN model (Temporal Convolutional Networks for Anomaly Detection in Time Series) to replace the simple LSTM model for fault detection. We trained the TCN in an unsupervised way, and the IoU is calculated in the same way as for the LSTM model. We adopted the multivariate Gaussian distribution from that paper to model the prediction error, and the anomaly score a_t is modeled as

a_t = (e_t − μ)ᵀ Σ⁻¹ (e_t − μ),

where e_t represents the prediction error, and μ and Σ represent the mean and covariance of the prediction error distribution, respectively. If a point's anomaly score a_t is greater than a threshold, it is classified as an outlier; otherwise it is classified as normal. Similar to the VAE, we use 10% of the average a_t on the training set as the threshold.
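The anomaly score above is the squared Mahalanobis distance of the prediction error, which can be computed as follows (a small numpy sketch; the function name and example values are ours):

```python
import numpy as np

def anomaly_scores(errors, mu, cov):
    """a_t = (e_t - mu)^T Sigma^{-1} (e_t - mu) for each error vector e_t (one per row)."""
    inv = np.linalg.inv(cov)
    d = errors - mu
    return np.einsum('ij,jk,ik->i', d, inv, d)

errors = np.array([[3.0, 4.0], [0.0, 0.0]])
mu = np.zeros(2)
cov = np.eye(2)
scores = anomaly_scores(errors, mu, cov)   # with identity covariance this is the squared norm
```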

Pure CNN Model with Sliding Window
After data preprocessing, we used a plain CNN model to directly detect the anomalies. The input data goes through the same Fourier analysis as for the VAE model. We trained a six-layer CNN model.
We fed positive and negative samples into the network at a 4:1 ratio for training, and used the same ratio for testing.

Results
The results of the experiments are listed in Table 2.
The two experiments with BiLSTM and TCN show that ordinary anomaly detection models without data preprocessing (changing 1D tensors to 2D tensors) can hardly learn the anomaly patterns, whether trained with a supervised or an unsupervised method.
The pure CNN model clearly shows that supervised learning cannot work properly here: even though the data has been preprocessed, there are so few anomaly samples that the neural network can hardly learn the hidden information well.

CONCLUSION
Anomaly detection is an important function for IoT devices. This paper proposes a sliding-window and VAE based model for anomaly detection on IoT devices. By analyzing and modeling the whole time series of an IoT device, it is possible to predict whether anomalies will occur in the future. Through Fourier analysis and conversion to 2D tensors, we preprocess the data into image-like samples, so we can use image models, and the 2D data has some noise resistance. We then reconstruct the data in an unsupervised manner, making full use of the normal data. Our model uses two CNN models, is lightweight, can be ported to IoT devices with low computing power, and guarantees timely prediction.

Figure 2 :
Figure 2: By using FFT to judge the period of the data, we only select the maximum period.

Figure 1 :
Figure 1: Flowchart of time series anomaly detection on IoT devices

Figure 3 :
Figure 3: Sliding window maps the raw data from 1D to 2D tensor.

Figure 4 :
Figure 4: The training process of VAE model.

Table 2 :
Detection Results of Different Methods