Understanding Integrity of Time Series IoT Datasets through Local Outlier Detection with Steep Peak and Valley

With substantial advances in emerging and enabling technologies in IoT sensors, a vast amount of IoT-based environmental data allows preparation for adverse impacts by providing helpful information for predictive and precise services. However, data acquired by IoT sensors can be corrupted by external environmental factors, which can negatively affect the integrity of data interpretation. To address this problem, a prior study proposed outlier detection techniques using transform-based sparse profiles. However, it would lose its worth without an evaluation methodology for data integrity after probing datasets by outlier detection. In addition, it did not consider data with steep peaks or data that is dependent on other data, which is common in real-world scenarios such as soil moisture data used in this paper. Therefore, we propose a process of preprocessing defective soil moisture sensor data using local pattern-based outlier detection (LPOD) and evaluating the integrity of data after outlier detection. Our paper specifically aims to: 1) detect outliers of original soil IoT datasets to eliminate fault data possibly giving wrong decisions using local and global outlier detection (OD); 2) exploit the results of statistical evaluation to determine whether the outliers have been well eliminated; and 3) find the ground truth pattern of soil IoT datasets considering precipitation. Experiments using real-world soil moisture datasets show that the LPOD method outperforms other statistical outlier detection methods, suggesting that the preprocessed data can improve the integrity of IoT datasets.


INTRODUCTION
The rapid development of Internet of Things (IoT) technology has led to the continuous collection of large amounts of data in various places, such as homes, offices, and even agricultural farms. Scientists have focused on data mining of environment-related phenomena, which are known to impact agriculture substantially [3,14,18]. Effectively managing soil moisture data from sensors yields actionable knowledge, such as automation of irrigation, to help farmers [21]. Sensors typically acquire data and record them in time order, thus constituting a time series. The proliferation of time series datasets generated by modern environmental IoT sensors can support informative discoveries and better, faster decisions if farming industries leverage the datasets correctly. However, users need confidence that there are no silent data faults in the acquired real-time IoT datasets, because data faults can adversely affect the integrity of data mining. Therefore, assessing sensor data quality and detecting outliers to improve data integrity play a fundamental role in adopting the corresponding IoT sensors [2,12].
Our solution is to eliminate outliers and assess data fidelity using statistical methods. Efficient and effective outlier elimination strengthens data fidelity by correcting anomalous situations in a timely manner [1,13]. Outlier detection (OD) has become a data analytics field of interest for many researchers. It is now one of the main tasks of time series data analytics in wide-ranging domains [11]. Detecting sensor failures or outliers (anomalies) rests on identifying unusual instances that deviate significantly from the majority of the data. There are various outlier detection techniques, such as statistical-based, distance-based, clustering-based, and density-based methods, to identify and remove abnormal instances. Statistical detection methods have the limitation of high computational cost. They might also suffer from the curse of dimensionality when applied to large datasets. To alleviate these problems, a distance-based detection method has been proposed, which detects outliers by calculating the distance between all data objects. However, this method cannot identify outliers properly when the data distribution is complex [19]. Because of the abovementioned problems, we designed an IoT sensor outlier detection and elimination model that reflects the data characteristics. We then evaluate the reliability of the data after outlier elimination using soil moisture data, which has the statistical characteristic of a peak-and-decline time series and was collected from April 29, 2023, to September 4, 2023, at two farmland sites. The significant contributions of our approach presented in this paper are as follows:
• By directly integrating the environmental datasets collected from soil moisture and weather sensors, we find a ground truth pattern for the outlier detection model.

PRELIMINARIES
2.1 Outliers of Soil Datasets
The soil moisture datasets acquired from real-world orchard sites can contain point and collective outliers, as depicted in Figure 1.
• Point Outliers: This type of outlier is a single data point that lies comparatively far from the rest of the dataset. For example, Figure 1a contains three point outliers, indexed 1, 2, and 3. For illustrative purposes, the point outliers in this figure show relatively high deviations from the original data points.
• Collective Outliers: When a subset of data points is abnormal with respect to the entire dataset, those points are called collective outliers. For example, Figure 1b contains one collective outlier period, spanning the points indexed 1 to 4. This type of outlier is defined as a sequence of data points forming an outlier pattern [8].
Outlier detection (OD) finds the patterns of outliers or faults in datasets whose behavior is not as expected [5,9], like the example anomalous data points in Figure 1. We first explore the statistical techniques that form the fundamentals of OD. Then, we introduce signal transform-based OD and LPOD methods to detect anomalies based on time series data characteristics [15,16].
• Z-score: The Z-score is a statistical measure of how many standard deviations a given observation lies from the mean. We use the Z-score to measure how far a data point lies from the dataset's mean ($\mu$) in terms of standard deviations ($\sigma$):

$$Z\_score(x_t) = \frac{x_t - \mu}{\sigma} \qquad (1)$$

Given $Z\_score(x_t)$ at time $t$, a data point is detected as an outlier if its Z-score is higher than a predefined threshold. Data detected as outliers in this way are replaced with normal values using interpolation. The choice of the threshold is critical because it determines which data points are selected as outliers. For instance, the threshold determines the range of outliers eliminated, and a small threshold can lead to the loss of valid data. Therefore, a trade-off must be considered when setting the threshold so that normal data is preserved.
• Transform-based: The transform-based approach exploits spatial-temporal data characteristics to detect outlier patterns. This technique capitalizes on the fact that the overall patterns in time-series datasets are spatiotemporally smooth. In that case, transformation techniques can be more effective because the transformed data usually reveal the data's correlation explicitly [17,23]. For outlier elimination, we apply the inverse transform to the selected dominant coefficients to reconstruct data without outliers.
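The Z-score rule and the interpolation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the default threshold of 3.0, the use of the absolute Z-score, and the population standard deviation are all assumptions made for the example.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices whose absolute Z-score exceeds the threshold.

    Assumptions: absolute Z-score, population standard deviation,
    and a default threshold of 3.0 (none are specified in the text).
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return []  # constant series: no point deviates from the mean
    return [i for i, x in enumerate(values) if abs((x - mu) / sigma) > threshold]


def interpolate_outliers(values, outlier_idx):
    """Replace flagged points by linear interpolation between the
    nearest non-flagged neighbours (edge points copy the nearest one)."""
    bad = set(outlier_idx)
    good = [i for i in range(len(values)) if i not in bad]
    result = list(values)
    if not good:
        return result  # nothing reliable to interpolate from
    for i in sorted(bad):
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            result[i] = values[right]
        elif right is None:
            result[i] = values[left]
        else:
            w = (i - left) / (right - left)
            result[i] = values[left] + w * (values[right] - values[left])
    return result
```

Lowering `threshold` flags more points, which illustrates the trade-off mentioned above: a small threshold removes more outliers but may also discard valid data.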

Time-Series Data Characteristics
This study aims to detect and correct sensor outliers and then statistically validate the accuracy of the data. The data collected by soil moisture sensors exhibit steep peaks and valleys, and the values differ for each interval of the data. It is essential to consider these characteristics. In particular, the collected data may exhibit various characteristics such as trends and seasonality, which can be summarized as follows [22]:
• Trend: The soil moisture data collected directly in this study show an increasing trend over time, as indicated by the declining points (1 to 3) in Figure 2. Each declining point marks the time just before the next rainfall; taken together, they indicate that the soil moisture is increasing over time.
• Seasonality: Soil moisture data show a pattern in which the value increases at a specific point, such as the marked interval in Figure 2, and then gradually decreases. The cycle is the time from the start of one precipitation event to the start of the subsequent one.
Under these considerations of time series characteristics, the data must satisfy stationarity. Stationarity implies that the data exhibit constant mean and variance over time and lack trends or seasonality. However, data collected in the real world often exhibit trends and seasonality. This research involves preprocessing the data to transform it into a stationary form and then using statistical testing methods to validate this transformation. Converting the data into a stationary form eliminates trends and seasonality, making it suitable for applying statistical models. This process aims to enhance data reliability and enable more accurate analysis and modeling.

OUTLIER DETECTION APPROACH
Figure 3 shows the overall process for detecting outliers and assessing data validity, which consists of four steps. As shown in Figure 3, we first detect outliers in raw data using OD methods such as Z-score, transform-based OD, and LPOD. After removing anomalies, we verify stationarity so that the ARIMA model can be fitted to the data. After fitting the ARIMA model, we compare the performance of each data processing method using the Root Mean Square Error (RMSE) value. The outcome of this analysis suggests the most effective OD method and data validation method.

Transform-based OD
We adapt transform-based OD (proposed in [16]) in conjunction with the three OD techniques in Section 2.1. In detail, the transform-based OD approach begins by transforming the original datasets using the Discrete Cosine Transform (DCT). Let $\hat{x}$ denote the transformed components of the original dataset ($x$) after DCT; we use it to model the correlation between the transformed coefficients and the energy (or information) they represent.
Thus, each coefficient component has its energy coefficient defined as $E(\hat{x}, k)$. $E(\hat{x}, k)$ is formulated as the energy concentration ($E$) contained in the first $k$ coefficient components out of the entire set of transformed components ($\hat{x}$), which is calculated as:

$$E(\hat{x}, k) = \frac{\sum_{i=1}^{k} \hat{x}_i^2}{\sum_{i=1}^{N} \hat{x}_i^2}, \quad i = 1, 2, \ldots, k, \quad k \le N \qquad (2)$$

Here, $k$ refers to the number of dominant coefficients needed to represent the related datasets [14], and $N$ is the total number of transformed components. When data is reconstructed using only the $k$ dominant coefficients, data fidelity is improved because outliers are removed.
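To make the transform-based reconstruction more concrete, the sketch below computes a DCT, measures the energy concentration of the leading coefficients, and reconstructs the signal from those coefficients only. It is an illustration under stated assumptions rather than the method of [16]: the naive $O(N^2)$ orthonormal DCT-II, the choice of the first $k$ (lowest-frequency) coefficients as the dominant ones, the target energy of 0.99, and all function names are assumptions.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a sequence (naive O(N^2) version)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def energy_concentration(coeffs, k):
    """Share of the total energy held by the first k coefficients."""
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:k]) / total if total else 0.0


def reconstruct_with_dominant(x, target=0.99):
    """Keep the smallest k whose energy concentration reaches `target`
    (an assumed value), zero the remaining coefficients, and invert."""
    coeffs = dct(x)
    k = 1
    while k < len(coeffs) and energy_concentration(coeffs, k) < target:
        k += 1
    kept = coeffs[:k] + [0.0] * (len(coeffs) - k)
    return idct(kept), k
```

Because a smooth signal concentrates its energy in a few low-frequency coefficients, discarding the remaining coefficients mainly suppresses abrupt deviations such as isolated spikes.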

Local pattern-based OD
The soil moisture data has steep peaks and valleys, and the values differ for each data interval, as shown in Figure 4. In this case, if outliers are removed by considering the entire dataset, the characteristics of each interval cannot be reflected. Therefore, global algorithms such as Z-score do not correctly remove outliers in data with such complex patterns and distributions. We propose an LPOD algorithm that reconstructs data without outliers by investigating the difference between adjacent observations in a local subset of the data, not in the entire dataset. We use a sliding window technique to move the data interval and detect outliers in each interval [20]. The window size is set as a percentage of the total dataset. We set the normal range for each interval, as shown by the bands in Figure 4, and identify data that exceed this range as outliers. To construct the relationship between the data statistically and to determine which data fall in the normal range, we calculate the mean and standard deviation of the data in each interval, as in Equations 3 and 4. Next, the upper and lower bounds of the normal range are set from the mean, the standard deviation, and a threshold value, as in Equation 5. Finally, data that exceed the upper or lower bound are detected as outliers, as in Equation 6. The data detected as outliers are removed and reconstructed into data without outliers through linear interpolation.

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (3)$$

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2} \qquad (4)$$

$$upper,\ lower = \mu \pm threshold \cdot \sigma \qquad (5)$$

$$outlier = \{\, x_i \mid x_i > upper \ \text{or} \ x_i < lower \,\} \qquad (6)$$

Here, $x_i$ denotes the $n$ observations in the current window.
In summary, LPOD sets data intervals by moving a window over the entire dataset to reflect the different characteristics of each interval. A normal range band is estimated for each interval through statistical calculations, as shown in Figure 4. Data that exceed this range are detected as outliers and marked as red points. The detected outliers are removed, and the data is reconstructed using linear interpolation.
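The LPOD steps summarized above (windowing, a per-window normal band, and linear interpolation) can be sketched as follows. This is a minimal illustration, not the authors' code: the non-overlapping window step, the window fraction of 0.1, the threshold of 2.0, the population standard deviation, and the function names are assumptions, since the text does not specify them.

```python
import statistics


def lpod_outliers(values, window_frac=0.1, threshold=2.0):
    """Flag points outside mean +/- threshold * std of their own window.

    Assumptions: consecutive non-overlapping windows whose size is
    `window_frac` of the series length; both defaults are illustrative.
    """
    n = len(values)
    size = max(2, int(n * window_frac))
    flagged = []
    for start in range(0, n, size):
        window = values[start:start + size]
        if len(window) < 2:
            continue  # a single trailing point has no local statistics
        mu = statistics.fmean(window)
        sigma = statistics.pstdev(window)
        upper = mu + threshold * sigma
        lower = mu - threshold * sigma
        flagged.extend(
            start + j for j, x in enumerate(window) if x > upper or x < lower
        )
    return flagged


def linear_fill(values, flagged):
    """Rebuild flagged points by linear interpolation between the
    nearest unflagged neighbours (edge points copy the nearest one)."""
    bad = set(flagged)
    good = [i for i in range(len(values)) if i not in bad]
    result = list(values)
    if not good:
        return result
    for i in sorted(bad):
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            result[i] = values[right]
        elif right is None:
            result[i] = values[left]
        else:
            w = (i - left) / (right - left)
            result[i] = values[left] + w * (values[right] - values[left])
    return result
```

A value can lie well inside the global range of a series and still fall outside the band of its own window, which is the situation a global Z-score misses and a local band is meant to catch.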

Stationarity Evaluation
After the OD process, we perform statistical analysis to evaluate the data's reliability. The data with outlier patterns eliminated is closer to stationarity because it has more normal patterns than the original data. Therefore, it is necessary to determine whether the data with the outliers eliminated is normal and to compare it with the raw data regarding stationarity. To verify stationarity, we use the ADF test, the KPSS test, and ACF graphs.
The ADF test is a unit root test for time series. If a unit root exists, the time series is not stationary. The null hypothesis of the ADF test is that the time series has a unit root, and the alternative hypothesis is that it does not. The null hypothesis can be rejected if the p-value of the ADF test is less than the significance level; in that case, the time series can be regarded as stationary [7]. In this study, the significance level is set to 0.05, the level commonly used in statistics, so the probability of wrongly rejecting a true null hypothesis is 5 percent. However, the ADF test result can vary depending on properties of the data, such as the sample size, so it is not ideal to use it alone. Therefore, it is desirable to judge stationarity by considering the ADF test results in conjunction with the KPSS test and the ACF graphs.
The KPSS test evaluates whether a time series is stationary around a constant level or a deterministic trend [6]. It evaluates stationarity from a different perspective than the ADF test because its null hypothesis is the opposite of that of the ADF test: the null hypothesis of KPSS is that the time series is stationary. Therefore, if the p-value of the KPSS test is greater than the significance level, the time series is considered stationary. An autocorrelation function (ACF) graph is a statistical graph that measures the autocorrelation of time series data. Autocorrelation is the correlation between data points separated by a given time interval or lag [10]. The ACF graph of data that satisfies stationarity should have autocorrelation values within the confidence interval (shown in sky blue in Figure 5) and converge to zero quickly.
The statistical validation process to determine the stationarity of data is as follows. First, if the ADF p-value is less than 0.05 and the KPSS p-value is greater than 0.05, the time series is provisionally considered stationary. Then, if the autocorrelation converges to a value close to 0 in the ACF graph, the data is finally considered stationary.
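The two-stage check described above can be sketched in code. The snippet only encodes the decision rule and a sample ACF; it does not implement the ADF or KPSS tests, whose p-values are assumed to come from a statistics package. The function names and the approximate 95% band of $\pm 1.96/\sqrt{n}$ are assumptions made for the example.

```python
import math
import statistics


def acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(values)
    mu = statistics.fmean(values)
    denom = sum((x - mu) ** 2 for x in values)
    result = []
    for lag in range(max_lag + 1):
        num = sum((values[t] - mu) * (values[t + lag] - mu) for t in range(n - lag))
        result.append(num / denom if denom else 0.0)
    return result


def lags_outside_band(acf_values, n):
    """Lags (excluding lag 0) whose autocorrelation leaves the
    approximate 95% band +/- 1.96 / sqrt(n) (an assumed band)."""
    band = 1.96 / math.sqrt(n)
    return [lag for lag, r in enumerate(acf_values) if lag > 0 and abs(r) > band]


def is_provisionally_stationary(adf_pvalue, kpss_pvalue, alpha=0.05):
    """First stage: ADF rejects a unit root and KPSS does not reject
    stationarity. The p-values are supplied by the caller."""
    return adf_pvalue < alpha and kpss_pvalue > alpha
```

A series would pass the full check only if the first-stage rule holds and few or no lags beyond the first handful fall outside the band.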

ARIMA Evaluation
We use the ARIMA model to validate the reliability of soil moisture data for each outlier elimination method. Since the ARIMA model can evaluate accuracy by predicting the data and comparing the predicted values with the observed values, we can assess data validity by evaluating the model accuracy on data that has stationarity. For example, in the case of original data with outliers, the specific patterns of the data may be distorted by the outliers described in Section 2.1, making it difficult to fit the model correctly. In contrast, data with stationary time series characteristics is expected to show a clearer pattern than the original data and to fit the model better. As a result, the validity of the data can be evaluated by comparing the accuracy obtained on the original data with that obtained on the stationary data.
The data need to exhibit stationarity to be fitted to the ARIMA model. However, data measured with soil moisture sensors are non-stationary time series. Therefore, before fitting the ARIMA model, it is crucial to determine whether the data is stationary using the stationarity evaluation indicators in Section 3.3. Root Mean Square Error (RMSE) is used as the evaluation metric, and its formula is given in Equation 7 [4]. RMSE is a standard metric used to assess the difference between predicted and observed values:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (7)$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of predictions. This metric is employed to evaluate the performance of ARIMA models, with lower values indicating more accurate predictions by the model.
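As a small worked illustration of the RMSE metric, the following sketch computes it for two equally long sequences of observed and predicted values. The function name and the input validation are assumptions made for the example.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = ((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sum(squared) / len(observed))
```

For example, observed values of 0 and 0 with predictions of 3 and 4 give a mean squared error of (9 + 16) / 2 = 12.5, so the RMSE is the square root of 12.5.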

RESULTS AND DISCUSSION
4.1 Datasets
To evaluate the proposed process for detecting outliers in soil moisture sensor data, we collect a real dataset from IoT stations installed at two farmland sites in South Korea: Andong and Uiseong. The data used in this study are collected from soil moisture sensors and weather sensors. The weather sensor collects data on temperature, wind direction and speed, ground temperature, relative humidity, solar radiation, sunshine hours, and precipitation. The soil moisture data are a total of 35,946 data points collected at 5-minute intervals for 155 days from April 29, 2023, to September 4, 2023. The weather sensor data are 1,406 data points collected at 1-hour intervals for 84 days from May 1, 2023, to July 3, 2023. Each sensor is continuously monitored, and its data are collected at each measurement interval.

Stationarity Evaluation
Three statistical tests were performed to determine the stationarity of data collected from IoT sensors. The data used for the tests included the original data without preprocessing and the data with outliers eliminated using the three OD techniques described in Section 2.1. Table 1 shows the results of the ADF and KPSS tests. We can make several observations from these results. In the case of ADF, if the p-value is less than the significance level of 0.05, the data is considered stationary. In the case of KPSS, if the p-value is greater than 0.05, the data is considered stationary. However, all data showed non-stationarity in the KPSS test, so they were considered non-stationary. Most time series datasets measured and observed in reality are non-stationary, as shown above. Therefore, differencing, which subtracts the previous value in a time series from the current value, was applied, and the stationarity of the data was reevaluated. This step stabilizes the mean of the series over time and reduces its time dependency, which helps the series achieve stationarity. Table 2 shows the results after differencing for all datasets. As we can see, the ADF and KPSS tests meet the significance level conditions, so the data are considered stationary. Finally, we visualize the stationarity of the data through the ACF graphs. In Figure 5, the x-axis of the graph, lag, represents how many time steps the compared data point lies before the current data point. For example, if the data is measured at 1-hour intervals and the lag is set to 2, the compared point is 2 hours before the current data point, and the y-axis of the graph, ACF, gives the correlation between the current data point and the data point 2 hours before. At lag 0, the ACF is always 1 because it is the autocorrelation of the current data point with itself. Therefore, lag 0 is excluded when the graph is analyzed.
In Figure 5, the original data (5a) shows significant deviations from the confidence interval around lags 10 and 20. Therefore, it is evaluated as non-stationary. The data with outliers removed using the three preprocessing methods converge stably after a specific lag (Z-score (5b): 8, transform-based OD (5c): 27, LPOD (5d): 6), as shown in Figure 5. Therefore, the data can be evaluated as stationary when using the three techniques of Z-score, transform-based OD, and LPOD. In Section 4.2, we conducted the same stationarity experiments using datasets from the Andong and Uiseong sites. The experimental results were similar, so we include only the results for the datasets collected at Andong in this paper.
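The differencing step mentioned above is simple enough to show directly. The sketch below subtracts the value `lag` steps earlier from each value; the function name and the `lag` parameter are assumptions made for the example.

```python
def difference(values, lag=1):
    """Return the differenced series: values[i] - values[i - lag]."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    return [values[i] - values[i - lag] for i in range(lag, len(values))]
```

Applied to a series with a linear trend, such as 2, 4, 6, 8, first-order differencing yields the constant series 2, 2, 2, which shows how the operation removes a trend in the mean. Note that the result is shorter than the input by `lag` points.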

ARIMA Evaluation
After confirming the stationarity of the data, statistical analysis-based models can be used to assess model accuracy by comparing predicted values to observed values. To employ the ARIMA model, the data are divided into training and testing sets at an 8:2 ratio, with RMSE used to gauge model performance. Table 3 and Figure 6 present the training outcomes with the ARIMA model. Figure 6 shows the overall patterns of the original data and of the data after outlier elimination, and whether the model accurately predicts the test values. In this paper, we present only the experimental results of the OD method with the best RMSE performance. To assess this, it is essential to check visually whether the test and predicted curves match and to review the RMSE values in Table 3. A lower RMSE value signifies that the model makes more precise predictions. Accurate predictions imply that the data exhibits a discernible pattern, affirming the validity of the data by minimizing the influence of trends and seasonality from a statistical modeling perspective. In both Andong and Uiseong, as shown in Table 3, the data with outliers removed using LPOD shows the best performance. The original data (6a, 6c) shows point and collective outliers throughout, so the test and predicted curves do not match. In LPOD (6b, 6d), both outlier phenomena are removed; we can confirm this because the two curves match. However, the part where the collective outlier was removed in 6d looks unnatural because it is reconstructed with a simple linear interpolation algorithm, so there is room for improvement.
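The 8:2 evaluation protocol described above can be sketched as follows. The snippet does not implement ARIMA: a naive last-value forecaster stands in as a placeholder so that the split and the RMSE scoring can be shown end to end. All function names are assumptions, and a real ARIMA forecaster would be passed in place of `naive_forecast`.

```python
import math


def chronological_split(values, train_ratio=0.8):
    """Split a time series in time order (no shuffling)."""
    cut = int(len(values) * train_ratio)
    return values[:cut], values[cut:]


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = ((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sum(squared) / len(observed))


def naive_forecast(train, horizon):
    """Placeholder forecaster (NOT ARIMA): repeat the last training value."""
    return [train[-1]] * horizon


def evaluate(values, forecaster=naive_forecast, train_ratio=0.8):
    """Fit on the first 80% (by default) and score forecasts on the rest."""
    train, test = chronological_split(values, train_ratio)
    predicted = forecaster(train, len(test))
    return rmse(test, predicted)
```

Running `evaluate` on the original series and on each outlier-cleaned series with the same forecaster gives directly comparable RMSE values, which mirrors the comparison reported in Table 3.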

Finding Ground Truth Pattern
The current data shows outlier patterns, as shown in Figures 6a and 6c. Outlier patterns correspond to data instances with the two different types of outliers described in Section 2.1. Removing the segments that show these outliers makes it possible to identify the regular patterns or cycles in the data. These patterns can be used as an important feature in classification or regression problems. In Figure 7, the ground truth pattern has very steep peaks and valleys, and the values differ for each interval, which can vary depending on external factors such as environmental impacts. We applied various outlier techniques to identify this pattern, and the data with outliers removed by LPOD showed the highest performance.
The data patterns with outliers removed can be analyzed by considering the associated data together. Precipitation, in particular, plays a central role in supplying water to the soil. Therefore, we can utilize the relationship between soil moisture and precipitation data. Figure 7 shows the results of mapping precipitation data to soil moisture data after removing outliers. The blue dots in Figure 7 represent occurrences of precipitation. When precipitation occurs, steep peaks and valleys appear in the soil moisture data. In addition, the values of the peaks and valleys in the soil moisture data vary depending on the frequency and amount of precipitation. Therefore, a positive relationship was observed between precipitation and soil moisture data, which can be seen as a ground truth pattern.

CONCLUSION
With the adoption of IoT sensors, the data collected can be utilized for various purposes, such as data mining, classification, and prediction. However, the collected data may have defects due to the influence of external environments. Therefore, this study evaluated various outlier detection techniques (Z-score, transform-based OD, LPOD) on real-world data to detect and eliminate outliers. The soil moisture data has various characteristics, such as steep peaks and different peak and valley values in each interval. Therefore, we propose an LPOD algorithm that takes these characteristics into account. To assess the validity of the data with outliers removed, we evaluated the stationarity of the data and analyzed model accuracy by fitting an ARIMA model to it.
The results showed that LPOD was the most effective method for outlier detection and improved results over the original dataset in all measures, including ADF, KPSS, ACF, and RMSE. In addition, we could find regular patterns in the data with outliers removed. We found that precipitation data are positively correlated with soil moisture data, allowing us to find the ground truth pattern.
• We propose a local pattern-based outlier detection (LPOD) algorithm that detects outliers based on local data. The soil moisture data in this study has steep peaks and valleys, and the values differ for each data interval. LPOD can reflect these data characteristics, showing higher outlier detection performance than algorithms that detect outliers based on global datasets (Z-score and transform-based).
• After detecting outliers and correcting or eliminating the data, we present a method to evaluate whether the data has been corrected accurately using statistical validation methods such as the Augmented Dickey-Fuller (ADF) test, Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, Auto Correlation Function (ACF), and Autoregressive Integrated Moving Average (ARIMA).
• To assess data reliability, we extensively evaluate our outlier elimination approach based on several outlier detection models, including Z-score, transform-based, and LPOD. Our results demonstrate that LPOD presents superior prediction accuracy measured in RMSE compared to statistical outlier detection algorithms based on Z-score and transform-based OD.

Figure 1: An illustration of two types of data outliers: (a) point and (b) collective outliers.

• Local Pattern-based OD: The local pattern-based OD method detects outlier patterns by leveraging the characteristics of the soil moisture data used in this paper. The data has steep peaks and valleys, and the values in each data interval are different. The values of point outliers and collective outliers vary depending on the interval. Therefore, statistical techniques that consider the data locally are needed to detect outliers.

Figure 2: Characteristics of time series data from a soil moisture sensor.

Figure 3: The process of detecting outliers and evaluating the validity of data.

Figure 4: Outliers of soil moisture data in Uiseong detected by LPOD.

Figure 5: The ACF plot of the data processed by different outlier detection techniques: (a) original, (b) Z-score, (c) transform-based OD, and (d) LPOD.

Table 1: ADF and KPSS test results according to preprocessing method.

Table 2: ADF and KPSS test results after differencing.

Table 3: The ARIMA model's forecast results, with RMSE as the performance indicator.