Engineering Industry-Ready Anomaly Detection Algorithms

The practical value of anomaly detection algorithms that are engineered and tested on open data is often low, as their real-world applications are rare. The underlying reason is a lack of consideration for practical needs (i.e., the research context). Additionally, the validity of these algorithms is a concern due to the absence of a proper research method being followed. This paper reports how we considered the research context and followed the Design Science paradigm to engineer our algorithm. In this way, we address a real-world application: automatic marine data quality control.


INTRODUCTION
Anomaly Detection (AD) is the process of identifying data instances that significantly deviate from the majority of measurements, and it has many real-world applications [1]. AD has been an active and popular research topic for over six decades, but there are concerns regarding the validity and significance of current research advancements [1, 2]. Below, we outline four major problems.
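To make the notion of "deviating from the majority" concrete, a minimal deviation-based detector can be sketched as follows. This z-score rule is a textbook illustration only, not any of the algorithms discussed in this paper, and the data and threshold are invented:

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Flag indices of points that deviate strongly from the sample mean.

    A toy illustration of deviation-based AD; real algorithms are far
    more sophisticated (see [1] for a survey)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# One obvious outlier among otherwise stable readings (invented data).
data = [10.1, 10.3, 9.9, 10.0, 45.0, 10.2]
print(zscore_anomalies(data, threshold=2.0))  # → [4]
```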
Firstly, the majority of proposed AD algorithms were engineered and tested on open data with "a strong implicit assumption that doing well on one of the public datasets is a sufficient condition to declare an anomaly detection algorithm useful (and therefore warrant publication or patenting)" [2]. Unfortunately, many open datasets have been shown to be flawed, so the validity of those algorithms becomes questionable [2]. Open data was usually labeled by AD researchers who are not experts in the domains where the data was collected, and the labels were not validated with domain experts.
Secondly, the majority of AD studies cherry-pick when comparing with state-of-the-art (SOTA) approaches [1]. In the literature, almost every algorithm is shown to achieve the highest accuracy among all competing algorithms with respect to certain metrics, convincing readers that it is the best so far. However, it is unclear how the competitors were selected for comparison, which raises the concern that better algorithms were intentionally ignored or not reported.
Thirdly, it is unclear whether AD algorithms can be used in real-world systems. Taking the THOC approach [3] as an example: compared against 13 competitors, it achieved the highest accuracy on the power-demand and 2D-gesture datasets, obtaining F1 scores of 45.68 and 63.31, respectively (see Table II of [3]). These numbers are relatively low, given that 100 is the perfect score in the study's context, raising concerns regarding its practical readiness.
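For context, the F1 metric referenced above is the harmonic mean of precision and recall. The helper below (with invented detection counts) shows how such a score is computed on the 0-100 scale used in Table II of [3]:

```python
def f1_score(tp, fp, fn):
    """F1 on a 0-100 scale: harmonic mean of precision and recall.

    tp/fp/fn are counts of true positives, false positives, and
    false negatives; the example counts below are invented."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)

# A detector that misses half the anomalies and raises as many false
# alarms as correct detections scores only 50 out of 100.
print(f1_score(tp=5, fp=5, fn=5))  # → 50.0
```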
Lastly, in many cases, AD algorithms were motivated and developed without considering the needs of practice. The majority of algorithms output anomaly scores so as to be independent of specific thresholds, which may lead to potential bias [1, 4]. However, these numbers convey no meaning to end users, who still have to decide on a threshold to filter anomalies from a subject dataset.
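The burden left to end users can be illustrated with a toy snippet: given raw anomaly scores, a threshold (here an invented value) must still be chosen before any measurement is actually labeled anomalous:

```python
def label_anomalies(scores, threshold):
    """Convert raw anomaly scores into binary anomaly labels.

    The scores themselves carry no inherent meaning; the user-chosen
    `threshold` (illustrative here) determines what gets flagged."""
    return [s > threshold for s in scores]

scores = [0.12, 0.91, 0.40, 0.95, 0.20]  # invented scores from some detector
print(label_anomalies(scores, threshold=0.8))  # → [False, True, False, True, False]
```

A different threshold choice (e.g. 0.3) would flag a different set of points, which is precisely the decision the algorithms leave unresolved.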
Engineering an AD algorithm is fundamentally a software engineering (SE) research task, as an AD solution can be used to improve SE practice, as shown in works like [5]-[7]. Developing SE solutions must be driven by real-world requirements and constraints through collaboration with practitioners [8]. It is important to consider the context of research; otherwise, the chance of a solution being applied in practice is low [9]. This problem is visible in AD research. Dozens of algorithms have been engineered and tested on open data every year over the last few decades [10]. However, reports of their real-world applications are rare. Furthermore, as analyzed above, most AD studies did not follow any proper research method, at least regarding comparative studies, making it difficult to assess their research contribution [11].
In our previous study [12], AD was identified as a promising solution for automating marine data quality control. The current process is carried out manually; it takes up to six months to complete and is very subjective [13]. Once automated, the time to data publication will be shortened, and the likelihood of bad data remaining in final datasets due to human mistakes will decrease. Figure 1 illustrates our study's research process using the Design Science paradigm [11], which is detailed in Section 2.

Figure 1: Research process
There is an abundant number of AD algorithms for time-series data, so we first checked whether an existing one meets the industrial requirements, to avoid reinventing the wheel. We relied on the work of [1], which open-sourced 71 state-of-the-art AD algorithms for time-series data published over the last three decades, covering all possible AD techniques. We selected the 38 algorithms specifically designed for univariate time series. We reused the hyperparameters optimized by [1] and, in total, executed 3,900 experiments per dataset. The benchmark revealed that none of the benchmarked algorithms achieved the accuracy expected by the industry.
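A benchmarking campaign of this kind can be sketched as a simple loop over candidate algorithms and datasets, keeping only those that meet an accuracy requirement. All names here (`run_algorithm`, `required_f1`) are hypothetical placeholders, not the actual harness of [1]:

```python
def benchmark(algorithms, datasets, run_algorithm, required_f1):
    """Run every candidate on every dataset; keep those that meet the
    accuracy requirement on all datasets.

    `run_algorithm(algo, dataset)` is a hypothetical callable returning
    an F1 score on a 0-100 scale; `required_f1` stands in for the
    industrial accuracy requirement."""
    passing = []
    for algo in algorithms:
        scores = [run_algorithm(algo, ds) for ds in datasets]
        if all(s >= required_f1 for s in scores):
            passing.append(algo)
    return passing
```

In the study's case, this kind of filter returned an empty list: no benchmarked algorithm cleared the industrial bar, motivating a new design.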
We decided to engineer our own AD solution. We analyzed the benchmarked algorithms with respect to our data to identify computation mechanisms that could give our algorithm good performance and to understand their weaknesses, which we translated into the four technological rules below.
Our benchmark shows that (1) semi-supervised learning algorithms achieve higher accuracy than unsupervised ones, and (2) using forecasting techniques enhances the performance of semi-supervised algorithms. In addition, (3) the tested algorithms cannot adapt to unforeseen changes in data distribution, formally known as concept drift [14]. Concept drift happens quite often in marine sensor data. Underwater sensors operate on limited battery resources, and as the battery drains, more noise is generated [15]. Biofouling, the accumulation of microorganisms, plants, or algae on sensors, changes measurement results unexpectedly and can completely cover a sensor within one year [16]. Finally, (4) the AD research community assumes that anomalies account for less than 1% of subject data [1, 2], but this ratio is far too low for our data, where anomalies account for up to 19.2%. The majority of AD algorithms, engineered and tested on open data, adhere to that rule, so they cannot adequately capture enough anomalies in our data.
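Rule (2), combining semi-supervised learning with forecasting, can be sketched as follows: fit a forecaster on data assumed normal, then flag test points whose forecast residual is unusually large. This is a hypothetical illustration using a moving-average forecaster, not AdapAD itself; the window size, sigma multiplier, and data are invented:

```python
import statistics

def forecast_residual_detector(train, test, window=3, k=3.0):
    """Sketch of a forecasting-based semi-supervised detector.

    Forecast each point as the moving average of the previous `window`
    normal-looking values; flag points whose residual exceeds k sigma of
    the residuals observed on the (assumed-normal) training data."""
    def residuals(series):
        return [series[i] - sum(series[i - window:i]) / window
                for i in range(window, len(series))]

    sigma = statistics.pstdev(residuals(train)) or 1e-9
    anomalies = []
    history = list(train[-window:])
    for i, v in enumerate(test):
        pred = sum(history[-window:]) / window
        if abs(v - pred) > k * sigma:
            anomalies.append(i)
        else:
            history.append(v)  # only extend history with normal points
    return anomalies

train = [10.0, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 10.1]  # invented normal data
test = [10.0, 10.1, 15.0, 10.0]
print(forecast_residual_detector(train, test))  # → [2]
```

Excluding flagged points from the forecast history is one simple way such a detector can tolerate drift: the model keeps tracking normal behavior without being corrupted by anomalies.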
Besides those technological rules, we also integrated the normal operation value range of the sensors. The global value range check is a major check within the marine data quality control process [17]: measurements whose values fall outside the normal operation range are reported as bad data. The knowledge obtained was used to develop Adaptive Anomaly Detection (AdapAD).
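The global value range check itself is straightforward to express in code. The snippet below mirrors the idea described above; the bounds are illustrative examples, not real sensor specifications:

```python
def global_range_check(measurements, low, high):
    """Return indices of measurements outside the sensor's normal
    operation value range, in the spirit of the global value range
    check of marine data QC [17]. Bounds here are hypothetical."""
    return [i for i, v in enumerate(measurements) if not (low <= v <= high)]

# e.g. a temperature sensor plausibly bounded to [-2, 40] deg C (illustrative)
print(global_range_check([12.5, 13.0, -8.0, 55.2, 14.1], low=-2.0, high=40.0))  # → [2, 3]
```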
We validated AdapAD in two respects: technical performance and user acceptance. Technically, AdapAD not only satisfied the two industrial requirements but also achieved higher accuracy than 40 competing AD algorithms, comprising the 38 benchmarked ones and two identified via snowballing that rely on [1] for validation; see https://bit.ly/3UaVn8Y. We also collected user feedback on AdapAD's usefulness for automatic marine data quality control.
We gave a 10-minute presentation at a bi-annual workshop attended by 11 marine organizations. The presentation covered the high-level design of AdapAD and examples of erroneous measurements that AdapAD accurately detected. We collected feedback through questionnaires using a scale from 1 to 5 (1: very unlikely, 5: very likely) for two questions: (1) How would you rate the likelihood that AdapAD can speed up data quality control? (2) How would you rate the potential of using AdapAD for automatic data quality control? We received average scores of 4.3 for Q1 and 4.4 for Q2, indicating positive user acceptance.
Summarizing: (i) we analyzed 38 state-of-the-art algorithms with respect to our data to learn and utilize their strengths and address their weaknesses; (ii) we integrated common practices of marine data quality control while engineering AdapAD; (iii) AdapAD outperformed 40 state-of-the-art AD algorithms; (iv) we conducted user acceptance surveys with domain experts in automatic data quality control and received positive assessments; and (v) AdapAD was used to validate the data collected from the One Ocean Circumnavigation [18].