AutoOD: Automatic Outlier Detection

Outlier detection is critical in real-world applications. Because the many existing outlier detection techniques often return different results for the same data set, users must determine which technique is best suited to their task and tune its parameters. This is particularly challenging in the unsupervised setting, where no labels are available for the cross-validation needed for such method and parameter optimization. In this work, we propose AutoOD, which uses existing unsupervised detection techniques to automatically produce high-quality outlier detection results without any human tuning. AutoOD's fundamentally new strategy unifies the merits of unsupervised outlier detection and supervised classification within one integrated solution. It automatically tests a diverse set of unsupervised outlier detectors on a target data set and extracts signals from their combined detection results that reliably capture key differences between outliers and inliers. It then uses these signals to produce a "custom outlier classifier" whose accuracy is comparable to supervised outlier classification models trained with ground truth labels, without having access to such labels. On a diverse set of benchmark outlier detection datasets, AutoOD consistently outperforms the best unsupervised outlier detector selected from hundreds of detectors. It also outperforms other tuning-free approaches by 12 to 97 points (out of 100) in F-1 score.


INTRODUCTION
As data volumes continue to grow with the rise of social networks, digital currency, smart phones, connected vehicles, and other devices, there is an increasing need for techniques to support the discovery of outliers in data. Outliers correspond to rare items, events, or observations which differ significantly from the majority of the data [3] and may indicate a problem, such as fraud, malfunctioning devices, or future catastrophic failures. With financial fraud causing a multibillion dollar loss to the global economy each year [7] and Internet of Things (IoT) applications from fleet management and security surveillance to inventory management predicted by Forrester to become a multi-trillion dollar market in the next 15 years [1], the need for effective outlier detection technology is abundant.
This has led to a significant surge in the development of outlier detection techniques [3] over the past decade. They include fitting data to a statistical distribution and highlighting values far from the mean or median [10,11], clustering data and identifying values outside common clusters [33], and discovering objects far from their neighbors [18,67]. While previous research has resulted in a plethora of algorithms for detecting particular types of outliers, challenging problems remain that hinder these algorithms from being useful to practitioners in real applications.
State-of-the-art and Its Limitations. One critical challenge is how to choose the most effective solution from this stew of available techniques and tune their parameters [3]. Users face several problems. First, no single algorithm adequately captures outliers across diverse data sets and problem domains. Thus customized algorithms have been proposed targeting different settings [3]. An outlier detection method that works well on one data set might yield poor results on another. Selecting a method appropriate to the given task is challenging for domain scientists, requiring not only thorough domain understanding, familiarity with the data at hand, and knowledge of the most critical differences between outliers and inliers, but also a good understanding of the wealth of available outlier detection methods and their characteristics. This complexity is compounded by the fact that the data characteristics for which certain algorithms work well are often not known or properly documented.
Second, choosing the best outlier detection method is further complicated by the fact that many such methods are governed by a number of parameters. Without appropriately tuned parameter settings, detection algorithms tend to be ineffective at identifying outliers [21]. Although automatic parameter tuning methods have been proposed in Automated Machine Learning (AutoML) for supervised classification [25,32,61], these techniques are not adequate for solving the parameter tuning problem in the context of outlier detection. Outliers are rare events, making it hard to manually acquire a sufficient number of high-quality outlier examples (labels) required for supervised learning. This is one reason why outlier detection techniques tend to be unsupervised [3]. Unfortunately, labels are required by the state-of-the-art AutoML methods [25,32,61] for automatic cross-validation. This renders AutoML ineffective at tuning these unsupervised outlier detection methods.
Proposed Approach. To solve the above problems, we propose an automatic outlier detection approach (AutoOD). AutoOD is not a new outlier detection algorithm; instead, it is a tuning-free approach that aims to make the best use of existing outlier detection algorithms without requiring human labeling input. Our key intuition is that selecting one model from many alternative unsupervised anomaly detection models may not always work well. Instead, AutoOD targets combining the best of them.
The AutoOD Strategy. AutoOD uses a fundamentally new strategy that unifies the merits of unsupervised outlier detection techniques and supervised classification models. Unsupervised outlier detection does not require labeled data, but the accuracy of unsupervised techniques is often low due to the lack of supervision with domain knowledge [16,19]. Compared to unsupervised techniques, supervised classification tends to achieve better accuracy, as long as a sufficient number of high-quality labels are available [3].
Instead of first carefully selecting an appropriate outlier detection method for a given task and then tuning its parameters, AutoOD turns the unsupervised problem into a supervised problem. Specifically, it uses the results from many unsupervised outlier detectors in combination to automatically infer high-quality labels by discovering objects in the input data that can reliably be identified as outliers or inliers. Using these automatically generated labels, AutoOD then trains a supervised classification model that makes inferences on the remaining objects to produce the final outlier detection results.
In this way, AutoOD leverages supervised classification to achieve high accuracy in outlier detection, while no longer having to rely on domain experts to manually supply labels.
Methods to Automatically Produce Labels. The number and quality of the automatically produced labels are critical to AutoOD's effectiveness. To this end, we design two complementary methods, AutoOD-Augment and AutoOD-Clean.
AutoOD-Augment starts by discovering a small but reliable set of labels based on the strong consensus of all unsupervised outlier detectors. Next, leveraging the outlierness scores these detectors assign to each data object, AutoOD-Augment forms a feature space where each attribute represents one detector. It uses the set of reliable labels acquired so far and the outlierness-score features to learn a machine learning model that predicts the performance of each detector on the data at hand. Leveraging this model, AutoOD-Augment iteratively prunes 'bad' detectors. AutoOD-Augment then continues to produce more labels based on the agreement among the predictions of the remaining 'good' outlier detectors. This way, the label set is progressively augmented.
In contrast, AutoOD-Clean starts with a large but noisy set of labels and uses it to train a neural network. This set of labels could be formed by, for example, ensembling the results of all detectors. Leveraging the observation that, in the early epochs of training a deep neural network, the training loss on "mislabeled" objects tends to be larger than that on "correctly labeled" objects [62], AutoOD-Clean iteratively purifies the training data through its learning process, removing objects with large early loss from the training data. As proven in Sec. 5.2, a deep neural network enhanced with this proposed self-cleaning strategy is guaranteed to converge. Further, our experimental study demonstrates that the resulting classification model learned by AutoOD-Clean achieves high accuracy. AutoOD-Clean complements AutoOD-Augment, especially when it is hard to get initial reliable labels.
Experimental Results. We demonstrate the effectiveness of AutoOD using a variety of benchmark outlier detection data sets [16,60]. In particular, as we show in Sec. 6.2, AutoOD consistently detects outliers with an accuracy higher than the best outlier detector among hundreds. Of note, AutoOD is able to do this without requiring any manual tuning or human input in the form of pre-determined labels. Further, AutoOD significantly outperforms ensemble-based methods [55,58] and other tuning-free approaches by 12 to 97 points (out of 100) in F-1 score.
Contributions. In summary, the key contributions of this work include:
• We propose AutoOD, which uses a set of unsupervised outlier detectors to automatically produce high-quality outlier detection results, requiring no human input or ground-truth labels.
• AutoOD unifies the merits of unsupervised outlier detection and supervised classification, achieving high accuracy in detecting outliers while requiring no human input or ground-truth labels.
• We propose two complementary solutions, AutoOD-Augment and AutoOD-Clean, to realize the AutoOD framework, making AutoOD highly effective and robust in a variety of scenarios. Our theoretical analysis shows that AutoOD-Augment and AutoOD-Clean are guaranteed to converge to a set of high-quality labels.
• Our experiments on benchmark outlier detection datasets show that AutoOD consistently outperforms the best outlier detector among all candidate detectors. In fact, it achieves an accuracy comparable to supervised outlier classifiers trained with ground truth labels, without having access to, and thus the benefit of, such ground truth knowledge.

PRELIMINARIES
Below, we briefly overview popular outlier detection techniques [3], including statistical-based outlier detection [3], distance-based outlier detection [35,54], density-based outlier detection [15,48], and Isolation Forest [44]. AutoOD supports these techniques as built-in libraries, although other techniques could also simply be plugged in.
Statistical-based Outlier Detection. Statistical-based methods detect outliers by discovering extreme values. In particular, the Mahalanobis method [3] models the entire dataset as normally distributed around its mean in the form of a multivariate Gaussian distribution. Let $\bar{\mu}$ be the $d$-dimensional mean vector of a $d$-dimensional dataset, and $\Sigma$ be its $d \times d$ covariance matrix, where the $(i,j)$th entry of the covariance matrix equals the covariance between dimensions $i$ and $j$. Then the Mahalanobis distance from a $d$-dimensional data object $\bar{X}$ to this distribution is defined as:
$$Maha(\bar{X}) = \sqrt{(\bar{X} - \bar{\mu})\,\Sigma^{-1}\,(\bar{X} - \bar{\mu})^{T}}.$$
The Mahalanobis distance is used as the outlier score: the larger the Mahalanobis distance, the more the object deviates from the data set distribution and thus the more likely it is an outlier.
Distance-based Outlier Detection. Distance-based outlier detection computes outlier scores on the basis of nearest-neighbor distances. Among many variations [9,36,54], the kNN outlier [54] is very popular. Let $D^{k}(\bar{X})$ denote the distance of object $\bar{X}$ to its $k$-th nearest neighbor. The kNN outlier ranks objects by their $D^{k}(\bar{X})$ distance, and the top $r$ objects in this ranking are considered to be outliers. These objects have fewer points close to them and are thus intuitively stronger outliers.
Density-based Outlier Detection. Density-based approaches consider ratios between the local density around an object and the local densities around its neighboring objects. These approaches introduce the notion of local outliers, as opposed to the global outliers discovered by distance-based techniques. The concept of a local outlier is important since, in many applications, different portions of a dataset can exhibit very different characteristics. It is thus meaningful to decide on the outlying possibility of an object based on the other objects in its neighborhood. The most popular density-based approach [15], called LOF, assigns a local outlier factor (LOF) to each object of the dataset denoting its degree of outlierness. In general, the LOF of an object p corresponds to the average ratio between the local density of p and those of p's k-nearest neighbors. Intuitively, p's local outlier factor will be very high if its local density is much lower than those of its neighbors.
Isolation Forest. An isolation forest is an ensemble of isolation trees [44]. In an isolation tree, the data is recursively partitioned with axis-parallel cuts at randomly chosen partition points within randomly selected attributes, aiming to isolate the instances into nodes with fewer and fewer instances until each object is isolated in a singleton node. The tree branches containing outliers are noticeably less deep, because these objects are located in sparse regions. Thus the distance of an object's leaf to the root is used as its outlier score. The final outlier score is computed by averaging the path lengths of the object across the different trees of the isolation forest.
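To make the detector pool concrete, the sketch below shows how these four method families might be instantiated and run together, here using scikit-learn as a stand-in for AutoOD's built-in library. The classes and parameter grids are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a detector pool built from the four method families
# above, using scikit-learn. Parameter grids here are illustrative only.
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

def detector_scores(X):
    """Return one outlierness-score vector per detector (larger = more outlying)."""
    scores = []
    # Statistical: (squared) Mahalanobis distance to the fitted Gaussian.
    cov = EmpiricalCovariance().fit(X)
    scores.append(cov.mahalanobis(X))
    # Distance-based: kNN score = distance to the k-th nearest neighbor
    # (k + 1 neighbors requested because each point is its own nearest neighbor).
    for k in (5, 20, 50):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        dist, _ = nn.kneighbors(X)
        scores.append(dist[:, -1])
    # Density-based: LOF (negated so that larger means more outlying).
    for k in (5, 20, 50):
        lof = LocalOutlierFactor(n_neighbors=k).fit(X)
        scores.append(-lof.negative_outlier_factor_)
    # Isolation Forest (score_samples is negated: larger = more anomalous).
    for max_features in (0.4, 0.8, 1.0):
        iso = IsolationForest(max_features=max_features, random_state=0).fit(X)
        scores.append(-iso.score_samples(X))
    return np.vstack(scores)   # shape: (n_detectors, n_objects)
```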

OVERVIEW OF THE AUTOOD FRAMEWORK
In this section, we introduce the overall design of AutoOD.
[Figure 1: The AutoOD framework. Step 1: unsupervised outlier detection with a set of unsupervised detectors; Step 2: automatic reliable object discovery, splitting the data into training (reliable) and testing (unsure) objects; Step 3: outlier classification with a classification technique.]

Fundamentals Underlying AutoOD
AutoOD Strategy. AutoOD uses a fundamentally new strategy to solve the method selection and parameter tuning problem in unsupervised outlier detection. This strategy unifies the merits of unsupervised outlier detection and supervised classification. Namely, unsupervised methods do not rely on a human expert to supply labels about ground truth anomalies, which are often hard to come by. However, the accuracy of unsupervised outlier detection is often low [16] due to the lack of human supervision. On the other hand, supervised classification techniques are able to achieve higher accuracy by training a binary classifier that classifies objects into outliers or inliers. However, they rely on the availability of a sufficient number of high-quality labels for training [21]. Instead of selecting an appropriate outlier detection method and then tuning its parameters to get good results, AutoOD uses a large set of diverse unsupervised outlier detectors as labeling sources to automatically produce high-quality labels, where an outlier detector corresponds to one specific unsupervised outlier detection method instantiated with a particular parameter setting. We then use these automatically produced labels to train a classification model that finally classifies each object as either an outlier or an inlier. As we will demonstrate in Sec. 6.2, this allows AutoOD to achieve high accuracy in detecting outliers without having to rely on human experts to supply high-quality labels. AutoOD thus captures the benefits of both unsupervised and supervised outlier detection.
AutoOD Intuition. Without any manually supplied labels indicating ground truth, it appears extremely difficult to automatically discover the best unsupervised outlier detector among many alternatives. Even if it were possible to identify the best one, it would not be guaranteed to detect outliers with high accuracy. In fact, among all detectors there is at times no single clear winner that dominates all the others. Given a diverse set of unsupervised detectors, each detector might discover some anomalies which other detectors would miss.
However, we observe that there typically tend to be some objects in the data that are clear outliers or clear inliers. Clear outliers, for example, often correspond to objects that are well isolated from other objects, while clear inliers often correspond to objects residing deep inside dense data clusters. Our intuition is that it is much easier to automatically identify such clear outliers or clear inliers than to identify the best overall detection method itself. Using these objects as reliable labels that characterize key differences between outliers and inliers, AutoOD can thereby fully exploit the generalization ability of supervised machine learning [30] to learn a classification boundary that effectively infers the status of the remaining unsure objects.

Components of the AutoOD Framework
(1) Unsupervised Outlier Detection. Given an input data set, AutoOD first uses a set of unsupervised outlier detectors to detect outliers. Each detector corresponds to one outlier detection method in the built-in AutoOD library instantiated with a particular configuration of parameter values. For simplicity and ease of use, for each detection method, AutoOD uniformly picks some parameter configurations from a reasonable parameter range recommended by [16].
(2) Automatic Reliable Object Discovery. Next, based on these unsupervised detection results, AutoOD divides the input data into two subsets, "reliable objects" and "unsure objects". The reliable objects are those which AutoOD is confident are clearly inliers or outliers based on the detection results produced by the detectors.
(3) Outlier Classification. Finally, the automatically discovered reliable objects are used as training data for a binary outlier classification model. This model then produces predictions for the "unsure" objects whose labels remain unknown. This way, AutoOD eventually assigns labels to all objects.
AutoOD-Augment and AutoOD-Clean. Clearly, the effectiveness of AutoOD relies on the number and quality of the reliable objects discovered in the second step and used as labeled training data thereafter. In this work, we design two approaches to discover these reliable objects that complement each other. First, we propose an augmentation-based method, called AutoOD-Augment (Sec. 4). AutoOD-Augment starts by automatically discovering a small but reliable set of objects to label and keeps augmenting this set iteratively. Second, we propose a cleaning-based method, called AutoOD-Clean (Sec. 5). As opposed to AutoOD-Augment, AutoOD-Clean starts with a large set of noisy labels and keeps cleaning this set into an increasingly reliable one. A sketch of the overall pipeline appears below; the two approaches are introduced in detail in the next two sections.
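The following sketch outlines the three-step pipeline just described. The helper functions `detector_scores`, `discover_reliable_objects`, and the classifier choice are placeholders for the components detailed in Secs. 4 and 5, not AutoOD's actual interfaces.

```python
# A high-level sketch of the three-step AutoOD pipeline.
import numpy as np

def autood(X, detector_scores, discover_reliable_objects, classifier):
    # Step 1: run every configured unsupervised detector on the data.
    S = detector_scores(X)                        # (n_detectors, n_objects)
    # Step 2: split objects into reliable (pseudo-labeled) and unsure ones.
    reliable_idx, pseudo_labels = discover_reliable_objects(X, S)
    unsure_idx = np.setdiff1d(np.arange(len(X)), reliable_idx)
    # Step 3: train a binary classifier on the reliable objects and let it
    # label the unsure ones.
    classifier.fit(X[reliable_idx], pseudo_labels)
    labels = np.empty(len(X), dtype=int)
    labels[reliable_idx] = pseudo_labels
    labels[unsure_idx] = classifier.predict(X[unsure_idx])
    return labels
```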

AUGMENTATION-BASED RELIABLE OBJECT DISCOVERY
AutoOD-Augment starts with discovering a small set of reliable objects and then iteratively augments this set until no new reliable objects can be found.

The Overview of AutoOD-Augment
Next, we overview the AutoOD-Augment approach that includes three key components, namely initial reliable object discovery, learning-based poor detector pruning, and reliable object set update.
(1) Initial Reliable Object Discovery. AutoOD-Augment identifies an initial label set of stable outliers/inliers using the strategy described below. In AutoOD, an object's label status is considered reliably decidable if all unsupervised outlier detectors agree on its (outlier/inlier) label status. We call these the stable objects. The stable objects typically correspond to the clear inliers and outliers in the data. The intuition is that although different outlier detection methods use distinct techniques to detect outliers, they are based on the same principle: an object is an outlier if it deviates significantly from the other observations [31]. Therefore, they all tend to be good at capturing the clear inliers that reside deep inside large data clusters and the clear outliers that are far away from any other objects.
(2) Learning-based Poor Detector Pruning. Second, treating the stable outliers/inliers as reliable objects and hence as ground truth labels, AutoOD trains a machine learning model to estimate the performance of the unsupervised detectors. Progressively pruning the bad detectors enables AutoOD to collect more and more stable inliers and outliers, thus gradually discovering more reliable objects. Note that although AutoOD estimates the performance of each detector, the ultimate goal is not to find the best detector, but rather to automatically discover more reliable objects.
(3) Reliable Object Set Update. AutoOD-Augment leverages the concept of multi-view analysis [26] to update the reliable objects. That is, AutoOD learns multiple distinct outlier classification models that are trained on the same set of labels but use different sets of features. The features we can leverage here include not only the attributes of the data itself, but also the intermediate results produced by the outlier detectors. The intuition is that if the classification models learned from different views of the data agree with each other on the prediction for some objects, these objects are likely reliable. On the other hand, objects on which the classification models have conflicting predictions are removed from the set of reliable objects to purify the automatically produced labels.
Alg. 1 (AutoOD-Augment) illustrates the overall process. First, $n$ different outlier detectors $d_1, d_2, \ldots, d_n$ generate outlierness scores $s_1, s_2, \ldots, s_n$ for each object $p_i$. Based on these outlierness scores, AutoOD-Augment puts together its initial set of reliable objects $D_r$ (Line 3, Alg. 1). Our learning-based detector pruning strategy is then applied to discover and prune the poor detectors (Line 6, Alg. 1). Based on the remaining outlier detectors, AutoOD-Augment leverages multi-view outlier classification to update the reliable objects $D_r$ (Line 7, Alg. 1). The new $D_r$ is used in the next iteration as training data to further prune the poor detectors. In this process, $D_r$ gets larger and more accurate. AutoOD-Augment terminates when $D_r$ does not change any more (Lines 8-9, Alg. 1), returning the surviving detector ids and $D_r$.
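A minimal sketch of the Alg. 1 control flow follows; `initial_reliable`, `prune_poor_detectors`, and `update_reliable_objects` are placeholders for the components described in Secs. 4.2-4.4.

```python
# Sketch of the AutoOD-Augment loop (Alg. 1), under the assumption that the
# reliable set D_r is represented as a set of (object id, label) pairs.
import numpy as np

def autood_augment(X, S, initial_reliable, prune_poor_detectors,
                   update_reliable_objects):
    keep = np.arange(S.shape[0])                # ids of surviving detectors
    reliable = initial_reliable(S)              # initial D_r from consensus (Line 3)
    while True:
        mask = prune_poor_detectors(S[keep], reliable)               # Line 6
        keep = keep[mask]
        new_reliable = update_reliable_objects(X, S[keep], reliable)  # Line 7
        if new_reliable == reliable:            # Lines 8-9: D_r is stable, stop
            return keep, reliable               # detector Ids and D_r
        reliable = new_reliable
```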
Next, we introduce the three key components of AutoOD-Augment in more detail, namely initial reliable object discovery, learning-based poor detector pruning, and reliable object set update.

Initial Reliable Object Discovery
AutoOD-Augment starts by identifying stable inliers and stable outliers as the initial set of what we call "reliable objects". Because outliers typically correspond to only a very small fraction of the data set, in some cases AutoOD may not be able to identify stable outliers in this initial set, especially when handling small datasets. To solve this issue, we relax the strict requirement on stable outliers when necessary. The first relaxation is: if all detectors that use the same detection method $M_i$ but different parameter settings $pt_1, \ldots, pt_m$ mark an object as an outlier, then we consider this object to be a stable outlier. For example, suppose AutoOD uses two outlier detection methods, LOF [15] and Isolation Forest [44], and there is no stable outlier under the strict requirement. In this case, an object will be considered a stable outlier if all detectors using LOF believe it is an outlier. While this relaxation (or others we may explore in the future) may introduce errors into the set of reliable objects, AutoOD-Augment is designed to fix these errors using the iterative learning process described in Sec. 4.4.
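The sketch below illustrates this consensus step with the per-method relaxation; the boolean-vote representation is an assumption for illustration.

```python
# Sketch of initial stable-label discovery: strict all-detector consensus,
# relaxed to per-method unanimity when no strict stable outlier exists.
import numpy as np

def initial_stable_labels(votes, method_of):
    """votes: boolean (n_detectors, n_objects) outlier votes;
    method_of[i]: which detection method detector i belongs to."""
    stable_inlier = ~votes.any(axis=0)          # no detector flags the object
    stable_outlier = votes.all(axis=0)          # every detector flags the object
    if not stable_outlier.any():
        # Relaxation: unanimous agreement within any single method suffices.
        for m in set(method_of):
            rows = [i for i, mm in enumerate(method_of) if mm == m]
            stable_outlier |= votes[rows].all(axis=0)
    return stable_inlier, stable_outlier
```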

[Figure 2: The outlierness score matrix: each row holds the scores one detector assigns to all objects and serves as one input feature for the logistic regression model.]

Learning-based Poor Detector Pruning
One principle of AutoOD-Augment is to tackle the problem of pruning poor detectors using machine learning, specifically logistic regression. Leveraging the unique properties of the logistic regression model in conjunction with unsupervised outlier detection, AutoOD-Augment effectively discovers poor detectors with a theoretical guarantee.
Logistic regression (LR) [34] is a classical binary classification model that predicts the probabilities of possible outputs. Consider a single input observation $x$, represented by a vector of features $[x_1, x_2, \ldots, x_n]$. The classifier consuming $x$ outputs an outcome $y \in \{0, 1\}$, with 1 meaning the observation is a member of the class $C$ and 0 meaning it is not. We want to know the probability $P(y=1|x)$ that the observation $x$ is a member of the class $C$. In the outlier detection case, $P(y=1|x)$ represents the probability that $x$ is an outlier, while $P(y=0|x)$ represents the probability that $x$ is an inlier.
Given a training set of objects with class labels, logistic regression (LR) solves this classification problem by learning a vector of weights $w = [w_1, w_2, \ldots, w_n]$ and a bias term $b$. Each weight $w_i$ is a real number associated with one of the input features $x_i$. Formally, the LR model is defined as:
$$z = w \cdot x + b,$$
where $w$ denotes the weight vector and $b$ the bias term. Solving for $P(y=1|x)$ via the sigmoid function, this gives:
$$P(y=1|x; b, w) = \frac{1}{1 + e^{-(w \cdot x + b)}}. \qquad (3)$$
The weights and bias of the LR model are estimated from the training data. After we have learned the weights through the training process, the probabilities for each class can be computed by Eq. 3. If $P(y=1|x; b, w) \gg 0.5$, the object is classified as an outlier. Alternatively, if $P(y=1|x; b, w)$ is close to 0, the object is an inlier. On the other hand, if $P(y=1|x; b, w)$ is around 0.5, the LR model is not certain enough about the status of the given object.
Outlier Detector Pruning With Logistic Regression. The outlier detector $d_i$ produces an outlierness score $s_{i,j}$ for each object $p_j$ in the dataset. The larger $s_{i,j}$ is, the more strongly $d_i$ believes $p_j$ to be an outlier. This outlierness score can be treated as an additional derived feature $f'_i$ of object $p_j$ produced by detector $d_i$. Thus, in total, $n$ outlier detectors produce $n$ such derived features $\{f'_1, f'_2, \ldots, f'_n\}$. As depicted in Fig. 2, the outlierness scores produced by $n$ detectors for $m$ objects form an $n \times m$ matrix, whose element $s_{i,j}$ represents the outlierness score that detector $d_i$ assigns to object $p_j$. Correspondingly, the $i$th row corresponds to the feature $f'_i$ produced by detector $d_i$ across all objects, while the $j$th column collects the features produced for object $p_j$ by all the detectors.
As a preprocessing step, AutoOD normalizes the outlierness scores produced by different detectors into the same range to allow for ease of comparison. In the implementation, AutoOD uses RobustScaler from scikit-learn [50], which is known to be robust to outliers. AutoOD then uses the derived features $\{f'_1, f'_2, \ldots, f'_n\}$ produced by the $n$ detectors and the set of stable labels produced in the first step to train an LR model.
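A minimal sketch of this step follows, assuming the score matrix $S$ and stable-label representation from the earlier sketches.

```python
# Sketch: normalize each detector's score vector with RobustScaler, then
# fit logistic regression on the stable labels only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler

def fit_score_model(S, stable_idx, stable_labels):
    """S: (n_detectors, n_objects) outlierness scores."""
    # Scale each derived feature f'_i (one detector's scores) independently.
    F = RobustScaler().fit_transform(S.T)       # (n_objects, n_detectors)
    lr = LogisticRegression().fit(F[stable_idx], stable_labels)
    return lr, F
```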
As discussed above, LR assigns a weight $w_i$ to each feature $f'_i$ corresponding to one detector $d_i$. Intuitively, the weight $w_i$ represents how important input feature $f'_i$ is to the classification decision [30]. It can be positive or negative, meaning that the feature is (or is not) associated with the outlier class, respectively. Leveraging this insight, we define the pruning rule in Def. 4.1, which prunes detectors based on the importance of their evidence to the LR model's classification decision.
By the pruning rule, if there are negative weights, AutoOD prunes the corresponding detectors immediately. Otherwise, AutoOD prunes the detectors whose weights are at least one standard deviation smaller than the average weight over all detectors. Our experiments in Sec. 6.6 confirm that this is effective in pruning the poor detectors.
Theoretical Guarantee. In Lemma 4.1 we formally show that the detectors discarded by the pruning rule are guaranteed not to perform better than the remaining detectors under the comparison criteria defined in Def. 4.2. The argument constrains $\|w\|_2^2$ to a constant, applies the Lagrangian multiplier method, and sets the derivative of the resulting objective with respect to $w$ and the multiplier to zero to characterize the optimal weights.
At testing phase, the classifier trained on the stable labels takes an object as input and, based on the similarity of its attributes to objects in the stable set, emits a confidence that the object is an outlier (or not). While any binary classification model could be plugged in, our implementation uses a Support Vector Machine (SVM) as this classifier, because it does not require careful hyper-parameter tuning and can handle high-dimensional data.
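The pruning rule above translates directly into a few lines of code; this sketch assumes a fitted scikit-learn `LogisticRegression` with one weight per detector feature.

```python
# Sketch of the pruning rule: drop detectors with negative LR weights;
# otherwise drop those more than one standard deviation below the mean weight.
import numpy as np

def prune_detectors(lr):
    """Return a boolean keep-mask over detectors based on the LR weights."""
    w = lr.coef_.ravel()                        # one weight per detector feature
    if (w < 0).any():
        return w >= 0                           # drop negatively weighted detectors
    return w >= w.mean() - w.std()              # drop weights > 1 std below average
```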
The outlier-score model and the data-feature model are used to infer two different sets of outlier-class probabilities for all objects in the entire input dataset (Lines 4-5). Thereafter, AutoOD-Augment iterates over the process below, using the prediction results of the outlier-score and data-feature models as well as the remaining high-quality detectors to discover the reliable objects. The intuition here is that the two models and the outlier detectors produce predictions for each object from different views of the data. If they are consistent in predicting the outlier status of an object, the label of this object is considered reliable.
In the update process, AutoOD-Augment first discovers the new stable outliers/inliers (Line 9). After pruning some poor detectors, the numbers of stable inliers and outliers naturally increase. Then, from the prediction results of the two classification models, AutoOD-Augment identifies the confident inliers (Line 6) and outliers (Line 7). An object is a confident inlier or outlier if both classification models are very confident about their prediction of its status.
DEFINITION 4.3 (Confident Inliers/Outliers). Given an object $p$ and a threshold $t$ $(0 < t < 1)$, if the classification model trained on the outlierness scores and the classification model trained on the raw features of the data both predict $p$ to be an inlier or an outlier with a probability higher than $t$, then $p$ is a confident inlier or outlier.
By default, AutoOD sets the threshold $t$ in Def. 4.3 to 0.99. This corresponds to a very strict criterion for finding confident outliers and inliers, ensuring their reliability. Our experiments show this value works well in all cases, suggesting this is not a parameter that needs to be tuned.
As discussed in Sec. 4.3, the logistic regression model trained on the outlierness-score features uses Equation 3 to assign each object a probability. It corresponds to the confidence of the model in its prediction of the object's class label. Naturally, we can use this probability to determine whether the object is likely an outlier (or an inlier). That is, an object is potentially a confident outlier/inlier if and only if $P(y=1|x; b, w) > t$ or $P(y=1|x; b, w) < 1 - t$, respectively.
However, standard SVMs trained on the raw features of the data do not produce a probability for each object representing how confident the SVM is in its prediction of the object's class. In this work, we use the well-known Platt scaling [49,53] method to calibrate the probabilities. Platt scaling trains the parameters of an additional sigmoid function to map the SVM outputs into probabilities.
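The following sketch shows the confident-object test of Def. 4.3 combining both models; it assumes labels in $\{0, 1\}$ (1 = outlier) and uses scikit-learn's `SVC(probability=True)`, which enables its built-in Platt scaling.

```python
# Sketch of the Def. 4.3 test with t = 0.99: an object is confident only if
# both the score-feature model and the raw-feature model strongly agree.
import numpy as np
from sklearn.svm import SVC

def confident_objects(lr, F, X, train_idx, train_labels, t=0.99):
    p_score = lr.predict_proba(F)[:, 1]         # outlier prob., score-feature model
    # probability=True turns on scikit-learn's built-in Platt scaling.
    svm = SVC(probability=True).fit(X[train_idx], train_labels)
    p_feat = svm.predict_proba(X)[:, 1]         # outlier prob., raw-feature model
    conf_outlier = (p_score > t) & (p_feat > t)
    conf_inlier = (p_score < 1 - t) & (p_feat < 1 - t)
    return conf_inlier, conf_outlier
```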
AutoOD-Augment then unions the current stable and confident outliers/inliers into two refreshed sets of reliable outliers and inliers (Lines 10-11). Moreover, AutoOD-Augment removes objects from the reliable object set if the two classification models have conflicting predictions on them (Lines 12-13). This further purifies the set of reliable objects.
Termination Condition. AutoOD-Augment proceeds until the set of reliable objects does not change (Line 9 in Alg. 1). Next, we intuitively show that this process is guaranteed to converge. First, the detector pruning must stop, in the worst case when only one detector remains. The stable outliers and inliers then no longer change. At that point, the update of the reliable object set is driven only by the predictions of the two classification models. The diminishing update of the set of reliable objects in turn gradually stabilizes these two models trained on it. This eventually leads to the stabilization of the set of confident inliers/outliers and the termination of AutoOD-Augment. At termination, the two models and the remaining unsupervised outlier detectors highly agree with each other on the predictions of the current reliable objects; therefore, these objects tend to be indeed reliable.

AUTOOD-CLEAN: CLEANING-BASED RELIABLE OBJECT DISCOVERY
AutoOD-Clean trains a neural network to combine the predictions of the unsupervised detectors into a set of reliable objects. We first describe an important observation about the training process of a deep neural network which we leverage, and then introduce the details of the AutoOD-Clean approach.
Key Observation. Fig. 3 shows the evolution of model accuracy on "mislabeled" and "correctly labeled" objects as a function of training epochs when training a neural network model on a noisy training dataset [62]. We consider the common setup where training proceeds in epochs. We then inspect the evolution of the accuracy of the model on the training objects. That is, after each epoch, we take the model at that stage and check whether or not it makes an error on each of the training objects. Here we assume we have access to the ground truth of this noisy training dataset. As shown in Fig. 3, the accuracy on "correctly labeled" objects is higher than on the "mislabeled" objects, especially in the initial epochs of training.
Leveraging the above observation, we are now ready to propose an approach to discover reliable training data. AutoOD-Clean starts with a large but noisy training dataset and iteratively cleans it. AutoOD-Clean is inspired by the theory of minimizing the trimmed loss [47]. Given a set of $n$ samples, standard model fitting chooses model parameters $\theta$ to minimize a loss function over all $n$ samples. In contrast, the trimmed loss estimator jointly chooses a subset of the samples and parameters $\theta$ such that the loss on the subset is minimized over all choices of subsets and parameters. While this objective is intractable in general, AutoOD-Clean can be considered an iterative methodology for minimizing this trimmed loss: it simultaneously generates a reliable training dataset and an effective neural network model as it iteratively minimizes the trimmed loss.
Trimmed Loss. Let $D = \{p_1, \ldots, p_n\}$ be the set of training objects, $\theta$ be the model parameters to be learned, and $f_\theta(\cdot)$ be the loss function. With this setting, the standard approach is to minimize the total loss over all objects, i.e., $\min_\theta \sum_i f_\theta(p_i)$. In contrast, the least trimmed loss (TL) estimator is given by:
$$\theta^{(TL)} = \arg\min_{\theta} \; \min_{D' \subseteq D,\, |D'| = \lfloor \alpha n \rfloor} \; \sum_{p_i \in D'} f_\theta(p_i).$$
To find $\theta^{(TL)}$, we need to minimize over both the subset $D'$ of size $\lfloor \alpha n \rfloor$, where $\alpha \in (0, 1)$ is the fraction of objects we want to fit, and the set of parameters $\theta$. In general, solving for the least trimmed loss estimator is hard even in the linear regression setting [47], i.e., even when $p = (x, y)$ and $f_\theta(x, y) = (y - \theta^T x)^2$.
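For fixed parameters $\theta$, the inner minimization over subsets has a closed form: keep the $\lfloor \alpha n \rfloor$ smallest per-object losses. A minimal sketch:

```python
# Sketch of the least trimmed loss for fixed theta: the best subset of size
# floor(alpha * n) is simply the objects with the smallest losses.
import numpy as np

def trimmed_loss(losses, alpha):
    """losses: per-object loss values f_theta(p_i); alpha in (0, 1)."""
    k = int(np.floor(alpha * len(losses)))
    return np.sort(losses)[:k].sum()            # sum over the best-fitting subset
```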

AutoOD-Clean: Iterative Trimmed Loss Minimization
As described in Alg. 3, AutoOD-Clean proceeds in three steps.
(1) Initial Training Data Generation. As shown in Alg. 3 (Line 2), AutoOD-Clean starts by using all data objects as training data. The labels of these objects can be produced by ensembling the results of multiple detectors or by using the results of any single detector.
(2) Modeling. During the modeling process, we train a neural network using the current training data (Lines 5-7, Alg. 3). Specifically, we first train the neural network for just a few epochs and keep track of the loss for each training object (Line 5, Alg. 3). Given a training object, its loss is measured as the difference between the prediction and its label using a typical loss function such as cross entropy [29].
AutoOD-Clean uses the loss for each object to prune mislabeled objects. That is, it leverages the observation that mislabeled objects tend to have higher loss than correctly labeled objects, especially during the early epochs [62]. This stage of the process does not require the model to have converged; thus a handful of initial epochs is sufficient. In our experiments, we use a small network with 3 hidden layers for all datasets, and thus set the number of epochs to 3. For a larger network, we would set the number of epochs somewhat larger, say 5 or 10.
After recording the early loss of each training object, i.e., the average training loss it incurred in the early epochs, AutoOD-Clean continues to train the network until it converges (Line 6 in Alg. 3). AutoOD-Clean then uses the converged model to make inferences on the entire dataset. In the outlier detection setting, the output is a probability that can be interpreted as the confidence of the current model in the object being an outlier or an inlier. Both the early losses and the confidence scores are used in the next step to update the training data (Lines 8-9, Alg. 3).
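A PyTorch-style sketch of the early-loss bookkeeping is shown below; the full-batch training loop and the 3-epoch window are illustrative assumptions.

```python
# Sketch: average each object's cross-entropy loss over the first few epochs
# while training; these per-object early losses feed the cleaning rule.
import torch
import torch.nn.functional as F

def record_early_losses(net, opt, X, y, early_epochs=3):
    early = torch.zeros(len(X))
    for _ in range(early_epochs):
        per_obj = F.cross_entropy(net(X), y, reduction="none")  # loss per object
        early += per_obj.detach()
        opt.zero_grad()
        per_obj.mean().backward()
        opt.step()
    return early / early_epochs                 # average early loss per object
```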
(3) Training Data Update. First, AutoOD-Clean removes the training points with large early losses from the training data using the cleaning rule described below; a sketch of this rule follows this paragraph. Alg. 4 shows the process of applying the cleaning rule to prune objects. Suppose an object is temporarily labeled as an outlier (Line 2, Alg. 4). AutoOD-Clean removes it if its loss is one standard deviation larger than the mean loss over the outlier labels. Analogously, AutoOD-Clean removes an inlier if its loss is one standard deviation larger than the mean loss over the inlier labels. Note that we use different criteria to prune mislabeled inliers and outliers, because outliers and inliers tend to show different patterns in their early losses due to their distinct data characteristics. Our empirical study shows that the early loss of outlier labels is on average larger than that of inlier labels.
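```python
# Sketch of the cleaning rule (Alg. 4): drop a training object whose early
# loss exceeds the mean early loss of its own label class by one std.
import numpy as np

def apply_cleaning_rule(early_loss, labels):
    keep = np.ones(len(labels), dtype=bool)
    for cls in (0, 1):                          # 0 = inlier label, 1 = outlier label
        mask = labels == cls
        cutoff = early_loss[mask].mean() + early_loss[mask].std()
        keep[mask] = early_loss[mask] <= cutoff
    return keep
```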
Fixing Erroneously Pruned Objects. AutoOD-Clean might erroneously remove some training objects due to prediction errors of the network. This may arise in particular during early training stages, when the neural network is trained on noisy training data and thus may have low accuracy. Because the training data continuously improves in this iterative process, the accuracy of the neural network model is also expected to increase over time. This provides us with opportunities to fix earlier mistakes by adding high-confidence objects back into the training data.
Next, we describe the process of adding objects with high prediction confidence back into the training data set. We train the neural network until it fully converges and then make predictions on all input data; that is, we apply the latest fully converged neural network to classify the data into outliers and inliers. If an object is not in the current training data but is classified as an outlier or inlier with a confidence higher than a threshold $t$, we add it to the training data. On the other hand, if the label of a current training object differs from the inference output, it is removed from the training data even if its loss is not large. The threshold $t$ is by default set to 0.99, corresponding to a very strict requirement on the confidence level. Our experiments show it works well in all cases.
The Termination Condition. After updating the training data, AutoOD-Clean iteratively repeats this process; that is, it re-trains the neural network in the next iteration using the new training data. This continues until the number of removed objects is smaller than the number of added objects (Lines 10-11, Alg. 3). Next, we show that with this termination condition, AutoOD-Clean is guaranteed to converge to a reliable training data set.

Convergence Analysis
We show the convergence of AutoOD-Clean by proving that AutoOD-Clean is able to recover the ground truth labels with a linear convergence rate. Here we use a generalized linear model to represent the neural network model which AutoOD-Clean uses. This is a common practice [62], because the problem of analyzing a general least trimmed loss estimator is intractable.
We analyze AutoOD-Clean with errors in the labels. We represent the training objects in the form of $(x, y)$ such that:
$$y = g\big(\phi(x)^\top \theta^*\big) + \epsilon.$$
Here $x$ represents the features of a training object and $y$ denotes the output, the embedding function $\phi$ and link function $g$ are known, $\epsilon$ is random subgaussian noise with parameter $\sigma^2$ [65], and $\theta^*$ is the ground truth parameter. Let $\alpha^*$ be the fraction of correctly labeled objects in the training data.
We prove the convergence claim by using a one-iteration update lemma for the linear case [62].
LEMMA 5.1. Assume $g(x) = x$ and that we are using AutoOD-Clean. Then the following holds per iteration:
$$\|\theta_{t+1} - \theta^*\|_2 \le \frac{1}{\kappa}\,\|\theta_t - \theta^*\|_2 + c\,\sigma,$$
where $\kappa$ depends on the fractions of removed and added objects and $c\,\sigma$ collects the contribution of the subgaussian noise.
Note that in Lemma 5.1, $\kappa$ is always larger than 1, because AutoOD-Clean makes sure that the number of removed objects is always larger than the number of added objects. By Lemma 5.1, a $\kappa$ larger than 1 ensures that the error in the next step is bounded by a contraction of the error in the current step. AutoOD-Clean thus is guaranteed to converge.
PROOF. Let $\theta_t$ be the learned parameter at round $t$, and $\theta_{t+1}$ be the learned parameter in the next round, following Alg. 3. More specifically, a subset $S_t$ with the smallest losses $(y_i - \theta_t \cdot \phi(x_i))^2$ is selected, and $\theta_{t+1}$ is the minimizer of the loss on the selected set of samples. Denote by $W_t$ the diagonal matrix whose diagonal entry $W_{i,i}$ equals 1 when the $i$-th sample is in set $S_t$, and 0 otherwise. Then, assuming that we take infinite steps and reach the optimal solution, we have:
$$\theta_{t+1} = \big(\Phi(X)^\top W_t\, \Phi(X)\big)^{-1} \Phi(X)^\top W_t\, y,$$
where $\Phi(X)$ is an $n \times d$ matrix whose $i$-th row is $\phi(x_i)^\top$, and we have used the fact that $W_t^2 = W_t$. Recall that the feature matrix $\Phi(X)$ is defined in Equation 11. Since every row of $\Phi(X)$ follows an i.i.d. sub-Gaussian random vector, by using concentration of the spectral norm of Gaussian matrices and a uniform bound, $\Phi(X)$ is a regular feature matrix.
On the other hand, denote by $W^*$ the ground truth diagonal matrix for the samples, i.e., $W^*_{i,i} = 1$ if the $i$-th sample is a clean sample, and $W^*_{i,i} = 0$ otherwise. Accordingly, define $S^*$ as the ground truth set of clean samples. For clarity of presentation, we drop the subscript $t$ when there is no ambiguity. For bad samples, the output is written in the form $y_i = \tilde{y}_i + \epsilon_i$, where $\epsilon_i$ represents the observation noise and $\tilde{y}_i$ depends on the specific setting we consider. Under this general representation, we can re-write the term $\theta_{t+1}$ accordingly and bound the $\ell_2$ distance between the learned parameter and the ground truth parameter by the sum of three terms $T_1$, $T_2$, and $T_3$. For the first term in Equation 13, letting $|S_t \setminus S^*|$ be the number of bad samples in $S_t$, the relevant eigenvalue is bounded by $\lambda_+(|S_t \setminus S^*|)$. The term $T_2$ is bounded via the regularity of $\Phi(X)$, and the term $T_3$ is bounded using the sub-exponential concentration property, where the inequality holds with high probability and all randomness comes from the measurement noise $\epsilon$.
Then, as a summary, combining the bounds for all three terms yields the per-iteration contraction stated in Lemma 5.1. □

The Selection Between AutoOD-Augment and AutoOD-Clean
AutoOD-Augment and AutoOD-Clean work in two different ways: AutoOD-Augment starts with a small set of reliable labels and keeps augmenting it, while AutoOD-Clean starts with a large but noisy set of labels and keeps cleaning it. Although in general both methods work well, as shown in Fig. 4 (Sec. 6.2), they complement each other in applicability, concerning both the available hardware and the data set type. In particular, we prefer AutoOD-Clean when it is hard to acquire reliable labels initially. This, for example, is the case on the Pendigits dataset, where AutoOD-Clean significantly outperforms AutoOD-Augment. If the user does not have any GPU resources available, we instead recommend AutoOD-Augment, because AutoOD-Clean needs GPUs to train a neural network. In the current implementation, our AutoOD system starts with AutoOD-Augment; if AutoOD-Augment obtains only very few reliable labels at the beginning, it switches to AutoOD-Clean. A sketch of this policy follows.
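All names in this sketch (`initial_stable_labels_from`, `autood_augment_from`, `autood_clean`, `MIN_RELIABLE`) are hypothetical placeholders; the paper does not specify the exact threshold for "very few" reliable labels.

```python
# Sketch of the default selection policy between the two methods.
def run_autood(X, S):
    """Start with AutoOD-Augment; fall back to AutoOD-Clean when the initial
    consensus yields too few reliable labels."""
    reliable = initial_stable_labels_from(S)    # consensus step of Augment
    if len(reliable) < MIN_RELIABLE:            # 'very few' threshold (unspecified)
        return autood_clean(X, S)               # noisy-label cleaning path
    return autood_augment_from(X, S, reliable)
```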

EXPERIMENTAL EVALUATION
In the experiments, we evaluate the effectiveness and efficiency of AutoOD.We also analyze the quality of the automatically produced reliable labels which are critical to the performance of AutoOD.

Experimental Methodology and Setup
Datasets. We evaluate the effectiveness of AutoOD using 11 outlier detection benchmark datasets [16,60] with varying cardinality, numbers of dimensions, and proportions of outliers. The main characteristics of these datasets are summarized in Table 1. We configure all four built-in detection methods with a variety of parameter settings. First, all methods have a parameter $r$ that controls the number of outliers returned. In our experiments, we pick 6 $r$ values, so that the fraction of outliers each method returns falls into the range from 0.5% to 10%.
Next, we vary the parameters specific to each method. Both LOF and kNN have the number of neighbors $k$ as a parameter. We work with 10 different $k$ values, randomly picked from the range 1 to 100. For the Isolation Forest, we vary the number of features max_features used to train the base estimators of the forest: for all datasets, we vary max_features from 20% to 100% of the available features in 20% intervals. The Mahalanobis method does not have any method-specific parameter.
Compared Methods. We compare AutoOD against the following approaches:
• LODA [52] and LSCP [71]: state-of-the-art outlier ensemble methods. Like AutoOD, outlier ensemble methods use many detectors to detect outliers, so they are a natural point of comparison. However, these works only combine detectors from the same detection method, e.g., LOF [15].
• Snorkel [55-57]: given a set of noisy labeling sources, Snorkel aims to assign a label to each object with high accuracy. We use its most recent variant [57] to integrate the results of the different outlier detectors, treating them as noisy labeling sources. This can be seen as an advanced ensemble method.
• PyODDs [43]: leverages AutoML to select an outlier detector, under the strong assumption that ground truth labels are available.
• MetaOD [72]: uses a pre-trained meta-learning model to select a good detector from many candidates, where these detectors are pre-encoded during the training phase. We use the pre-trained model that the authors have published as the core ingredient of their approach (https://github.com/yzhao062/MetaOD).
• Isolation Forest [44]: we compare against an Isolation Forest with a random configuration. Following [46], we report its accuracy as the average accuracy of the 30 Isolation Forest models used in our experiments, which can be considered equivalent to an Isolation Forest with randomly chosen hyper-parameters.
• Best_Unsupervised: the best unsupervised outlier detector among all detectors, discovered by an oracle that uses the ground truth labels to compute the F-1 score of each detector.
• Ground-Truth (GT): a supervised classifier trained using the ground truth as labels. This corresponds to a best-case scenario, as it knows upfront which objects in the data set are outliers and inliers. As in AutoOD-Augment, we use an SVM as the supervised classification technique, trained on a random sample of 50% of the dataset.
Metrics. We first evaluate effectiveness at detecting outliers by measuring the F-1 score of the outlier class, given that F-1 is known to be robust to class imbalance. The F-1 score considers both precision and recall, where precision is the number of correctly detected outliers divided by the number of all outliers returned by the detector, and recall is the number of correctly detected outliers divided by the number of all ground truth outliers. F-1 is computed as $F\text{-}1 = 2 \times \frac{precision \times recall}{precision + recall}$. We also report precision and recall separately. Next, we evaluate the efficiency of AutoOD, including the running/training time, its scalability in the number of detectors, and the memory consumption. In addition, we report statistics of the reliable labels that AutoOD produces and the number and types of detectors that AutoOD-Augment preserves. Finally, we evaluate whether AutoOD-Augment is effective at identifying the bad detectors.
Parameters. For the methods that fall into the category of automatic detector selection, including MetaOD and PyODDs, we use the configurations suggested by their original authors in the literature. For example, in the AutoML-based PyODDs, there is no parameter to tune other than ensuring it uses the same set of detectors that AutoOD uses. Per the suggestion of the MetaOD [72] authors, we use the model they published, which in the training phase uses more types of outlier detectors than AutoOD. For the Isolation Forest, per the suggestion of [46], we report its accuracy as the average accuracy of the 30 Isolation Forest detectors we use. For the ensemble-based methods including Snorkel, LODA, and LSCP, we tune the parameters per the instructions of the authors and report the best results on each dataset. Snorkel uses the same set of detectors that AutoOD uses; we set its learning rate to 0.001 and the number of epochs to 1000. LODA requires the user to specify the outlier rate, which we set to the true outlier rate of each dataset. Per the authors' recommendation, LSCP uses LOF as the detection method; for each LOF detector, we set its outlier rate in the same way as our AutoOD does.

Summary of Effectiveness Results
In this experiment, we measure the effectiveness of AutoOD using a variety of benchmark outlier detection data sets. We find that AutoOD is consistently able to detect outliers with an accuracy higher than the best outlier detector among hundreds of configured detectors, while all other approaches we evaluated perform significantly worse. In fact, AutoOD succeeds in achieving an accuracy comparable to supervised outlier classifiers trained with ground truth labels, yet without having any access to such ground truth knowledge. AutoOD significantly outperforms the two SOTA outlier ensemble methods (LODA and LSCP), Snorkel, PyODDs, MetaOD, and Isolation Forest by 12 to 97 points (out of 100) in F-1 score.
6.2.2 Detailed Analysis of Effectiveness. Fig. 4 shows the results on ten benchmark datasets. On the x axis, the datasets are ordered by size; Http is the largest dataset. AutoOD applies the same configuration to all data sets. On almost all datasets, AutoOD outperforms all other methods except GT, with the latter having the unfair advantage of full access to the ground truth to train its outlier classifier and hence being expected to perform very well.
Comparison to Best Unsupervised. Compared to the best unsupervised detector (discovered by an oracle using the ground truth), AutoOD-Augment achieves higher F-1 scores on 9 out of 10 datasets by up to 38 points, and AutoOD-Clean achieves better F-1 scores on 8 out of 10 datasets by up to 39 points, without relying on human tuning. This is because AutoOD intelligently combines the contributions of all unsupervised detectors in the process of automatically discovering reliable labels, instead of relying on one single detector to produce the final results.

Comparison to Ensemble. The two outlier ensemble methods (LODA, LSCP) and Snorkel perform consistently worse than the best outlier detector, often by a large margin. The reason may be that the detectors often produce diverse results on many objects. It is thus challenging to make consensus-based inferences on these objects, yet this is precisely the nature of these outlier ensemble methods.
Comparison to PyODDs. Although PyODDs uses domain knowledge in the form of ground truth labels (to which our methods do not have access), it cannot find the best detector in all cases and often ends up with a poor detector. Therefore, in many cases it performs even worse than LODA, LSCP, and Snorkel, especially on large datasets. Because our AutoOD consistently outperforms the best detector, it significantly outperforms all these methods by 12 to 97 points in F-1 score.
Comparison to MetaOD. We use the model published by the authors. During training, MetaOD has already seen 8 out of the 10 testing datasets used in our experiments, which should bias performance in its favor. Overall, MetaOD slightly outperforms Snorkel and PyODDs, showing that this interesting meta-learning-based method works to some extent.
However, even on the 8 datasets seen before, MetaOD cannot find the best detector from the candidates, while it completely fails on the two previously unseen datasets (Mulcross and Http), ending up choosing a detector that performs poorly. This indicates MetaOD has poor generalization ability, probably because the characteristics of outliers in different datasets can be rather distinct.
Comparison to Isolation Forest. For the Isolation Forest with a random configuration, we reach a similar conclusion to the empirical study [46], namely that it is comparable to the methods that select one detector from a number of candidates and to ensemble-based methods. However, it still performs significantly worse than the best unsupervised detector and than our two AutoOD-based approaches, because Isolation Forests do not always perform best among all detectors, as shown in Table 4.
Comparison to GT. Compared to GT, our two AutoOD methods achieve comparable or even slightly higher F-1 scores on 6 out of the 10 datasets, even though AutoOD does not have access to any ground truth training data. However, on the Pendigits and Annthyroid datasets, AutoOD does not perform as well as GT. This is because for these datasets, all unsupervised outlier detectors have very low F-1 scores. Therefore, AutoOD cannot identify many high-quality labels from the output of these unsupervised detectors.
Scalability to Large Data & Robustness to Dimensionality. As shown in Fig. 4, the larger the dataset, the better AutoOD performs, indicating its scalability to large data. This is because it is easier to produce a sufficient number of quality labels from large datasets. The dimensionality of the datasets used in our experiments spans a wide range from 10 to 166. AutoOD works well on the high-dimensional Musk data, showing that AutoOD is robust to dimensionality. Because any outlier detection method can be plugged into AutoOD, base detectors less sensitive to data dimensionality can be used; for example, in our experiments we use Isolation Forest, which is known to work well on high-dimensional data.
The Convergence. In Sec. 4.4 (Termination Condition) and Sec. 5.2 we have shown that AutoOD-Augment and AutoOD-Clean are guaranteed to converge. In our experiments, we observe that on average AutoOD-Augment converges in 21 iterations, while AutoOD-Clean converges in 12 iterations.
The Performance Variance of AutoOD on Different Datasets. For some datasets, AutoOD is close to GT, while for others the difference is large. This is explained by the quality of the labels that AutoOD automatically produces. When AutoOD is able to produce high-quality labels, it tends to work as well as GT, which uses ground truth labels to train an outlier classification model. As shown in Fig. 4, the performance of AutoOD is comparable to GT on the large datasets, because on these AutoOD has a better chance to produce labels that are indeed reliable. Our analysis of the automatically produced labels (Table 2) confirms this: on the Musk, Stimage-2, Mulcross, Shuttle, and Http datasets, the labels produced by AutoOD have an accuracy close to 1. Interestingly, we observe that AutoOD typically does not need a large number of outlier labels to achieve high accuracy in outlier detection; as long as it gets a sufficient number of accurate inlier labels, AutoOD performs well. Because outliers are typically rare, this makes the strategy used by AutoOD well suited to outlier detection.
AutoOD-Augment vs. AutoOD-Clean. As discussed in Sec. 5.3, AutoOD-Augment and AutoOD-Clean work in two different ways and complement each other. However, we observe that they achieve similar accuracy on many datasets. Our analysis of the reliable labels confirms that this is because the reliable labels the two methods produce tend to have similar quality (Table 2) and, in most cases, heavily overlap with each other. These common reliable labels typically correspond to the same set of "clear" outliers and inliers, which are relatively easy for the unsupervised outlier detectors, and hence for AutoOD-Augment/AutoOD-Clean, to capture.

Precision and Recall. We study the precision and recall of the methods that perform best on F-1, including our two AutoOD methods, Isolation Forest, MetaOD, and the best unsupervised detector. We omit the other methods, as including them makes the graph hard to read. As shown in Figs. 5 and 6, AutoOD significantly outperforms all other methods. Moreover, AutoOD performs even better on recall than on F-1. This is important for outlier detection, as missing outliers (false negatives) often causes bigger problems than false alarms (false positives).

Reliable Label Analysis
In Table 2, we show statistics (quality and count) of the reliable labels that AutoOD produces, separately for inlier and outlier labels. We make the following observations:
• Neither AutoOD-Augment nor AutoOD-Clean needs a large number of reliable outliers to achieve good performance; as long as a sufficient number of high quality reliable inliers can be identified, i.e., labels the system can rely on, both methods work well.
• On large datasets, the reliable labels that AutoOD produces are near perfect. Therefore, AutoOD achieves an accuracy close to that of the supervised method GT. This is because the larger the dataset, the better the chance that AutoOD produces labels that are indeed reliable.
• The accuracy of the reliable inlier labels that AutoOD finds is consistently high, indicating that it is easier to find reliable inliers in these datasets than reliable outliers.

Effectiveness of AutoOD on a Dataset with a Low Outlier Rate
We experiment on the SMTP dataset [16,60], which has an extremely low outlier rate of 0.03%. Note that this is the only publicly available outlier detection benchmark dataset we could find with such a low outlier rate. As shown in Table 3, both AutoOD-Augment and AutoOD-Clean achieve an accuracy higher than the best unsupervised detector, and they significantly outperform the other methods. This mirrors the results on the other datasets with higher outlier rates. Note that in the effectiveness experiments discussed in Sec. 6.2, we evaluated AutoOD on datasets with outlier rates varying from 0.4% to 40%, confirming that AutoOD works well across a rich variety of outlier rates.
As analyzed in Sec. 6.3, this is because AutoOD does not need a large number of outlier labels to achieve good performance; as long as it obtains a sufficient number of accurate inlier labels, it works well. In outlier detection, getting reliable inlier labels tends to be much easier than getting reliable outlier labels. Therefore, AutoOD does not suffer as much from the rarity of outliers.
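In the spirit of Table 2, the per-class quality of reliable labels can be measured against ground truth as in the following sketch; the function name and the 0/1 label convention (1 = outlier) are our own illustrative choices.

import numpy as np

def reliable_label_quality(y_true, reliable_idx, reliable_labels):
    """Accuracy of automatically produced reliable labels, split by class.
    y_true: ground-truth 0/1 labels (1 = outlier); reliable_idx: indices of the
    objects marked reliable; reliable_labels: the 0/1 labels assigned to them."""
    y = np.asarray(y_true)[np.asarray(reliable_idx)]
    labels = np.asarray(reliable_labels)
    inl, out = labels == 0, labels == 1
    return {
        "n_inliers": int(inl.sum()),
        "inlier_acc": float((y[inl] == 0).mean()) if inl.any() else float("nan"),
        "n_outliers": int(out.sum()),
        "outlier_acc": float((y[out] == 1).mean()) if out.any() else float("nan"),
    }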

Efficiency Evaluation
We evaluate the running/training time of AutoOD, its scalability in the number of outlier detectors, and its memory requirements.

Comparison of Running Time to Other Methods.
We compare our AutoOD-Augment and AutoOD-Clean against MetaOD [72], PyODDS [43], and Snorkel [57], all of which, like AutoOD, use many different types of outlier detectors to produce the final detection results. As shown in Fig. 7, except on the Pendigits dataset, PyODDS is always the slowest because of the complex AutoML technique it uses.
The total running time of AutoOD consists of first running the unsupervised outlier detectors and then learning the reliable labels from their detection results, with the former typically dominating the latter, as shown in Fig. 7. Because AutoOD-Augment and AutoOD-Clean run the same set of unsupervised detectors, they typically have a similar total running time. This is why, compared to manual tuning, which first runs all detectors and then picks a good one, AutoOD does not add much overhead. It also explains why AutoOD tends to be better than or comparable to Snorkel in running time, since Snorkel runs the same set of outlier detectors as AutoOD.
For our comparison to MetaOD, we use the pre-trained model that the authors have published. Hence, we cannot measure the MetaOD training time, and we only report the time it takes to process the target dataset in which outliers are to be identified. The complexity of MetaOD depends on the inference time of the meta-learning model plus that of the final detector it selects. We note that MetaOD tends to be slow in the inference phase and often ends up with an expensive outlier detector, in particular ABOD [38]. Therefore, in many cases, it is slower than AutoOD. However, among all methods, this second phase of MetaOD (given that we skip the phase-1 training) tends to be the fastest on large datasets.
Note that all the above methods have much lower accuracy than AutoOD, as shown in Fig. 4, while AutoOD achieves this gain without paying any significant cost in running time. We thus consider AutoOD superior overall.

Scalability in the Number of Outlier Detectors.
Because AutoOD-Augment and AutoOD-Clean perform similarly in running time (for the reasons discussed above), we only report results for AutoOD-Augment here. We run this experiment on the MulCross dataset and vary the number of detectors from 50 to 250 by gradually increasing the number of detectors per detection method.
As we can see from Fig. 8, the running time of AutoOD-Augment increases with the number of detectors, as expected. However, the slope becomes flatter as the number of detectors grows. This is because AutoOD reduces the running time of the unsupervised outlier detectors by sharing common computation across them. In particular, for the slower detectors, namely the NN-based [54] and LOF [15] detectors that rely on expensive nearest neighbor (NN) search, AutoOD shares the NN search as much as possible (a minimal sketch of this sharing appears at the end of this subsection). Therefore, the running time increases sub-linearly with the number of detectors.
Memory. In AutoOD, memory consumption comes primarily from the data structure that keeps the anomaly score each outlier detector produces per data object. This space complexity is $O(n \times m)$, where $n$ is the number of data objects and $m$ is the number of unsupervised detectors. On the HTTP dataset, the largest dataset we tested, the peak memory consumption is 708 MB. Therefore, memory is not a performance bottleneck in AutoOD.
Quality of the Remaining Detector. We further evaluate the quality of the outlier detector that AutoOD-Augment preserves. For this, we add an additional constraint into its termination condition to ensure that only one detector survives. Table 4 shows the rank of this remaining outlier detector among all outlier detectors based on their F-1 scores. AutoOD-Augment selects the best outlier detector on 7 out of 10 datasets; on the remaining 3 datasets, it picks the second or third best detector.
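To make the shared-computation idea concrete, here is a minimal sketch, assuming scikit-learn, of how a single NN search at the largest k can serve a whole family of KNN-based detectors; the function name and the score definition (distance to the k-th nearest neighbor) are assumptions for illustration, not AutoOD's exact implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def shared_knn_scores(X, k_values):
    """One NN search at max(k) serves every KNN detector: each detector's score
    is read off by slicing the shared distance matrix."""
    k_max = max(k_values)
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(X)  # +1: each point finds itself
    dists, _ = nn.kneighbors(X)                          # shape (n, k_max + 1)
    # KNN outlierness at k = distance to the k-th nearest true neighbor
    # (column 0 is the point itself at distance 0).
    return {k: dists[:, k] for k in k_values}

X = np.random.RandomState(1).randn(1000, 10)
scores = shared_knn_scores(X, k_values=[5, 10, 25, 50])  # one search, four detectors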

RELATED WORK
Automated Machine Learning (AutoML). AutoOD targets a problem similar to AutoML in that it aims to learn which of many models is the best. However, in contrast to our work, existing AutoML systems [13,14,25,28,37,40,42,45,63,64,74] all (1) focus on supervised machine learning and (2) require labeled training data for automatic cross validation. PyODDS [43] leverages AutoML techniques to select one outlier detection method and its corresponding input parameter setting, but assumes the existence of labeled data. Further, our experiments show that PyODDS is not very effective in finding good outlier detectors among the candidates even when using ground truth labels; AutoOD significantly outperforms PyODDS even in this setting.
Outlier Ensemble. Ensembles have been studied in the context of outlier detection [4,5,39,52,71]. In [4,5], the authors investigated the theoretical underpinnings of outlier ensembles. In [39], the authors use feature bagging for outlier ensembles. However, it relies on the strong assumption that they "have information about normal behavior (class) in the data set", which often does not hold.
LSCP [71] and LODA [52] are recent outlier ensemble works. LSCP defines a local region around an instance using the consensus of its nearest neighbors in randomly selected feature sub-spaces. It then combines the detection results produced by multiple base detectors in this local region into the model's final output using traditional ensemble techniques [5]. In its evaluation, LSCP outperforms feature bagging [39].
LODA [52] first uses many one-dimensional histograms to produce outlier candidates: an object is considered an outlier candidate if its value on one dimension is out of distribution. LODA then aggregates the sets of outlier candidates to produce the final results, again based on consensus. LSCP and LODA perform much worse than AutoOD because, as our empirical study shows, the base detectors often produce very diverse results on many objects, making consensus-based inference challenging. AutoOD does not rely on the ensemble to discover all outliers. Instead, it focuses on discovering reliable outliers/inliers and then uses those as pseudo-labels to train a classifier, and is hence much more effective.
Weak Supervision. Given a set of noisy labeling functions that cover different overlapping subsets of a given data set, Snorkel [55,57] assigns a label to each object in the dataset by weighting and then ensembling the results produced by these labeling functions. More specifically, using a matrix completion technique, Snorkel learns a weight for each function based on its agreement rate: if a function agrees with many other functions, Snorkel tends to assign a large weight to it. It can thus be considered an advanced ensemble method. As confirmed in our experiments (Sec. 6), Snorkel typically performs much worse than the best individual detector, for the reason discussed above for outlier ensembles.
Parameter Tuning in Outlier Detection. ONION [17] is an online system that, given a data set, allows users to interactively tune the parameters of distance-based outlier detection. The key idea is to pre-compute and index the results of parameterized distance-based outlier detectors into a compact index. Using index lookups, ONION is able to answer outlier detection requests with different parameter settings in near real-time. Unlike AutoOD, it still relies on humans to manually decide which parameter setting to request and thus to tune the parameters. Further, it only supports one specific outlier technique, namely distance-based outliers [35].
MetaOD [72] is a meta-learning based method. Its key idea is to run a large number of outlier detectors on historical outlier detection benchmark datasets with ground truth labels and to use this prior experience to automatically select an effective model for a new dataset. To capture task similarity, it introduces specialized meta-features that model the characteristics of the outliers in a dataset. However, as confirmed in our experiments (Sec. 6.2.2), even on the 8 testing datasets seen in the training process, MetaOD cannot find the best detector among the candidates, while on the 2 unseen datasets it ends up choosing a detector that performs poorly, indicating poor generalization ability.
Human-in-the-Loop Outlier Detection. HOD [19], a crowd-based method, relies on humans to find outliers. HOD first uses many unsupervised outlier detectors to produce a large outlier candidate set, taking the union of the detected outliers. To reduce human labor costs, it designs questions that, once answered by humans, help verify the status of multiple outlier candidates. In contrast, AutoOD does not make use of any human supervision, yet achieves accuracy comparable to supervised outlier classification.
Deep Anomaly Detection. Deep learning has been used to detect outliers in data of complex formats such as images or time series, typically by learning a representation that better distinguishes outliers from inliers. Some of these techniques [8,22,59,69] use the reconstruction error of an Auto-Encoder as the anomaly score, assuming that Auto-Encoders incur larger reconstruction errors on outliers than on inliers. Other techniques use the same principle but apply different deep learning models to learn the data representation, such as Generative Adversarial Networks [6,51,68], self-learning models [27], and Auto-regressive models [2]. AutoOD is compatible with these methods.
To learn a representation effective in separating outliers, most of these methods require a clean training data set, i.e., a data set not containing any outliers. However, such clean training data rarely exists in real applications. Robust deep anomaly detection [12,20,24,66,73] targets this problem by finding data with potentially corrupted features during training, while AutoOD-Clean finds mislabeled objects among automatically produced outlier/inlier labels. Elite [70] instead uses a small number of ground truth labels to mitigate the impact of outliers on representation learning. However, ground truth labels are not available in the unsupervised setting that AutoOD targets.
Data Fusion. Data fusion integrates data from different sources, which potentially conflict with each other, to obtain a better description of the same objects. To exclude the effect of low quality data sources, the authors of [23] select a subset of data sources that balances cost and gain, where the gain is approximated on a set of samples whose truths are known beforehand. However, AutoOD does not assume the availability of such ground truth. CRH [41] estimates source reliability by iterating between truth estimation and source weighting. However, it relies on the users to define a cost function and a regularization function based on the characteristics of the data sources and the source reliability distributions; such knowledge does not exist in the AutoOD setting.

CONCLUSION
In this work, we propose AutoOD, which elegantly unifies the merits of unsupervised outlier detection and supervised classification into one solution for tackling outlier detection. AutoOD leverages a diverse set of unsupervised outlier detectors to iteratively generate high quality labels for the data. Using these automatically generated labels, it then trains a classification model to reliably discover outliers. We design two strategies for realizing the AutoOD methodology, namely AutoOD-Augment and AutoOD-Clean, which differ in the way they generate the training set for the classifier. Our experiments show the effectiveness of the two AutoOD-based methods, achieving a consistent 12 to 97 point gain in the F-1 score compared to a wide range of alternative solutions on 11 benchmark outlier datasets.

Fig. 1 depicts the overall AutoOD framework. AutoOD is composed of three key components: unsupervised outlier detection, automatic reliable object discovery, and outlier classification.
(1) Unsupervised Outlier Detection. Given an input data set, AutoOD first uses a set of unsupervised outlier detectors to detect outliers. Each detector corresponds to one outlier detection method in the built-in AutoOD library with a particular configuration of parameter values instantiated. For simplicity and ease of use, for each detection method, AutoOD uniformly picks parameter configurations from a reasonable parameter range recommended by [16].
(2) Automatic Reliable Object Discovery. Next, based on these unsupervised detection results, AutoOD divides the input data into two subsets, "reliable objects" and "unsure objects". The reliable objects are those which AutoOD is confident are clearly inliers or outliers based on the detection results produced by the detectors.
(3) Outlier Classification. Finally, the automatically discovered reliable objects are used as training data for a binary outlier classification model. This model then produces predictions for the "unsure" objects, whose labels remain unknown. This way, AutoOD eventually assigns labels to all objects.
AutoOD-Augment and AutoOD-Clean. Clearly, the effectiveness of AutoOD relies on the number and quality of the reliable objects discovered in the second step and used as labeled training data thereafter. In this work, we design two approaches to discover these reliable objects that complement each other. First, we propose an augmentation-based method, called AutoOD-Augment (Sec. 4), which starts by automatically discovering a small but reliable set of objects to label and keeps augmenting this set iteratively. Second, we propose a cleaning-based method, called AutoOD-Clean (Sec. 5). As opposed to AutoOD-Augment, AutoOD-Clean starts with a large set of noisy labels and keeps cleaning this set into an increasingly reliable one. In the next two sections, we introduce the two approaches in detail.
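The end-to-end flow can be condensed into a short sketch. The Python rendering below is hypothetical: detector_scores and find_reliable_objects stand in for components (1) and (2), and a logistic regression stands in for the classifier in component (3); none of these specific choices are prescribed by AutoOD.

import numpy as np
from sklearn.linear_model import LogisticRegression

def autood_pipeline(X, detector_scores, find_reliable_objects):
    S = detector_scores(X)                  # (1) unsupervised detection: n x m score matrix
    idx, labels = find_reliable_objects(S)  # (2) reliable objects (0 = inlier, 1 = outlier)
    clf = LogisticRegression(max_iter=1000).fit(X[idx], labels)  # (3) outlier classification
    y_pred = clf.predict(X)                 # label everything, including unsure objects
    y_pred[idx] = labels                    # keep the reliable labels untouched
    return y_pred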

Figure 3: Observation: Evolution of Training Accuracy for Mislabeled and Correctly Labeled Samples.

Figure 5: Effectiveness Evaluation of All Methods on Ten Benchmark Data Sets using Precision Metric

Figure 6: Effectiveness Evaluation of All Methods on Ten Benchmark Data Sets using Recall Metric

Figure 7: Runtime Evaluation
Figure 8: Scalability in the Number of Outlier Detectors
DEFINITION 4.2 (Comparison Criteria). Let $O$ and $I$ represent an outlier set and an inlier set, respectively. We say detector $d_i$ is better than detector $d_j$ if and only if
$$\sum_{o_k \in O} s_{i,k} - \sum_{o_k \in I} s_{i,k} \;>\; \sum_{o_k \in O} s_{j,k} - \sum_{o_k \in I} s_{j,k}, \tag{4}$$
where $s_{i,k}$ corresponds to the outlierness score which detector $d_i$ assigns to object $o_k$. Intuitively, given a set of outliers and inliers, we say detector $d_i$ is better than detector $d_j$ if, compared to $d_j$, $d_i$ in total assigns larger outlierness scores to outliers and smaller outlierness scores to inliers. Next, we prove Lemma 4.1.
LEMMA 4.1. If detector $d_i$ is better than $d_j$ by the comparison criteria in Def. 4.2, then $w_i > w_j$.
PROOF. Let $A_i$ denote $\sum_{o_k \in O} s_{i,k} - \sum_{o_k \in I} s_{i,k}$ and $A_j$ denote $\sum_{o_k \in O} s_{j,k} - \sum_{o_k \in I} s_{j,k}$. Focusing on $d_i$ and $d_j$, we re-write the objective function of the logistic regression as
$$\max \;\; \sum_{o_k \in O} (w_i \cdot s_{i,k} + w_j \cdot s_{j,k}) - \sum_{o_k \in I} (w_i \cdot s_{i,k} + w_j \cdot s_{j,k}) \;=\; w_i A_i + w_j A_j. \tag{5}$$
By Equation 4, if $d_i$ is better than $d_j$, we have $A_i > A_j$. Maximizing Equation 5 under this condition therefore yields $w_i > w_j$. This concludes the proof. □
By Lemma 4.1, weight $w_i$ represents the relative performance of detector $d_i$. This in turn justifies AutoOD's pruning rule.
4.4 Reliable Object Set Update. In Alg. 2, we leverage the idea of multi-view outlier classification to update the set of reliable objects. For this, beyond training the logistic regression (LR) model on the outlierness score features (the outlier score model), AutoOD-Augment trains an additional outlier classifier using only the raw features (attributes) of the stable data set (Line 3), called the data feature model.
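Lemma 4.1 is also easy to sanity-check numerically. The following sketch, purely illustrative and separate from the proof above, trains a logistic regression on the scores of a synthetic strong detector and a synthetic weak one and confirms that the stronger detector receives the larger weight.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 2000
y = (rng.rand(n) < 0.1).astype(int)  # 10% outliers
s_i = 2.0 * y + 0.5 * rng.randn(n)   # detector i: strong separation (better per Def. 4.2)
s_j = 0.5 * y + 0.5 * rng.randn(n)   # detector j: weak separation
S = np.column_stack([s_i, s_j])

w_i, w_j = LogisticRegression().fit(S, y).coef_[0]
print(f"w_i = {w_i:.2f}, w_j = {w_j:.2f}")  # w_i > w_j, as Lemma 4.1 predicts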
AutoOD-Clean starts with a large but noisy training dataset and alternates between training a neural network based on a trimmed loss and updating the training data using the early training loss as well as the prediction results of the neural network. The AutoOD-Clean approach is composed of three steps: initial training data generation, modeling, and training data update.
5.1 Cleaning Rule. Let $L_I = \{l_{I_1}, l_{I_2}, \ldots, l_{I_n}\}$ denote the early losses of the training objects classified as inliers ($I$), and $L_O = \{l_{O_1}, l_{O_2}, \ldots, l_{O_n}\}$ denote the early losses of the training objects classified as outliers ($O$). A training object $o_i$ with early loss $l_i$ is removed if: (1) $l_i > \mathrm{mean}(L_I) + \mathrm{std}(L_I)$ when $o_i \in I$; or (2) $l_i > \mathrm{mean}(L_O) + \mathrm{std}(L_O)$ when $o_i \in O$.
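A minimal sketch of this cleaning rule, assuming early losses and 0/1 labels are given as NumPy arrays; the function name is our own.

import numpy as np

def clean_training_set(losses, labels):
    """One pass of the Sec. 5.1 cleaning rule: drop every training object whose
    early loss exceeds mean + std within its own class (0 = inlier, 1 = outlier).
    Returns a boolean mask of the objects to keep."""
    keep = np.ones(len(losses), dtype=bool)
    for cls in (0, 1):
        mask = labels == cls
        cutoff = losses[mask].mean() + losses[mask].std()
        keep &= ~(mask & (losses > cutoff))
    return keep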

Table 1: Benchmark Data Sets for Outlier Detection
Outlier Detectors. We use four popular unsupervised outlier detection methods: Local Outlier Factor (LOF) [15], K-Nearest Neighbors (KNN), Isolation Forest [44], and the Mahalanobis method, which together cover diverse categories of outlier types including local outliers, global outliers, tree-based outliers, and statistical outliers.
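As an illustration of the statistical category, a generic Mahalanobis-distance detector can be sketched as follows; this is a textbook rendering, not necessarily AutoOD's exact implementation.

import numpy as np

def mahalanobis_scores(X):
    """Statistical detector: Mahalanobis distance of each object to the data mean;
    larger distances indicate more outlying objects."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))  # pinv guards against singular covariance
    diff = X - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))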
Experimental Setup. The experiments were run on a single machine with 32 Intel Xeon 2.30GHz cores, 120GB RAM, and a 500GB disk on Google Cloud. We use one P100 GPU to train the neural networks.

Table 2: Reliable Label Analysis

Table 3: Evaluation on SMTP (Low Outlier Rate)

Table 4: AutoOD-Augment: the Remaining Detector
We observe that: (1) no single detector consistently outperforms the others; (2) Isolation Forest and LOF tend to work better in general.