On Model Performance Estimation in Time Series Anomaly Detection

The usual way to quantify the performance of a novel classification algorithm, especially in time series anomaly detection, is to compare it against selected baseline competitors on selected data sets. There is a common understanding in the community of which data sets and baselines should be considered when evaluating an algorithm's performance. Nevertheless, the basis on which data sets and baselines should be selected is frequently discussed. In this paper, we propose an index for univariate time series data in anomaly detection based on information theory. The index shows an association with the AUC-score of an anomaly detection algorithm trained on the data, meaning that the index can be used as a proxy for the "difficulty" of the classification task a data set holds. We develop a workflow to quantify this association using an interpretable classifier that relies on the index, together with a derived performance baseline. The classifier performs within the margin of error of this performance baseline, meaning that we were not able to clearly demonstrate the association between index and AUC-score mathematically. We believe that our work, which unites mathematical concepts from information theory, physics, and computer science, is innovative and points in a promising direction that is worth investigating.


INTRODUCTION
Time series anomaly detection is the task of finding points or groups of points in a time series that significantly differ from the distribution of the majority of the points.
Defining this "difference" between the normal and the abnormal state in a strict mathematical sense is, if it is even possible, very difficult. An anomaly could be a single value that is numerically different from the rest of the values in the time series. It could also be a break in a recurring pattern, or a difference in the way two dimensions of a time series relate to each other (e.g. a time series containing two periodic, in-phase signals that on rare occasions go out of phase) [8]. Since no general definition of an anomaly is known, it is also not feasible to write a classical algorithm for general-purpose anomaly detection. Machine learning algorithms, however, overcome this requirement of an explicit definition. In the machine learning context, the information about what is normal and what is abnormal is implicitly characterized by the training data. The data thus implicitly defines the classification problem at hand.
There are different methods for time series anomaly detection [12]. In this paper, we focus on reconstruction-based approaches. Here, the anomaly detector is built around two core components: a machine learning model and a discriminator.
The model is trained to reconstruct normal samples from the data. There are different approaches to what this model can look like (examples included in this work are feed-forward networks, recurrent neural networks, convolutional neural networks, and attention-based networks). A common property most reconstructing models share is an information bottleneck realised in the network structure (e.g. an autoencoder structure). This way, the model is trained to prioritise the information that is passed through the bottleneck to construct a faithful recreation of the input. A good-quality reconstruction can only be achieved when other properties of the data set that are not passed through the bottleneck are aggregated in the network structure. The network "learns" how the normal state behaves.
The heuristic to detect anomalies with this reconstructing model is as follows: the network must "know" the properties of the normal state to reconstruct it. If an anomaly is passed through the network, the reconstruction will not be as good, since the network is not trained to recreate it. The second part of the anomaly detector, the discriminator, is an algorithm that processes the error (e.g. L1 or L2 error) the reconstructing model makes. This could be either another machine learning algorithm or a classical algorithm like adaptive thresholding. This part of the algorithm labels an input, or a certain part of an input, as abnormal based on the reconstruction error 1.
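As an illustration of this two-component design, the following sketch combines a placeholder reconstruction model with a fixed threshold as the discriminator; the model, threshold, and data are hypothetical and only serve to show the heuristic:

```python
import numpy as np

def detect_anomalies(model, windows, threshold):
    """Flag windows whose reconstruction error exceeds a threshold.

    `model` is any callable mapping a batch of windows to reconstructions;
    the threshold would normally be chosen on normal validation data.
    """
    recon = model(windows)                            # reconstruct each window
    errors = np.mean((windows - recon) ** 2, axis=1)  # per-window L2 error
    return errors > threshold, errors                 # boolean anomaly labels

# Toy usage: a trivial "model" that always predicts the normal mean (zero)
# reconstructs normal windows well, so only the corrupted window stands out.
rng = np.random.default_rng(0)
windows = rng.normal(0.0, 0.01, size=(5, 32))
windows[3] += 5.0                                     # inject an anomaly
labels, errors = detect_anomalies(lambda x: np.zeros_like(x), windows, 1.0)
```

In a real detector, the lambda would be replaced by a trained autoencoder and the fixed threshold by an adaptive one.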
Time series anomaly detection is essentially a binary classification task (deciding whether a value is abnormal or not). There are different methods to quantify the quality of a binary classification. In this work, we settled on the AUC-score as the performance indicator.
In the field, huge effort is put into the creation of new algorithms for specialized applications or better performance overall. There are many different types of time series, and it is sometimes hard to say what makes an algorithm perform well on one data set and badly on another.
There is no real gold standard to evaluate the performance of an algorithm. Usually, the algorithm is compared against others on a variety of data sets. The choice of these data sets (and the classification problems they hold) is up to the authors. To assert proper testing, the benchmark data sets have to represent the field of the intended use case of the algorithm, and the baselines have to be relevant and should represent the state of the art. Especially for general-purpose algorithms, there is a variety of benchmark data sets the community agrees on and that are often used ([19] and [1] represent research where this workflow is employed).
To formalize this methodology, some authors, e.g. Dau et al. [4], collected a variety of data sets with the aim of covering most of the real-world time series that might be faced in industry. Other authors like Lai et al. [8] explored the possibility of formally defining mathematical criteria that describe a data set. They developed a taxonomy for different types of anomalies.
Our work follows this approach as well. The primary issue with the taxonomy Lai et al. created is that, while it is well suited as a foundation to develop data generators, it is hard to apply to existing data 2. In this work, this issue is addressed by using only formalisms that require very few assumptions about the time series data to cultivate a computational measure that is generally applicable. The second requirement for this measure is that it should be straightforward to calculate.
To fulfill these criteria, concepts from information theory are used. Entropy and mutual information are generally applicable concepts, since they require very few assumptions about the system in question and are straightforward 3 to compute.
This paper is structured in two parts.
First, a way to assign an index to a time series data set is proposed. The computation is motivated by information-theoretical considerations. Furthermore, this index is calculated for a variety of different data sets. Following this step, several reconstruction-based anomaly detection algorithms are trained on the sets. The relation of the index to classifier performance is evaluated graphically.
For the classifiers at hand, it is visually notable that the index is "weakly associated" with the classifier performance. Second, to further investigate and quantify this "weak association", a nearest neighbor-based classifier is developed that outputs an estimated performance of the trained algorithm. The idea here is that the nearest neighbor classifier relies on the same information visible when looking at the plots discussed earlier, but offers a mathematical way to discuss and further evaluate this information. Since the classification and the uncertainty of the classifier depend only on the index and the anomaly detection algorithm's results, they can be used as a proxy to quantify the quality of the index 4. We introduce two baselines to compare the algorithm against. One is a classifier that is just randomly guessing; the other is a classifier that always outputs the average of the measured AUC-scores. The latter is the important baseline to surpass: surpassing it means that further information, not present in the average of the AUC-scores, had to be used, and this information is held in the index. While the classifier outperforms the baseline of a classifier that is just randomly guessing, it performs within the margin of error of a classifier that always returns the average of all recorded performance results of the algorithm. It is further investigated how the nearest neighbor algorithm generalizes and how the uncertainty of the algorithm is distributed in the parameter space. Additionally, it is discussed how the distributions of the predictions of the nearest neighbor classifier compare to the distribution of measured values. Finally, all points and variables that could be altered for further investigation are listed.
The main contributions and novelties of this work are:
• The concept of the FEMI-index is introduced. The FEMI-index represents a novelty: to our knowledge, although information-theoretical influences are present in the field of time series anomaly detection, there is no work using similar concepts to describe the properties of data sets.
• The workflow described in this paper shows how the "high level" concept of an index containing information on the difficulty of the anomaly detection task that comes with a data set can be quantified: an interpretable classifier with uncertainty estimation is developed to serve as a tool to quantify the influence of the index at hand. Additionally, two essential baselines are motivated. Surpassing both baselines is strong evidence that the index under test actually contains information about the difficulty of the anomaly detection task for a given algorithm.
• The uncertainty quantification in this work applies Gaussian error propagation 5 (introduced in section 4.1) to topics in time series anomaly detection.

FOURIER ENTROPY MUTUAL INFORMATION (FEMI)-INDEX
First, the concepts of entropy and mutual information are briefly presented. In addition, an introduction to the intuition behind the computation of the Fourier entropy mutual information (FEMI)-index is given. A mathematical formulation is stated at the end of this chapter.

4 A regression algorithm takes the data in and delivers a linear function describing the data. In correlation analysis, however, one is only interested in the goodness of the fit as a proxy for how well the data fits a linear function.
5 Gaussian error propagation is a standard for quantifying uncertainty in experimental physics.

Entropy and Mutual Information
Entropy 6 is a measure of the expected statistical information of a continuous distribution of values. Let X be a random variable with values in Ω, and let f(x) : Ω → [0, 1] be the probability density function of X. The (differential) entropy of X is defined as:

h(X) = −∫_Ω f(x) log f(x) dx    (1)

Mutual information is a measure of the difference in statistical information in one distribution compared to another. For two random variables X and Y, the mutual information is defined as:

I(X; Y) = h(X) + h(Y) − h(X, Y)    (2)

It vanishes exactly when both distributions are independent.
In real-world scenarios, it is often not possible to obtain f(x); instead, one is left with samples S := {x_i ∈ Ω, i ∈ {1, ..., N} ⊂ ℕ} that are assumed to be drawn from the distribution of X. For this case, there are methodologies to estimate the entropy and mutual information of X from the samples x_i. The straightforward approach for estimating the entropy from a set of samples is to bin the samples. The ratio of the number of samples in one bin to the bin size is an approximation of the underlying distribution. The quality of these estimates varies with the applied binning strategy; the best values can be expected with an adaptive binning strategy based on the samples. For one-dimensional data, it is also possible to sort the samples. The distance between one sample and its neighbour is inversely proportional to the density function. With this intuition, an estimate for the entropy can be derived. There are also generalisations of this method relying on ranking the neighbouring points based on some distance function. Once the entropies are known, the mutual information can be calculated using equation 2. We refer to Kraskov et al. [7] for a deeper insight into the calculation process.
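The binning approach described above can be sketched as follows. This is a naive estimator for illustration only; the adaptive and nearest-neighbour estimators mentioned above behave better in practice:

```python
import numpy as np

def entropy_hist(samples, bins=32):
    """Naive histogram estimate of the differential entropy h(X).

    The probability mass per bin approximates the integral of f over the bin;
    dividing by the bin width approximates the density value, and the entropy
    is h(X) ≈ -Σ p_i log(f_i).
    """
    counts, edges = np.histogram(samples, bins=bins)
    widths = np.diff(edges)
    p = counts / counts.sum()                  # probability mass per bin
    nz = p > 0                                 # skip empty bins (0 log 0 := 0)
    density = p[nz] / widths[nz]               # approximate pdf value per bin
    return -np.sum(p[nz] * np.log(density))    # h(X) ≈ -Σ p_i log f_i

# For a uniform distribution on [0, 1] the true differential entropy is 0.
rng = np.random.default_rng(1)
h = entropy_hist(rng.uniform(0.0, 1.0, 100_000))
```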
In this work, an entropy sampler developed by Marín-Franch et al. [10] is used. The entropy and mutual information estimated from sets of samples S_1 and S_2 are denoted h(S_1), h(S_2) and I(S_1, S_2).

The Intuition behind FEMI-Index
To compute the index, a data set is needed that is split into two subsets. One subset contains only data that is considered normal; the other contains normal data alongside abnormal data. Note that the data itself does not have to be labeled. The goal of the index is to find a small number of mathematical properties that describe the data set. The following bullet points introduce the parameters used and the intuition behind their inclusion: • By calculating the entropy of the subset containing only normal data, we ask: "How much information is there to be learned to quantify the normal state of the system that is described by the data?"
• By calculating the mutual information between the subset containing normal data and the subset that contains abnormal data, we ask: "How much information is there separating the abnormal case from the normal case?" However, computing these measures directly on the raw data would implicitly ignore the time series nature of the data, since the order in which values are recorded would be disregarded. To tackle this issue, the Fourier transform is taken, and the triplets of radius and angle of the complex amplitude and the corresponding frequency from the Fourier transformation are aggregated. The information-theoretical measures are then calculated for the distribution of those triplets. The whole process of the calculation is visualized in a flow chart, seen in figure 2.

Formal Description
To formally describe the calculation of the FEMI-index, a definition of time series data is introduced: for each data set, let T = {1, ..., N} ⊂ ℕ be a set of indices. For every i ∈ T there is a t_i ∈ ℝ and a v_i ∈ ℝ. The t_i are called timestamps and the v_i are called values. In all of the data sets used in this work, the values were recorded at equidistant points in time; thus the timestamps t_i are ignored in the further description of the formalism (they would have to be included otherwise).
For a constant w ∈ ℕ (the window size) and a start index s ∈ {1, ..., N − w}, an entry e = (v_s, ..., v_{s+w−1}) is a window of w consecutive values. In the following paragraphs, the computation of the FEMI-index is introduced step by step.

To calculate the FEMI-index, the entries are first multiplied by the Hanning window function. This way, artifacts that would emerge from the non-periodicity of the values of the time series represented by the entry are prevented. The Hanning window for an entry with w values is given by:

h_n = 0.5 (1 − cos(2πn / (w − 1))),  n = 0, ..., w − 1

After windowing, the Fourier transform of the entry e is calculated. In this work, the following version of the discrete Fourier transform is used; the Fourier coefficient for the frequency k is given as:

c_k = Σ_{n=0}^{w−1} h_n v_{s+n} e^{−2πi kn/w}

(the indices for the values v are shifted by the start index s of the entry). The complex output of the Fourier transform then has to be transformed into ℝ². This is done either by taking the radius and angle of the complex amplitude (polar variant) or by taking its real and imaginary part (component variant). Each variant leads to a different version of the FEMI-index. This way, each entry e is translated to a set Y containing triplets y′ ∈ ℝ³: the first two values represent the complex amplitude from the Fourier transform, and the third element is the corresponding frequency. For each data set, two sets of entries are needed: one where the entries contain anomalies (at a rate that is expected for the data source underlying the set) and one without anomalies. Formally written: for a set size m ∈ ℕ, two tuples 7 of start indices I_normal, I_abnormal ∈ (1, ..., N − w)^m are needed, such that two tuples of entries can be defined: E_normal contains the normal entries, E_abnormal can contain abnormal entries. By applying the above-mentioned procedure of multiplying with the Hanning window, Fourier transforming, and converting the entries in each tuple, new tuples are obtained. By concatenation, two sets Y_normal and Y_abnormal are obtained, each containing three-dimensional entries. The FEMI-index is then obtained by computing:

FEMI = (h(Y_normal), I(Y_normal, Y_abnormal))
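The windowing, Fourier transform, and triplet aggregation described above can be sketched as follows. The non-overlapping window layout and the function name are illustrative assumptions, and the entropy/mutual-information estimation itself is left to an external estimator such as the one by Marín-Franch et al.:

```python
import numpy as np

def femi_triplets(values, window_size, polar=True):
    """Turn a univariate series into the Fourier triplets used by the index.

    Slide windows over the series, apply a Hanning window, take the DFT of
    real input, and emit one triplet (radius, angle, frequency) -- or
    (real, imag, frequency) for the component variant -- per coefficient.
    """
    hann = np.hanning(window_size)
    triplets = []
    for start in range(0, len(values) - window_size + 1, window_size):
        spectrum = np.fft.rfft(values[start:start + window_size] * hann)
        freqs = np.fft.rfftfreq(window_size)
        if polar:
            a, b = np.abs(spectrum), np.angle(spectrum)
        else:
            a, b = spectrum.real, spectrum.imag
        triplets.append(np.column_stack([a, b, freqs]))
    return np.concatenate(triplets)

# The index itself would then be (h(normal), I(normal, abnormal)), computed
# on two such triplet sets with an entropy/mutual-information estimator.
t = np.sin(np.linspace(0, 40 * np.pi, 1024))
y = femi_triplets(t, window_size=128)
```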

FEMI-INDEX CALCULATION
In this chapter, the conduct of the experiments is reported. 15 models were tested on 128 data sets. As a measure of anomaly detection performance, the AUC-score was computed. For each data set, the FEMI-index was computed as well, resulting in one data point of FEMI-index and AUC-score per data set; these are used throughout the rest of the paper. The error estimation is described as well.
The FEMI-index was evaluated on several benchmark data sets. The benchmark data consists of the following sets: • Synthetically generated sine waves with anomalies, where the anomalies manifest either in a deviation from the normal periodicity of the sine or in its amplitude (4 data sets). • The data sets from the UCR archive [4]. Since the UCR archive is a time series classification data set 8, we labeled one class as anomalous and the others as normal (121 data sets). • Data from machine 1-1 of the SMD data set [13].
The data is converted into univariate data by looking only at the dimensions where the anomaly manifests itself (2 data sets 9). • Data from the ECG data set [5] (1 data set).
We randomly sample examples from these data sources and group them to obtain the sets needed for the FEMI-index calculation. This stochastic nature of the data sources also affects the computed FEMI-index. The indices reported are the mean of 10 calculations for each data source. The errors seen in the plots are the standard deviation of these calculations. Taking this error into account makes the evaluation of the index more "realistic", since sampling data from a system of interest multiple times would cause a similar variation of the index.

Figure 2: This flowchart visualizes the process of computing the FEMI-index. Note that there are two variants of the computation, depending on whether one chooses the component or the polar transformation of the complex amplitudes into ℝ². The variables c_1 and c_2 represent the complex amplitudes that correspond to the frequencies f_1 and f_2. The variables r_1, r_2, φ_1 and φ_2 are the radii and angles of c_1 and c_2.
To further evaluate the association between the index and the anomaly detection algorithm's performance, models are trained on each of the data sources. For now, we limit ourselves to reconstruction-based anomaly detection methods. Simple "toy model"-like implementations are applied in this work instead of state-of-the-art algorithms, since most state-of-the-art algorithms consist of very sophisticated data augmentation strategies and loss functions working in conjunction. By taking a simpler approach, we hoped to find behavior that can be associated with the model structure, which would be harder to find when comparing more complex models. All systems have in common that they are trained to reproduce the input data. The reconstruction error of the input is used as a proxy for the abnormality of the data point. For all models, the AUC-score on the data sets was recorded. Different models were tested and are described in the following section. Additionally, the models referenced in table 1 are introduced. All models are implemented in PyTorch [11].
Feed Forward Autoencoder (0-3). These models are feed-forward autoencoders with ReLU activation functions. The latent space is a factor of k smaller than the input and always lies on the middle layer (layer 4 in this case). On the other layers, the number of neurons per layer is linearly decreasing/increasing to map the input to the latent space and the latent space to the output.
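A minimal PyTorch sketch of such a network, assuming a window size of 128 and a compression factor k = 4; both values and the two-layer encoder/decoder depth are illustrative, and the paper's exact layer widths may differ:

```python
import torch
import torch.nn as nn

class FeedForwardAE(nn.Module):
    """Sketch of a feed-forward autoencoder in the spirit of models 0-3."""

    def __init__(self, window_size=128, k=4):
        super().__init__()
        latent = window_size // k                  # bottleneck, k times smaller
        mid = (window_size + latent) // 2          # linearly interpolated width
        self.encoder = nn.Sequential(
            nn.Linear(window_size, mid), nn.ReLU(),
            nn.Linear(mid, latent), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, mid), nn.ReLU(),
            nn.Linear(mid, window_size),
        )

    def forward(self, x):
        # Reconstruct the input window through the bottleneck.
        return self.decoder(self.encoder(x))

model = FeedForwardAE()
out = model(torch.zeros(2, 128))                   # batch of two windows
```

Training would minimize the reconstruction loss (e.g. MSE between input and output) on normal windows only.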
CNN Autoencoder (9-12). An autoencoder that consists of convolutional layers connected by linear layers, which mediate the change in dimension of the information passed from layer to layer. In addition to the "normal" encoder, which operates on the time domain and processes the time series, a second encoder can be added, either alongside the first one or instead of it, that processes the Fourier-transformed version of the time series (following an idea shown in a blog post by Ilia Zaitsev [18]). Before the transformation, the time series can either be windowed with a Hanning window or transformed directly.
In addition to these models, we originally also included LSTM/GRU-based reconstruction models (IDs 4-8) and transformer-based approaches (IDs 13 and 14) in the testing. Unfortunately, there were errors in the code that were discovered after the benchmarking, rendering the benchmark results unusable. In the transformer code, it is simply due to a bug 10. The recurrent neural networks need an optimizer that is more specifically tailored to the task; the loss during the training epochs indicates that there are stability issues. We thus decided to exclude these models from the discussion. For the curious reader, we include additional information, example plots, and results from these models in the appendix (section A.2).
All of the models were trained for 40 epochs using the Adam optimizer [6] with a learning rate of 0.001. Convergence was confirmed by evaluating the loss functions for a portion of the models.
The specific implementations can be seen in our GitHub repositories 11. The AUC-score has an uncertainty that covers the errors made in the numerical integration to obtain the AUC-score. In our computation, the ROC curve of the anomaly detection algorithm is sampled at equally spaced thresholds from 0 to the maximum of the anomaly score. To obtain the AUC, a trapezoid quadrature is used. The error for the quadrature is the difference between the AUC estimated by an upper and a lower right-hand rule. Geometrically, this exactly covers the uncertainty introduced through the sampling process, but it does not include errors that might be present in the values of the ROC curve, which we assume to be smaller than the error that occurs from sampling. The errors are usually below 0.05, but the SMD set in particular shows AUC-scores around 0.5 and errors around 0.4, which points towards a sampling problem. This could be addressed by using an adaptive quadrature.
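The quadrature and its error bound can be sketched as follows; the threshold count is an illustrative assumption, and thresholds are iterated from the maximum down so that the sampled ROC curve is monotone:

```python
import numpy as np

def auc_with_error(scores, labels, n_thresholds=200):
    """AUC via trapezoidal quadrature over a threshold-sampled ROC curve.

    The reported error is the gap between the upper and lower rectangle
    rules, which covers the uncertainty introduced by threshold sampling.
    """
    thresholds = np.linspace(scores.max(), 0.0, n_thresholds)  # descending
    pos, neg = labels == 1, labels == 0
    tpr = np.array([(scores[pos] >= t).mean() for t in thresholds])
    fpr = np.array([(scores[neg] >= t).mean() for t in thresholds])
    widths = np.diff(fpr)                          # non-negative by construction
    auc = np.sum(widths * (tpr[1:] + tpr[:-1]) / 2)   # trapezoid rule
    upper = np.sum(widths * np.maximum(tpr[1:], tpr[:-1]))
    lower = np.sum(widths * np.minimum(tpr[1:], tpr[:-1]))
    return auc, upper - lower                      # value and sampling error

# Perfectly separated scores give AUC = 1 with zero sampling error.
scores = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 1.0])
labels = np.array([0, 0, 0, 1, 1, 1])
auc, err = auc_with_error(scores, labels)
```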
For simplicity, model 3 and model 10 are chosen for the discussion of the FEMI-index. Model 3 is an instance of the feed-forward type networks, and model 10 is an instance of the convolutional networks. The other networks of these types performed similarly.
Figure 3 shows the component and the polar FEMI-index for model 3 and model 10. The entropy and mutual information reported here are negative. This should not be possible; we suspect that this is due to an error in the experiment code. We still report all findings as is, since the results are reproducible 12 and the empirical relations we discuss here still hold. Comparing the top two images (3a, 3b) to the bottom two images (3c, 3d), it is noticeable that the component FEMI-index is spread much wider in the parameter plane, whereas the polar indices are mainly gathered around a "line-shaped" region in the center.
The figures show a connection between the FEMI-index of the data set and the AUC-score of the algorithm in some regions of the plots. In the top right region of the plots, no clear trend in the achieved AUC-score is visible on those sets; AUC-scores ranging from near 0 to near 1 are all present in that region. In contrast, on data sets that are indexed in the bottom left of the plot, the algorithm generally achieves higher, more consistent AUC-scores. This holds except for a few deviating data sets in that region. This trend is best grasped in the plots of the polar FEMI-index (figures 3c and 3d).
For the rest of the paper, visualizations of the polar FEMI-index are used since the phenomena that are discussed are easier to grasp visually in those plots.

NEAREST NEIGHBOUR (NN)-CLASSIFIER
In this section, a nearest neighbor algorithm is applied that gives an estimation of the performance of the classifier based on the FEMI-index and AUC-scores discussed in the last section. This classifier fulfills two purposes. Firstly, it is a way to investigate more precisely the phenomenon that the AUC-score, for some indices, is associated with the FEMI-index of the data set.
Second, finding a well-performing classifier for one algorithm means that it would be possible to get an estimate of its performance on a novel data source, based on the performance measured when benchmarking the algorithm on known data sources, by calculating the FEMI-index of the new data source and querying the classifier. This pre-training estimate saves computation time and could improve results when combined with e.g. an ensemble method.
Nearest neighbor algorithms are a standard tool in machine learning [15]. In this case, a nearest neighbor classifier was chosen for its simplicity and interpretability, since the performance of the classifier has to function as a proxy to further measure the quality of the FEMI-index.
In this section, the steps of the algorithm are presented. Due to its importance for the next steps, a small introduction to Gaussian error propagation is given. The chapter is concluded by a formal description of the nearest neighbor algorithm.
The nearest neighbor algorithm presented here works as follows: • The user inputs a point (E, MI) in the FEMI-plane.
• The user specifies a radius (a hyperparameter). A radius was chosen instead of a number of nearest neighbors, since interpretability in less populated regions of the domain seems easier. • The algorithm outputs the average of the AUC-scores of known data points that fall into a circle of the user-specified radius with the user-specified point at its center. • This average is weighted with a function that accounts for the distance of the points (points that are closer to the center of the circle have more influence on the output). • Additionally, a measure of the uncertainty is computed by calculating the propagation of uncertainty (Gaussian error propagation) for the AUC-score, the entropy, and the mutual information.
• The error obtained by propagating the uncertainty is compared to the standard deviation of the AUC-scores in the circle; the maximum of both is chosen. • If no point lies in the circle, an AUC-score of 0 ± 0.5 is returned.
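The steps above can be sketched as follows. The linear distance weighting and the use of the empirical standard deviation in place of the full Gaussian error propagation are simplifying assumptions for this sketch:

```python
import numpy as np

def nn_predict(point, index_points, auc_scores, radius):
    """Radius-based nearest-neighbour estimate of the AUC-score at a
    query point in the (E, MI) plane."""
    d = np.linalg.norm(index_points - point, axis=1)
    inside = d <= radius
    if not inside.any():
        return 0.0, 0.5                       # no neighbours: 0 ± 0.5
    w = (radius - d[inside]) / radius         # closer points weigh more
    w = w / w.sum()                           # normalisation
    pred = np.sum(w * auc_scores[inside])     # weighted average AUC
    spread = auc_scores[inside].std()         # stand-in for propagated error
    return pred, spread

# Query at (0, 0): the first two points fall inside radius 2, the third does not.
points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
aucs = np.array([0.9, 0.7, 0.5])
pred, err = nn_predict(np.array([0.0, 0.0]), points, aucs, radius=2.0)
```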
Since Gaussian error propagation is an essential part of the uncertainty quantification of the classifier, the next section provides a short introduction.

Gaussian Error Propagation
While Gaussian error propagation is a standard for propagating errors in calculations in experimental physics, it is relatively uncommon in computer science. Hence, this section provides a short introduction, since Gaussian error propagation plays a key role in the quantification of uncertainty done in the next sections.
There are many ways to derive Gaussian error propagation (e.g. from the Taylor series). In this introduction, the formula of the error propagation is motivated graphically.
The task of error propagation is as follows: given a function f : X → Y and a value x ∈ X with an associated uncertainty Δx ∈ X, for the calculated value y = f(x), the task of error propagation is to estimate Δy ∈ Y, the error of y.

Figure 3: The component and polar FEMI-index for models 3 and 10. An association between the position of the FEMI-index and the AUC-score can be seen, especially in picture c. The entropy and mutual information in the plots are negative. This is due to an error in the experiment code. The numbers reported here still function as an index to identify the data sets. As the results discussed here are of an empirical nature, we continue the discussion with these values.

For a one-dimensional function f, the errors can be found graphically as shown in figure 4.
The left-hand side of figure 4 shows geometrically how the error Δx of the value x affects the uncertainty of the outcome of f. For a one-dimensional function, this geometric error propagation is easy to draw, but it is not straightforward to do analytically. Notice how even for the rather simple function f presented in figure 4, the maximum value of the interval around f(x) is not explicitly known (meaning that it is some unknown value f(x′) for an x′ between x and x + Δx).
Gaussian error propagation can be seen as an approximation to this geometric error propagation.Instead of searching for the values that characterize the intervals of the geometric error distribution, one linearly approximates the function at the point where it is evaluated, and then calculates the interval bounds of the geometric error propagation for this linear approximation, which can be done analytically.This is depicted on the right-hand side in figure 4.
Formally, this writes out as follows. The deviation in the output space from the value f(x) can, in the one-dimensional case, be written as:

Δy ≈ f′(x) · Δx

Here, f′(x) is the slope of the linear approximation of f at x, and Δx is the strength of the displacement. The estimated error is then the Euclidean norm of this deviation. In the multidimensional case, where the (scalar) output y depends on multiple input variables x_i, Gaussian error propagation is the canonical multidimensional equivalent of this geometric intuition: the error Δx is a multidimensional vector, the gradient of f replaces the simple derivative, and instead of the norm of the one-dimensional deviation, the following term is computed:

Δy = sqrt( Σ_i ( (∂f/∂x_i)(x) · Δx_i )² )

Geometrically interpreted, this is the Euclidean length of the gradient of f at the point x, scaled dimension-wise by the entries of Δx.
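This formula is straightforward to evaluate numerically; the following sketch approximates the gradient with central finite differences (the function and values are illustrative):

```python
import numpy as np

def propagate_error(f, x, dx, eps=1e-6):
    """Gaussian error propagation Δy = sqrt(Σ_i (∂f/∂x_i · Δx_i)²),
    with the partial derivatives taken by central finite differences."""
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return float(np.sqrt(np.sum((grad * np.asarray(dx)) ** 2)))

# For f(x, y) = x * y at (2, 3) with Δx = 0.1, Δy = 0.2:
# Δf = sqrt((3 * 0.1)² + (2 * 0.2)²) = sqrt(0.09 + 0.16) = 0.5
df = propagate_error(lambda v: v[0] * v[1], [2.0, 3.0], [0.1, 0.2])
```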

Formal Description
This section is devoted to presenting the nearest neighbor classifier.
To define the output of the classifier, some functions have to be defined. On the plane of FEMI-index values, a distance measure d((E_1, MI_1), (E_2, MI_2)) : ℝ² × ℝ² → ℝ⁺ is used. In this work, we chose this distance to be the Euclidean distance.

For a constant radius r ∈ ℝ⁺, the set B_r(E, MI) := {(E′, MI′) ∈ ℝ² | d((E′, MI′), (E, MI)) ≤ r} is defined. It is the set of points in ℝ² whose distance from (E, MI) is smaller than or equal to r. For each data set i with FEMI-index (E_i, MI_i), an AUC-score AUC_i was calculated as well. The classifier's estimate for the AUC-score at a given point (E, MI) is defined as:

AUC(E, MI) = (1 / W) Σ_{i : (E_i, MI_i) ∈ B_r(E, MI)} w_i · AUC_i

In this equation, w_i is a weight function and W is its normalization. The weight function gives the possibility of weighting points in the FEMI-plane that are near the center of the circle more strongly than others; it decreases with the distance d((E_i, MI_i), (E, MI)). The normalization is calculated as:

W = Σ_{i : (E_i, MI_i) ∈ B_r(E, MI)} w_i

To obtain a measure of uncertainty for the classification, Gaussian error propagation is used, assuming that AUC_i, E_i, and MI_i each have an error. As discussed in section 3, there are areas in the FEMI parameter space where no clear association between FEMI-index and AUC-score can be seen. To account for these areas in the uncertainty quantification, the maximum of the error computed following the error propagation and the standard deviation of the AUC-score values in B_r(E, MI) is taken as the final uncertainty of the prediction.

EVALUATION OF THE NN-CLASSIFIER
In this section, the performance of the nearest neighbor algorithm is evaluated. The section is divided into two subsections. In the first, section 5.1, the classifier is compared to the two baselines. This serves two purposes. For one, surpassing these baselines is a statement about the quality of the classifier. Second, and more importantly: the only way the classifier can surpass the second baseline, which just returns the average of all benchmarked AUC-scores, is to utilize information that is not present in the average. Since this information is drawn from the FEMI-index, it follows that the index indeed carries information on the difficulty of the anomaly detection task for that given anomaly detection algorithm.
In the second subsection, the distribution of the classifier outputs is compared to the measured distribution (section 5.2). It is also discussed how the classifier generalizes and estimates uncertainty (section 5.3). Those sections are more focused on the properties of the classifier than on the FEMI-index.
Evaluation of the nearest neighbor algorithm is done by removing one point from the distribution of points and then predicting its associated AUC-score from its FEMI-index using the remaining distribution. This is done for every point in the distribution. The mean squared error (MSE) of the AUC-score values is used to assess the performance of the classifier.

Comparison to performance baselines
To get a better sense of the classifier's performance, its MSE is compared to the MSEs of two baseline competitors.
The first one is a classifier that outputs uniformly distributed random values between 0 and 1. Surpassing it essentially means that the classifier at hand is better than random guessing, and hence that some information on the performance of the algorithm in anomaly detection is obtainable from the FEMI-index and the AUC-data. Its MSE can be calculated from the expected contribution of each individual data set $i$:

$$\mathbb{E}\left[\mathrm{MSE}_i^{\text{random}}\right] = \int p(\mathrm{AUC}_{\text{random}} = x)\,(x - \mathrm{AUC}_i)^2 \, \mathrm{d}x \qquad (18)$$

Since we assume a uniform random distribution of the predictions $x$, $p(\mathrm{AUC}_{\text{random}} = x)$ evaluates to $1/(\text{maximal AUC} - \text{minimal AUC})$. The minimal and maximal values of the AUC are 0 and 1, so the integral simplifies to

$$\mathbb{E}\left[\mathrm{MSE}_i^{\text{random}}\right] = \int_0^1 (x - \mathrm{AUC}_i)^2 \, \mathrm{d}x = \mathrm{AUC}_i^2 - \mathrm{AUC}_i + \tfrac{1}{3},$$

and with that, the expected value for the MSE of the random classifier is

$$\mathbb{E}\left[\mathrm{MSE}^{\text{random}}\right] = \frac{1}{n} \sum_{i=1}^{n} \left( \mathrm{AUC}_i^2 - \mathrm{AUC}_i + \tfrac{1}{3} \right).$$

The other is a classifier that always outputs the mean $\overline{\mathrm{AUC}}$ of all measured AUC-score values. This classifier is an important baseline since, to outperform it, an association between the AUC-score and the data, based on the FEMI-index, has to be utilized. Surpassing its performance is a strong indicator that the index captures some relation between the difficulty of the anomaly detection task and the data. Its MSE can be calculated as

$$\mathrm{MSE}^{\text{avg}} = \frac{1}{n} \sum_{i=1}^{n} \left( \overline{\mathrm{AUC}} - \mathrm{AUC}_i \right)^2 .$$

The results of the comparison are listed in table 3; results for all trained classifiers are listed in appendix A.1. In all cases, the MSE of the random classifier is surpassed by the nearest neighbor classifier, regardless of the radius.
However, for all radii, the classifier fails to surpass the baseline of the average classifier. In some cases, the nearest neighbor classifier achieves a smaller MSE than the average classifier, but only by about 0.001, which is a very small difference compared to the uncertainty. For all we can tell, both classifiers perform equally well.
At huge radii, the nearest neighbor algorithm gives the same output as the average classifier, which is to be expected, since the average classifier is the asymptotic limit ($r \to \infty$) of the nearest neighbor algorithm.
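This limit is easy to check numerically. The FEMI positions and AUC-scores below are random stand-ins, not benchmark results, and the unweighted variant of the prediction is used.

```python
import numpy as np

def nn_predict_unweighted(query, points, aucs, r):
    """Unweighted nearest-neighbor prediction: mean AUC inside the circle."""
    d = np.linalg.norm(points - query, axis=1)
    return aucs[d <= r].mean()

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 10.0, size=(50, 2))  # synthetic FEMI positions
aucs = rng.uniform(0.3, 0.9, size=50)          # synthetic AUC-scores

# Once r covers every benchmarked point, the prediction at any query point
# collapses to the global mean, i.e. the output of the average classifier.
pred = nn_predict_unweighted(np.array([5.0, 5.0]), points, aucs, r=1e6)
```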
The absence of a measurably lower error compared to the baseline shows that, at least in the investigated example, the classifier could not benefit from the information that is provided by the FEMI-index.¹³
In some cases, the classifier's results had a lower MSE than the average classifier baseline. However, these numerical values cannot be interpreted, since the improvement is small compared to the errors of the values.
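The two baselines derived above reduce to closed-form expressions over the benchmarked AUC-scores, which can be computed directly; a minimal sketch:

```python
import numpy as np

def random_baseline_mse(aucs):
    """Expected MSE of uniform random predictions on [0, 1]:
    E[(U - a)^2] = a^2 - a + 1/3 for each benchmarked AUC-score a,
    averaged over all data sets."""
    a = np.asarray(aucs, dtype=float)
    return float(np.mean(a**2 - a + 1.0 / 3.0))

def average_baseline_mse(aucs):
    """MSE of the classifier that always outputs the mean of all benchmarked
    AUC-scores; this equals the variance of the scores."""
    a = np.asarray(aucs, dtype=float)
    return float(np.mean((a - a.mean()) ** 2))
```

For a single score of 0.5, the random baseline gives 1/12, while the average baseline is zero; any spread in the scores raises the average baseline above zero.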

Comparison of AUC-distributions.
To compare the distribution of the predicted AUC-scores and the measured AUC-scores, box plots of the distributions are used. They are depicted in figure 5 for model 3 (figure 5a) and model 10 (figure 5b). For both box plots, it can be seen that the bulk of the values predicted by the classifier (median and percentiles) lies at a lower AUC-score than the original distribution of the values. That is due to the large section depicted in the upper right of the FEMI-index plots (figure 3), where an association between FEMI-index and AUC-score is not noticeable. In this region, the average calculated by the classifier tends to be around 0.5.
The best values, MSE-wise, are achieved by choosing a radius of 5 or 10. With these radii, the classifier strikes the ideal balance between taking into account the local features of the index and having enough points for the average calculation to compensate for values that deviate from the local trend. In both plots, a narrowing of the distributions due to the calculation of the average is visible, resulting in the final distribution for a radius of 100, where every prediction is basically the average of the distribution.

¹³ Which does not mean that the index does not hold that information.

Generalisation
To see how the algorithm generalizes, it is evaluated across the whole region that can be seen in the plots shown in section 3. The prediction, as well as the uncertainty for that prediction, is mapped in the 2D plane of the FEMI-index. These plots can be seen for models 3 and 10, for the polar FEMI-index and a radius of 10, in figure 6.
As can be seen when comparing the generalized predictions for model 3 and model 10 (figures 6a and 6b), the generalization of the classifier reflects the local association of the AUC-score and the FEMI-index.
Furthermore, some features that can be seen in the uncertainty estimate are discussed. The areas discussed in the rest of this section are marked in figure 6c (A-D). First of all, the classifier outputs a high uncertainty in areas where there are no data points, which is by design. However, there are some regions (B, C) where the classifier is overconfident: the smallest error there is assigned to predictions that are based on only a few values, and these values lie at the edge of the space that is taken into account for the prediction. Further investigation is needed here. To clearly judge whether this is an issue or not, the probability that a data set is actually indexed by a FEMI-index in that region needs to be quantified.
In the regions where there are points, the uncertainty shows the desired behavior: in the region where the AUC-score is not directly associated with the FEMI-index (D, upper right), the error is higher than in the regions where there is an association. Even so, for our liking, this error could be higher, maybe even as large as 0.5.
Another feature worth mentioning is the huge influence of the single deviating point (A) on the prediction. However, this influence is reflected in the uncertainty.

CONCLUSION
In this paper, it is described how information-theoretical concepts can be used to characterize the properties of a time series data set. An association between this measure and the AUC-score of a model that was trained on that data is visible for the trained models. To investigate and quantify this association, a nearest neighbor classifier was created, featuring an uncertainty quantification based on Gaussian error propagation. A classifier that always outputs the mean of the measured AUC-scores was identified as the baseline; surpassing the performance of this classifier is only possible if further information from the index is used. The nearest neighbor classifier performed within the margin of error of this average classifier. This means that no clear mathematical evidence could be shown for the association that can be seen in the visualization of the index. To really grasp how the association between FEMI-index and classifier performance comes to be, further investigation should be conducted. Up to this point, we have only reported empirical results; a theoretical background that explains them is needed. We personally think that the concepts discussed in this paper are worth researching. A way to mathematically estimate the difficulty associated with the anomaly detection task for a data set would be a huge benefit to the field. The workflow proposed in this paper (build an NN-classifier based on the index, compare against the average classifier's MSE) is a methodology to evaluate and compare mathematical measures that are meant to classify data sets.

FUTURE WORK
The future work section is split into two subsections. One subsection points out extensions and additions to the presented experiments; the other shows potential variations and points to "branch off" from here.

Improvements to the existing methodology
As discussed above (section 3), there are some points on which to extend the experiments done here.
• The results shown here are all empirical. A theoretical background that explains the findings is needed.
• To get a more detailed picture of the FEMI-index, more data sets need to be added to the existing experiments (at the moment, most of the data for the testing originates from the UCR time series classification data set).
• For the convolutional and the feed-forward networks, the tendencies visible in the FEMI-index looked roughly the same (chaos in the upper right and mostly good datasets in the lower left). It would be interesting to compute the FEMI-index on more models to see if this general behavior changes with the model type.
• During testing, there were outliers: points in a region where the FEMI-index suggests that the anomaly detection algorithm would perform well on them, but which had a lower AUC-score than the points surrounding them. It would be interesting to investigate which properties make these points outliers.
• As suggested by Clara Hoffmann [3], there are known transformations of the data that do not change the Shannon entropy. It would be interesting to see how different data with the same FEMI-index would look.
• Christian Schlauch [2] proposed a different interpretation of the FEMI-index: we interpreted the relation of the AUC-values with the index as classifier dependent. Maybe this is only partially true. It could as well be that the FEMI-index is a theoretical upper bound for the performance of any classifier.
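To illustrate the kind of quantities such an index is built from, here is a histogram-based sketch of a Shannon entropy and a lagged self-mutual-information estimate for a univariate series; the paper's exact definitions of the FEMI components may differ.

```python
import numpy as np

def shannon_entropy(x, bins=16):
    """Histogram estimate of the Shannon entropy (in bits) of a series."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def lagged_mutual_information(x, lag=1, bins=16):
    """Histogram estimate of the mutual information (in bits) between the
    series and a copy of itself shifted by `lag` samples."""
    a, b = x[:-lag], x[lag:]
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of the original series
    py = pxy.sum(axis=0, keepdims=True)  # marginal of the shifted series
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

Under an entropy-preserving transformation of the data, the first estimate stays fixed, which is exactly what makes differently looking series with the same index value possible.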

Variants of the research presented here
The FEMI-index has some promising aspects, and opportunities might arise when it is used as a basis for selecting data sets for benchmarking or for pre-training performance estimation. But, as stated in the former sections, there is still a lot to be desired, mainly a mathematical way of quantifying the association between classifier performance and FEMI-index. During our investigation of this topic, we identified some potential variants of the algorithm that might be worth exploring.
• There might be other information-theoretical measures to explore for assigning an index to a data set, e.g. different entropy measures. Another parameter that might be worth including in an index is the estimated signal-to-noise ratio.
• The uncertainties in this paper were huge compared to the numeric values. Maybe there is a way to find an index for the classifier that is less prone to error.

There may also be other, less error-prone, more precise classifiers on the basis of the existing FEMI-index. At the moment, we chose a relatively simple classifier for the sake of interpretability.
• There are different algorithms that can be used for classification, e.g. forests or neural networks, that are less easy to interpret but hopefully achieve better classification.
• There are variants of the classifier at hand, e.g. trying different weight functions, making the radius dependent on the position in the FEMI parameter space, or choosing a distance metric for the circle that weights mutual information differently compared to entropy.

A APPENDIX A.1 MSE for the classifier for all models
The following tables show the classifier MSE achieved for all models:

A.2 Example plots for the Transformer-and RNN-Models
The following part of the appendix gives more detailed information on the models that were omitted from the discussion of the results in the main part of this work due to technical problems. The specific model parameters can be seen in table 4.
LSTM / GRU (models 4-7). These models consist of a GRU/LSTM that processes the input. The information is encoded in the cell states by one recurrent layer stack and then passed to another that should rebuild the input. Additionally, there is the option to flip the cell states before they are passed to the second layer stack and to flip the output, so that the network rebuilds the input from last to first, as described by Malhotra et al. [9].
Attention-Based Model (models 13-14). This model is a wrapper for the PyTorch implementation of the "Attention is all you need" transformer [16]. (This is a different approach than the state-of-the-art transformer-based anomaly detection algorithms, where the anomalies are detected by evaluating the attention mechanism instead of the reproduction [14][17].) The idea behind this model is as follows: both time series and speech can contain complex context-sensitive information, so the idea arises that the attention mechanism, which enables the transformer to process this information in speech, also brings benefits when processing time series.
In the wrapper used here, instead of a word embedding, the time series is either fed directly (piecewise) into the transformer, or a version of the time series that is preprocessed by a feed-forward neural network is passed. The output of the transformer is expanded back to the dimensionality of the input by another feed-forward network.

Figure 3 :
Figure 3: This figure shows the two components of the FEMI-index of multiple datasets. In addition, the AUC-score that was reached by an anomaly detection algorithm trained on the data is shown in the color index. The left two images (a and b) show the component FEMI-index, and the right two images (c and d) show the polar FEMI-index. Images a and c show the results for model 3, a feed-forward-network-based model. Images b and d show the FEMI-index for a convolutional-network-based model (model 10). An association between the position of the FEMI-index and the AUC-score can be seen, especially in picture c. The entropy and mutual information in the plots are negative; this is due to an error in the experiment code. The numbers reported here still function as an index to identify the data sets. As the results discussed here are of an empirical nature, we continue the discussion with these values.

Figure 4 :
Figure 4: This figure graphically shows the task of error propagation. The errors of the two quantities are derived graphically on the left-hand side. The right-hand-side plot shows a graphically conducted Gaussian error propagation compared to the geometric error (dotted line).

Figure 5 :
Figure 5: This figure shows box plots of the distribution of the AUC-score values that were measured during the benchmark and that were predicted by the classifier using different radii.

Figure 6 :
Figure 6: The left two plots (a and b) show how the nearest neighbor classifier generalizes for models 3 (a) and 10 (b). The right two plots (c and d) show the uncertainty the classifier estimates for its prediction. Additionally, in figure c, there are some annotated areas; the annotations are used in the discussion of these plots in the text.
Box plot for model 4.

Figure 7 :
Figure 7: These figures show the results discussed for the not properly trained model 4.
Box plot for model 14.

Figure 8 :
Figure 8: These figures show the results discussed for the malfunctioning model 14.

Table 1 :
This table lists the different models that were tested and their corresponding IDs. FF-AE stands for feed-forward autoencoder, CNN-AE for convolutional neural network autoencoder. A more in-depth description of the parameters and a link to the implementation that was used can be found in the text.

Table 2 :
This table shows the MSE of our two baseline classifiers and the nearest neighbor classifier for different values of the radius $r$, for the polar (P:) and component (C:) FEMI-index, for models 3 and 10. All uncertainties are obtained by error propagation.

Table 3 :
This table shows the MSE of our two baseline classifiers and the nearest neighbor classifier for different values of the radius $r$, for the polar (P:) and component (C:) FEMI-index. All uncertainties are obtained by error propagation.

Table 4 :
This table lists the details of the models which were excluded from the discussion of the results due to errors in the implementation (transformer) and stability problems while training (LSTM / GRU).