Multivariate Anomaly Detection with Domain Clustering

Existing time-series anomaly detection (AD) pipelines for cloud monitoring at scale commonly rely on isolated training per cloud service or cloud infrastructure component. However, with the increasing volume of data generated by thousands of services and components, there is an untapped opportunity for a more effective approach to detecting key performance indicator (KPI) anomalies by capitalizing on the abundance of available data. In this paper, we propose MADDoC, an unsupervised transfer learning framework for reconstruction-based anomaly detection on multivariate time-series data. We show how to efficiently leverage available KPIs in the realm of cloud infrastructure monitoring to generalize unsupervised time-series AD across infrastructure components. Compared to state-of-the-art approaches relying on isolated component-wise training, the MADDoC framework achieves superior Precision and F1 scores on public and internal time-series AD datasets by learning a strong reconstruction backbone on the time-series data across many components before fine-tuning to a specific component. Moreover, MADDoC achieves substantial cost savings in model training, with reductions of 60% to 75% when monitoring thousands of storage infrastructure components. Further, the framework overcomes the trade-off between training efficiency and AD performance of previous AD transfer learning approaches.


INTRODUCTION
Anticipating events that disrupt cloud service delivery for cloud service provider (CSP) customers is crucial to prevent service downtime, performance degradation, and resulting financial losses. The root cause of such performance-related incidents (PRIs) often correlates with anomalies found in key performance indicator (KPI) time-series data collected for increased observability of, e.g., microservices and cloud infrastructure. Thus, most CSPs have vast amounts of collected time-series data available and aim to efficiently detect the anomalies that correlate with PRIs. In practice, anomaly detection (AD) poses significant challenges due to the rarity of anomalies, the scarcity of accurate ground-truth labels, and the need to manually correlate anomalies across complex, inter-related services and infrastructure. Further, production environments involve monitoring thousands of cloud components, in our use case storage systems, each with vastly different KPI dynamics due to different configurations or usage, requiring a tailored detection model. To address these challenges, numerous approaches for time-series AD were proposed [6, 8, 9, 21, 23, 27]. Commonly, these methods treat each component as a separate domain and involve fitting a dedicated (multivariate AD) model or multiple (univariate AD) models per component [23]. Especially for deep learning (DL) based AD methods, such an approach incurs substantial costs in terms of training, evaluation, deployment, monitoring, and maintenance. Consequently, research started to focus on reducing initialization time and enhancing training efficiency for AD. However, although training efficiency was improved, previously proposed univariate clustering-based transfer learning methods [13, 29] come with a trade-off of degraded AD performance. The described approaches are validated on public datasets, which aim to replicate a real-life setting but fall short in terms of size, noise, and complexity.
To address this gap, we draw inspiration from related work on univariate KPI AD [13, 29] and propose Multivariate Anomaly Detection with Domain Clustering (MADDoC), an efficient, unsupervised transfer learning framework for multivariate time-series AD. We evaluate the framework on public and real-life internal time-series AD datasets and find that it achieves superior AD Precision and F1 scores by introducing offline training and transfer learning stages as well as, to the best of our knowledge, a novel lightweight Transformer-Autoencoder (TAE) for AD. Moreover, our procedure drastically improves training efficiency, by 60% to 75% when monitoring thousands of components, overcoming the trade-off between training efficiency and AD performance of previous AD transfer learning approaches. Our core contributions are (1) a grouping approach for multivariate time-series domains inspired by domain experts' troubleshooting workflows, (2) the validation of a transfer-learning-based approach for AD on large-scale, real-life cloud data, and (3) the introduction of the TAE architecture.

FRAMEWORK
We next describe the MADDoC transfer learning framework, designed to further automate the monitoring of 25,000+ storage systems at IBM Storage. The framework is of interest in any multivariate time-series AD scenario where a large number of replicated microservices, or similar infrastructure components, collect measurements for the same set of KPIs.
Suppose we monitor a large number of infrastructure components, each component recording its KPIs over time. The observations from each component form a multivariate time-series $X$ of measurements $\{x_1, x_2, \ldots, x_T\}$ where $x_t \in \mathbb{R}^d$. The observed anomalies may range from strong spikes in a single KPI (e.g. sudden slow response times) to complex multidimensional pattern shifts or unexpected seasonal behaviours (e.g. due to high usage at night). In this paper, we pursue a reconstruction-based AD approach that encodes the time-series $X$ into a lower-dimensional latent space $Z$ used to compute the reconstruction $\hat{X}$ of the original input (Cf. Fig. 2). Our approach builds on the assumption that our reconstruction model, the TAE, adapts to the normal behaviour of a system but struggles to reconstruct novel patterns. Finally, the task of our framework is to determine whether time-step $t$ is part of a PRI-related anomaly by thresholding the reconstruction error $e_t = \|x_t - \hat{x}_t\|$. As the nature of anomalies is inherently context-dependent, we serve one custom AD model per storage system in production. MADDoC consists of three main components to accomplish this task efficiently: offline pre-training, transfer learning, and online anomaly detection, as illustrated in Figure 1. We drastically reduce training time at scale by making use of transfer learning, as previously explored in [13, 29]. As opposed to training one AD model per infrastructure component from scratch, we obtain the per-component AD model by fine-tuning one of multiple pre-trained TAE models (Cf. Fig. 1, Markers 3 & 4). Each of the pre-trained TAE models covers a cluster of infrastructure components with similar KPI characteristics, where the clusters are obtained using model-based clustering (Cf. Fig. 1, Marker 2).

Step 1: Offline Pre-training. In this step, we train a TAE model on multivariate time-series KPI data from a large set of random infrastructure components in an unsupervised fashion. Using only the encoding part of this model instance, the latent representation $Z$ is used to obtain a component embedding (Cf. Fig. 1, Marker 1) for model-based infrastructure component clustering (Cf. Fig. 1, Marker 2). We use this approach to compute infrastructure component clusters and lastly pre-train one TAE model per cluster.

Step 2: Transfer Learning. To efficiently obtain the per-component AD model, we embed infrastructure components based on their KPI characteristics, re-using the TAE-based embedding model from the previous stage. We determine the nearest cluster centroid for every target component and derive component-specific TAE models through fine-tuning, without layer freezing, of the pre-trained cluster TAE models on the target component's data (Cf. Fig. 1, Marker 4).
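To make the transfer learning step concrete, the following is a minimal sketch of how a per-component model could be derived from the pre-trained cluster models. All names (`embed_fn`, `cluster_taes`, `centroids`) and the epoch count are illustrative assumptions, not the production implementation.

```python
import numpy as np
import tensorflow as tf

def fine_tune_for_component(windows, embed_fn, centroids, cluster_taes, epochs=10):
    """Sketch of Step 2: assign a component to its nearest cluster and
    fully fine-tune (no layer freezing) that cluster's pre-trained TAE.
    `windows`: array of shape (n_windows, seq_len, n_kpis) for one component.
    `embed_fn`, `centroids`, `cluster_taes`, and `epochs` are assumptions."""
    z = embed_fn(windows)                                      # component embedding
    c = int(np.argmin(np.linalg.norm(centroids - z, axis=1)))  # nearest centroid
    model = tf.keras.models.clone_model(cluster_taes[c])       # copy architecture
    model.set_weights(cluster_taes[c].get_weights())           # and pre-trained weights
    model.compile(optimizer="adam", loss="mse")
    model.fit(windows, windows, epochs=epochs, verbose=0)      # reconstruct the input
    return model
```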
Step 3: Online Anomaly Detection. In the last stage, the fine-tuned models are served to production. The online AD determines whether the observation $x_t$ from a given component is anomalous by applying the dynamic, parameter-free thresholding of Telemanom [4] to the reconstruction error from the fine-tuned TAE models.
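As an illustration of this last stage, below is a simplified, self-contained sketch of a Telemanom-style non-parametric dynamic threshold [4]. The candidate z-range, the exact scoring criterion, and the synthetic errors are assumptions, and the sketch omits refinements such as error smoothing and anomaly pruning from the original method.

```python
import numpy as np

def dynamic_threshold(errors, zs=np.arange(2.0, 10.0, 0.5)):
    """Pick epsilon = mean + z*std of the reconstruction errors that best
    separates a small set of extreme points from the rest (simplified)."""
    mu, sigma = errors.mean(), errors.std()
    best_eps, best_score = mu + zs[-1] * sigma, -np.inf
    for z in zs:
        eps = mu + z * sigma
        above = errors > eps
        if not above.any():
            continue
        below = errors[~above]
        # Reward thresholds that remove much of the mean/std while flagging
        # few anomalous points and few contiguous anomalous sequences.
        n_points = above.sum()
        n_seqs = (np.diff(above.astype(int)) == 1).sum() + int(above[0])
        score = ((mu - below.mean()) / mu + (sigma - below.std()) / sigma) \
                / (n_points + n_seqs ** 2)
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps

# Toy usage: per-time-step reconstruction errors with one injected anomaly.
errors = np.abs(np.random.default_rng(0).normal(size=1000))
errors[500:510] += 8.0
flags = errors > dynamic_threshold(errors)   # boolean anomaly flag per time-step
```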

FRAMEWORK DETAILS
In this section, we provide more details on the TAE model and the infrastructure component clustering. These two components tackle the complex problem of meaningful knowledge transfer for AD on high-dimensional, real-world time-series data.

Transformer Autoencoder
The Transformer Autoencoder (TAE) denotes a lightweight, undercomplete autoencoding architecture that encodes and reconstructs a given multivariate time-series input from a lower-dimensional latent representation (Cf. Fig. 2). The TAE serves as a robust backbone for our AD pipeline, with good reconstruction performance even during regularly occurring noisy periods.

Encoder Block. The encoder consists of repeated blocks of standard Transformer Encoders [24] followed by dense encoding layers. Every encoding stage progressively compresses the feature dimension by a factor of 2-3.

Decoder Block. The decoder is symmetrical to the encoder and consists of a matching number of blocks. Each block consists of a Transformer Encoder and a dense decoding layer, decompressing the latent representation.
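A minimal Keras sketch of this architecture is given below; the layer widths, head count, and window length are illustrative assumptions, as the exact configuration is not fully specified here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def transformer_block(x, num_heads=4, dropout=0.1):
    # Standard Transformer encoder sub-block: self-attention + feed-forward,
    # each with a residual connection and layer normalization.
    d = x.shape[-1]
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=max(d // num_heads, 1),
                                     dropout=dropout)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(2 * d, activation="relu")(x)
    ff = layers.Dense(d)(ff)
    return layers.LayerNormalization()(x + ff)

def build_tae(seq_len, n_kpis, enc_dims=(64, 24)):
    # Encoder: Transformer blocks interleaved with dense layers that
    # compress the feature dimension by roughly a factor of 2-3 per stage.
    inp = layers.Input(shape=(seq_len, n_kpis))
    x = inp
    for d in enc_dims:                                # e.g. 177 -> 64 -> 24
        x = transformer_block(x)
        x = layers.Dense(d)(x)                        # dense encoding (downsampling)
    # x now holds the latent representation used for component embeddings.
    for d in list(enc_dims[:-1])[::-1] + [n_kpis]:    # symmetric decoder: 24 -> 64 -> 177
        x = transformer_block(x)
        x = layers.Dense(d)(x)                        # dense decoding (upsampling)
    return Model(inp, x, name="TAE")

tae = build_tae(seq_len=60, n_kpis=177)               # window length is an assumption
tae.compile(optimizer="adam", loss="mse")             # reconstruction objective
```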

Infrastructure Component Clustering
Our KPI embedding approach for infrastructure component clustering relies on unsupervised representation learning using our TAE model. Further, inspired by promising results on the concatenation of embeddings with different contextual information (as in BERT [5] or fastText [1]) in the field of Natural Language Processing [25, 26], we incorporate prior knowledge about the component architecture and concatenate the embeddings of different parts of the architecture.

Component Embedding. Following the infrastructure component architecture encourages clustering to align with the workflow a domain expert would follow when characterizing a system. We group KPIs, with the help of domain experts, based on the underlying modules of a storage system, leading to a domain expert-informed representation obtained by concatenating the embeddings of the different KPI groups. The embedding of a storage system for one KPI group is obtained by masking the time-series data with zeros for all KPIs not belonging to the KPI group of interest, taking the latent representation of the masked data as the embedding for each time-frame, and averaging the embeddings across all time-frames.

Clustering. We utilize the K-Means [16] algorithm to accommodate the heterogeneity and high dimensionality of the components, ensuring suitable cluster sizes and numbers while avoiding the overhead of an excessive number of clusters produced by methods such as HAC [18].
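The masking-and-averaging embedding and the subsequent clustering could look roughly as follows; `encoder` (the TAE's encoder half), the group index lists, and the cluster count are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def component_embedding(encoder, windows, kpi_groups):
    """windows: (n_windows, seq_len, n_kpis) time-frames of one storage system.
    kpi_groups: list of KPI index arrays, one per domain-expert group."""
    parts = []
    for group in kpi_groups:
        masked = np.zeros_like(windows)
        masked[:, :, group] = windows[:, :, group]   # zero all KPIs outside the group
        z = encoder.predict(masked, verbose=0)       # latent embedding per time-frame
        parts.append(z.mean(axis=tuple(range(z.ndim - 1))))  # average over time-frames
    return np.concatenate(parts)                     # concatenated group embeddings

# One embedding row per system, then K-Means over all systems.
# k = 8 is an arbitrary placeholder; in practice it would be chosen to keep
# cluster counts and sizes manageable.
# embeddings = np.stack([component_embedding(encoder, w, kpi_groups) for w in systems])
# labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)
```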

Hyperparameters and Compute
We used the Adam [11] optimizer with an initial learning rate of 0.001, which is gradually decreased during training with the ReduceLROnPlateau callback [10]. Gradient clipping to 1 and a dropout of 0.1 are applied, and a batch size of 512 is used. The models are implemented in Keras [3]. All experiments were performed on NVIDIA Tesla V100 SXM2 32 GB GPUs.
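For reference, these settings translate roughly into the Keras training setup below. The ReduceLROnPlateau factor and patience are assumptions, the text does not specify whether clipping is by norm or by value (so `clipnorm` is used here), and `tae`, `train_windows`, and `val_windows` are assumed from the earlier sketch.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

tae.compile(optimizer=Adam(learning_rate=1e-3, clipnorm=1.0), loss="mse")
lr_cb = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5)  # assumed values
tae.fit(train_windows, train_windows, batch_size=512,
        validation_data=(val_windows, val_windows), callbacks=[lr_cb])
```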

Quantitative Validation on internal dataset
We conducted a validation of MADDoC using an internal dataset that reflects the complexity of the problem. The dataset consists of 177 KPIs monitored for 500 storage systems over a period of 24 weeks. We refer to the components as systems from here on, as this is our concrete use case.
As we do not have clean ground truth, we used support tickets as label proxies, resulting in noisy labels that are reflected in low F1 scores. We mark a time-step as anomalous if a support ticket that indicates an issue in the system was active. The training set spanned 13 weeks and had no tickets, hence no labeled anomalies; the validation set spanned 2 weeks without tickets; and the test set covered 9 months with tickets and thus anomalies potentially present. We evaluated MADDoC on all ticket information available for the 500 systems and, separately, on systems that had a performance-related incident in the test set, filtering out irrelevant tickets. We argue that our proxy F1 scores, together with the Mean Squared Error (MSE), can be used to estimate the true F1 score, following a similar approach to Goswami et al. [7]. Table 1 compares MADDoC with the per-system training approach and with the general cluster models without further target-specific fine-tuning. All scenarios share the same parameter-free thresholding method, Telemanom [4], and differ only in the data and method the reconstruction models were trained on. Notably, the general cluster models already exhibit substantial improvements in MSE, Precision, and F1 scores. The fine-tuning step yields further reductions in reconstruction error, as well as higher Precision and F1 scores. However, the increase in Precision comes at the expense of Recall, illustrating the classical Precision-Recall trade-off.
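A minimal sketch of this ticket-based proxy labeling and the resulting point-wise metrics (function and variable names are illustrative):

```python
import numpy as np

def ticket_proxy_labels(n_steps, ticket_windows):
    """Mark every time-step during which a relevant support ticket was
    active as anomalous; ticket_windows is a list of (start, end) indices."""
    y = np.zeros(n_steps, dtype=bool)
    for start, end in ticket_windows:
        y[start:end] = True
    return y

def precision_recall_f1(pred, true):
    # Point-wise scores against the noisy proxy labels.
    tp = np.sum(pred & true)
    fp = np.sum(pred & ~true)
    fn = np.sum(~pred & true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```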
Table 2 showcases the excellent performance of our proposed TAE model, which exhibits significantly lower validation losses compared to a similarly sized LSTM-AE that compresses to the same latent dimension.

Qualitative Inspection
Figure 3 provides valuable insights into the distinction between MADDoC and the per-system training approach. The heatmap generated by MADDoC exhibits less noise and shows minimal signal during normal periods in the data. Conversely, the per-system trained model generates significant reconstruction errors during various instances of normal system behavior, resulting in multiple false positive predictions. During the test period, a customer ticket was issued, as indicated by the green bar. Both MADDoC and the per-system approach detect the anomalous behavior in the system and identify an anomaly a few timestamps prior to the start of the ticket. However, the heatmap clearly demonstrates the superiority of MADDoC over the per-system training approach in accurately reconstructing normal system behaviour while still detecting significant anomalous events.

Training efficiency
Figure 4 presents a comparison of the total training times between the per-system approach and MADDoC on our internal dataset. Our proposed method, MADDoC, demonstrates a significant reduction in training time of more than 60%, highlighting the practical utility of the MADDoC framework. All models were trained until convergence.

Table 3: Quantitative results for MADDoC with TAE on the public dataset SMD [23]. We report F1 score, Precision, and Recall (as %) in comparison to previous benchmarks reported in [28].

MADDoC Validation on SMD
We evaluated our framework on the Server Machine Dataset (SMD) [23], consisting of 38 KPI streams for 28 different systems from a large internet company; a scale too limited to truly estimate performance for large-scale cloud monitoring, yet the closest public benchmark to our problem. This required adjustments to our clustering strategy, as the component-architecture-dependent KPI groupings are unique to our data. Therefore, we present results based on clustering average system embeddings with K-Means, skipping the step of concatenating custom embeddings based on domain expert knowledge, as sketched below.
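For SMD, the embedding step therefore reduces to averaging latent vectors per machine, roughly as follows; the `encoder` handle, `machine_windows`, and the cluster count are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# machine_windows: list of (n_windows, seq_len, 38) arrays, one per SMD machine.
emb = []
for w in machine_windows:
    z = encoder.predict(w, verbose=0)                   # latent representations
    emb.append(z.mean(axis=tuple(range(z.ndim - 1))))   # average system embedding
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(np.stack(emb))
```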
In Table 3, we compare MADDoC to other methods. The fine-tuned embedding-based cluster model surpasses the per-system models in terms of F1 score. On the other hand, the pure cluster models lag significantly, primarily due to low Recall. This finding reinforces the trend observed in our own data and highlights the necessity of the fine-tuning step. The publishers of SMD recommend training and testing the 28 subsets separately [23]. This recommendation aligns with our finding that models trained on multiple systems perform poorly in terms of F1 score. In this paper, we demonstrate that adding a fine-tuning step after training on multiple systems can outperform per-system training approaches. Our method proves competitive with state-of-the-art approaches and achieves state-of-the-art Precision scores. It is worth noting that while MADDoC achieves the highest Precision, the per-system models already outperform benchmark models on this metric. We attribute this improvement to the effectiveness of our TAE model.

RELATED WORK
AnoTransfer [29], JumpStarter [15], and ROCKA [13] propose approaches that share conceptual similarities with ours. All of these methods cluster and fine-tune on short univariate KPI data snippets. While ROCKA and JumpStarter employ traditional clustering methods, AnoTransfer utilizes a DL-based approach, employing model embeddings and an adapted HAC [18] algorithm to match the shapes of KPI data snippets. These methods notably reduce training times with comparable or slightly inferior F1 scores. The task in these works is to cluster single univariate KPIs that show similar behaviour, as opposed to our clustering of the underlying domains that generate complex multivariate KPIs. Note that clustering univariate time-series data based on short snippets can be done through a direct embedding or shape-matching style approach, whereas clustering time-series data with 100+ dimensions over a time-frame more than 10 times longer is a much harder problem, which we needed to guide through our domain-expert-based embeddings. Our approach thus differs significantly due to the unique challenges we address in this study, such as the high dimensionality and length of the KPIs, and the target application. The data these methods were evaluated on further differentiates our approach: we aim to build and validate a system that works in a real-world, noisy, complex data scenario, whereas previous methods benchmark on much cleaner data. The validation of transfer learning approaches for AD on highly complex datasets is thus a contribution of ours, demonstrating that these methods can be used for large-scale applications.

DISCUSSION & CONCLUSION
The results of our study provide evidence of how the MADDoC framework is useful for anticipating PRIs by efficiently detecting relevant anomalies, leveraging data across components while ensuring system specificity. We clearly demonstrate that MADDoC improves training efficiency (reducing compute cost) while at the same time improving AD performance, especially in terms of Precision. This is crucial for Support Engineers' acceptance, as previous iterations had overwhelmed the support team with false positive predictions, leading to a loss of trust in the alert system. The improved AD system has a direct positive impact on a faster troubleshooting pipeline, thus improving system reliability, reducing downtime, and improving cloud usability. We attribute this success to the robust clustering of our long, noisy, and high-dimensional data through domain-expert-guided embeddings, the additional full fine-tuning step, and the TAE model. We observed overall low F1 scores on our dataset compared to scores on public datasets. This was expected, as our labels are mere proxies of the underlying anomaly distribution. As the support tickets are our best available signal of an anomalous event that may have impacted a customer, we still see evaluation on these labels as desirable. In addition to MADDoC, we proposed the TAE architecture, which performed remarkably well on the reconstruction task in comparison to a similarly sized LSTM-based AE. Both models were implemented with limited hyperparameter optimization, as is commonly done when comparing model architectures [17, 19, 22]. As future work, we aim to extend MADDoC along the following directions. First, an adaptive fine-tuning technique that freezes layers based on a system's proximity to the cluster centroid could further reduce training cost while ensuring system specificity. Second, while our TAE showed good results, MADDoC aims to be model-independent and to serve as a general framework that can be used with arbitrary scoring and thresholding methods, depending on the application's needs. Further, our method favors Precision over Recall due to the described amount of false positives arising from noise and the lack of differentiation between malicious and harmless anomalies. We are working on improved labels and the integration of other data sources into MADDoC to predict the harmfulness of an unexpected pattern.

Figure 1: Flow Diagram of the MADDoC framework. Concatenated embeddings of the different KPI groups (1) are used for component clustering (2) based on Transformer Autoencoder (TAE) latent representations.

Figure 2: Our simple TAE model architecture. The TAE Encoder takes a multivariate time-series input and converts it into latent embeddings, which are then reconstructed by the TAE Decoder. The encoder consists of Transformer Encoder blocks interleaved with dense downsampling layers, while the decoder uses dense upsampling layers instead.


Figure 3: Heatmaps (horizontal axis: time; vertical axis: KPIs) of normalized reconstruction error for different training approaches on the same system. The bar above the heatmap illustrates the predicted anomalies (blue) and a customer ticket related to a performance issue (green). Clearly visible, the (a) per-system training approach produces much noisier reconstructions than (b) MADDoC with additional cluster context and fine-tuning.


Figure 4: Total training times of the per-system method compared to our MADDoC framework for the TAE architecture on our dataset of 500 systems.

Table 1: Validation MSE, test F1 score, Precision, and Recall (as %). 'All' denotes evaluation on all 500 systems; 'filtered' denotes evaluation on systems that contain a performance-related issue in their test set, filtering out irrelevant tickets.

Table 2: Comparison of LSTM-AE and TAE. We report validation MSE for the per-system training approach and the MADDoC cluster + system fine-tuning approach.