Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection

Massive key performance indicators (KPIs) are monitored as multivariate time series (MTS) data to ensure the reliability of software applications and service systems. Accurately detecting abnormalities in MTS is critical for subsequent fault elimination. The scarcity of anomalies and manual labels has led to the development of various self-supervised MTS anomaly detection (AD) methods, which optimize an overall objective/loss encompassing all metrics' regression objectives/losses. However, our empirical study uncovers the prevalence of conflicts among metrics' regression objectives, causing MTS models to grapple with different losses. This critical aspect significantly impacts detection performance but has been overlooked in existing approaches. To address this problem, by mimicking the design of the multi-gate mixture-of-experts (MMoE), we introduce CAD, a Conflict-aware multivariate KPI Anomaly Detection algorithm. CAD offers an exclusive structure for each metric to mitigate potential conflicts while fostering inter-metric promotion. Upon thorough investigation, we find that the poor performance of vanilla MMoE mainly comes from the input-output misalignment of the MTS formulation and from convergence issues arising from the expansive number of tasks. To address these challenges, we propose a straightforward yet effective task-oriented metric selection and a p&s (personalized and shared) gating mechanism, which establish CAD as the first practicable multi-task learning (MTL) based MTS AD model. Evaluations on multiple public datasets reveal that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods. Our code is accessible at https://github.com/dawnvince/MTS_CAD.


INTRODUCTION
With the rapidly increasing number of Internet applications and users, ensuring the stability of Internet services and software systems is now critically important. Service Level Agreements (SLAs) on service reliability call for real-time and accurate incident identification. To improve the observability of a system, operators deploy monitoring programs that produce large amounts of time series data to monitor the status (i.e., metrics) of entities (e.g., software systems, online services) in different application domains (e.g., IT systems, the manufacturing industry), providing rich information for anomaly detection and incident alerting. Traditional metric-level anomaly detection methods determine whether to report an incident based on manually set thresholds on each metric; they have become unqualified for stricter SLAs and less effective given the explosive growth in the number of metrics.
There is a natural tendency for anomalies to be detected in univariate time series (UTS) [5,17,26], that is to say, anomaly detection is performed on each metric separately. However, researchers have found that in complex real-world systems, different UTS interact with each other, which is called inter-metric dependency [14]. Anomaly detection on each metric separately may therefore lead to considerable false negatives. To better illustrate this, we take the CPU utilization of a certain service and its Query Per Second (QPS) metric as examples. At some timestamp, we observe that QPS drops while CPU utilization increases. Under normal circumstances, QPS has a positive relationship with CPU utilization, so this indicates an anomaly because the normal inter-metric dependency is violated. A UTS-based anomaly detection model fails to detect this anomaly, as each metric's absolute value is within the normal range. On the contrary, the anomaly can be easily detected by an MTS anomaly detection model, as it models not only the intra-metric temporal dependency but also the inter-metric dependency. In this paper, we focus on anomaly detection for multivariate time series data (MTS for short hereafter). Various MTS anomaly detection methods have been proposed to model the correlations between different metrics. Almost all of them use self-supervised learning frameworks, more specifically regression learning, due to the scarcity of anomalies and manual labels. All metrics' regression objectives compose the overall optimization objective/loss.
After considerable empirical investigation, we observe that the objectives of different subtasks may not be consistent, and can even be disparate in some cases. A representative case is that, while most metrics have stable baselines and patterns, on one single metric or some group of metrics, either baseline drifts or inherent stochastic fluctuations (BD&ISF for short) happen frequently; contrary to naive intuition, neither is labeled as an anomaly. Analysis and consultation with domain experts confirm that such metrics are justified in real-world scenarios. In this case, the objectives of the stable metrics induce the model to pay more attention to subtle variations (hence becoming sensitive to BD&ISF), while metrics with BD&ISF require the model to be insensitive to these changes. We refer to this as conflict. As a result, the objectives at odds cause the gradients of model parameters to descend in different or even opposite directions. This weakens the capability of the model and ultimately degrades detection performance.
However, we find that existing models cannot deal with loss conflicts among metrics. These methods can be roughly divided into two classes: graph-based and sequence-based. The graph-based methods [4,7,9] take each metric as a node, construct a complete graph between them, and apply extensive graph-neural-network techniques to model inter-metric dependency. In sequence-based methods, attention-based [22,24] or RNN-based [11,14] mechanisms are widely used to extract sequential information. Unfortunately, none of them considers conflicts between learning objectives from a framework perspective or takes active measures to isolate the effect scopes of these contradictory objectives. This causes poor performance when conflicts occur.
The aforementioned findings call for a detection method capable of eliminating conflicts. As the conflicts are essentially discrepant objectives, we expect the gradient-descent processes of conflicting losses to be isolated to some extent. A naive idea is to train a separate model for each metric that takes into consideration the influences of other metrics. However, it is unrealistic to train one model on all data for every metric, since the number of metrics to be detected is increasing rapidly. Google [15] proposed the multi-gate mixture-of-experts (MMoE), which uses a group of experts and multiple gates to leverage task correlations while avoiding interference among tasks under a multi-task learning framework. We prepare an exclusive structure for each metric and attempt to borrow ideas from MMoE in the multivariate time series anomaly detection domain. Unfortunately, unlike the information-retrieval domain that MMoE is designed for, introducing the idea of MMoE in the anomaly detection domain cannot meet our expectations due to the following challenges:
• Misalignment of input and output spaces: Due to the employment of self-supervised regression learning or its variants, the output of an isolated subtask (one specific metric) is a sub-space of the whole input feature space (all metrics). This requires an appropriate mapping mechanism for subtasks to extract more related information from the large feature space.
• Convergence issues: A monitoring system produces massive numbers of metrics for anomaly detection, each of which is viewed as a subtask. The number of subtasks in MTS tasks is dozens of times larger than in the vanilla MMoE scenario. As the number of subtasks increases, the naive gate structure fails to spread the gradients to experts in a stable way, causing oscillations in the experts' parameter updates and ultimately hindering the convergence of the model.
To address the above challenges, in this paper we propose CAD, a Conflict-aware multivariate Anomaly Detection algorithm. CAD trains a number of experts which model the temporal-spatial dependency from multiple perspectives with the help of convolution networks. The expert networks are then combined in a weighted-summing manner. For each specific metric, an automatic gating mechanism assigns personalized weights to the different experts. In this way, compared with current MTS anomaly detection models which share a single network, CAD can flexibly learn personalized inter-metric dependencies for each target detection metric while isolating the negative effects brought by conflicts.
To handle the misalignment of the input and output spaces of MTS, a task-oriented feature selection as well as a p&s (personalized and shared) gating mechanism is designed. Additionally, the p&s gating mechanism greatly improves the robustness and convergence of the model as the number of tasks increases. The personalized gate selects the most related experts for each task, which prevents the experts from collapsing into learning the same thing. The shared gate ensures robust expert selection so that the expert networks converge quickly.
The main contributions of our paper can be summarized as follows:
• Our work is based on a previously unconsidered observation: discordance of data distributions is likely to cause conflicts between the objectives of metrics, which harm the detection performance of existing methods. We propose CAD, a Conflict-aware MTS Anomaly Detection algorithm, to address the limitations of existing models when dealing with conflicts in MTS.
• We summarize the key challenges encountered when eliminating the impact of conflicts, proposing a task-oriented feature selection to prompt the subtasks to focus more on their own patterns, and a p&s gating mechanism to make the model more robust when facing massive subtasks. Well-designed experts are capable of extracting both temporal and inter-metric dependencies embedded in MTS from multiple perspectives, significantly enriching the expressivity of the whole model.
• We conduct comprehensive evaluations on multiple open-source public datasets, showing that CAD outperforms the state of the art by considerable margins (average improvements of 4.3% to 37.9% over three datasets under best-F1 with the point-adjustment approach, and 4.2% to 93.1% under best-F1 with the k-th point-adjustment approach). Our code is publicly released.
This paper is organized as follows. In Section 2, we give a formulation of MTS AD tasks and an illustration of conflict. The methodology of our model is presented in Section 3 and the experimental setup in Section 4. We offer comprehensive evaluations of our model in Section 5. We review related works in Section 6 and conclude the paper in Section 7.

BACKGROUND
In this section, we formulate the MTS anomaly detection problem and provide a brief introduction to MMoE. Then we use a case to help readers better understand the concept of conflict.

Problem Formulation
Multivariate time series consist of multiple time-aligned curves, each of which represents successive observations of one metric over a long period of time. An MTS with T consecutive timestamps and M metrics is represented by an ordered sequence X = (x_1, x_2, ..., x_T), where x_t is the set of observations of all metrics at timestamp t ∈ [1, T], i.e., x_t = (x_t^1, x_t^2, ..., x_t^M), and each datapoint x_t^i denotes the value of metric i at that time point. Given the current observation x_t at timestamp t on the test set, an MTS anomaly detection system calculates an anomaly score based on historical observations, then identifies whether x_t is anomalous by comparing its anomaly score with a threshold.
In the MTS anomaly detection task, contextual observations, i.e., observations at nearby timestamps, play an essential role in understanding the current data because they notably describe the relevant temporal patterns [2,23]. Thus for observation x_t, rather than simply using the standalone vector x_{t-1}, we take a sliding window of length W, (x_{t-W}, ..., x_{t-1}) (denoted by w_t), as input to precisely capture sequential dependency in our implementation.
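The sliding-window construction above can be sketched in a few lines of NumPy (a minimal illustration with a prediction horizon of 1; the function name and array layout are our assumptions, not from the paper):

```python
import numpy as np

def sliding_windows(X, W):
    """Build windows w_t = (x_{t-W}, ..., x_{t-1}) from an MTS array X of
    shape (T, M).  Returns windows of shape (T-W, W, M) together with the
    target observations x_t of shape (T-W, M)."""
    T, M = X.shape
    windows = np.stack([X[t - W:t] for t in range(W, T)])  # (T-W, W, M)
    targets = X[W:]                                        # (T-W, M)
    return windows, targets

X = np.arange(20, dtype=float).reshape(10, 2)  # T=10 timestamps, M=2 metrics
w, y = sliding_windows(X, W=4)
# w[0] holds x_0..x_3 and y[0] is x_4
```

With a larger horizon h (a hyperparameter discussed in Section 4.2), the target would simply shift to x_{t+h-1}.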

Basis of MMoE framework
Figure 1 caption: Anomalous segments at the metric level are highlighted in red, and the segment corresponding to a reasonable drift in metric 3 is highlighted in blue. The regions enclosed by the yellow boxes share a similar distribution. A previous method's detection result, as well as ours, is zoomed in and listed below. Red backgrounds in the results mark the union of all metrics' anomalies (including the three metrics plotted).

The MoE layer proposed by Eigen et al. [8] and Shazeer et al. [20] leverages the techniques of ensemble learning, introducing a gate
module to make the network more sparse. This allows for the incorporation of more parameters without incurring additional computational costs. Ma et al. [15] further adapt the layer to multi-task learning, substituting the single gate with multiple gates, each allocated to a specific task, to utilize shared embeddings while gracefully handling complex task-correlation issues. The k-th task of the Multi-gate Mixture-of-Experts (MMoE) model can be formulated as follows:

y_k = h_k(g^k(x) ⊙ f(x))

Here, ⊙ denotes the element-wise product, f(*) denotes the set of intermediate results produced by the experts, g^k(*) denotes the gate structure, and h_k(*) denotes the model of the k-th downstream task, known as the tower network. Given n experts' output embeddings f_1(x), ..., f_n(x), g^k(x) ∈ R^{1×n} holds the weights of the experts; hence g^k(x) ⊙ f(x) assembles all embeddings into one as the input of the k-th tower. Each tower gives its final result based on its particular aggregation of embeddings.
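A minimal sketch of this MMoE forward pass may help make the notation concrete (the linear softmax gate and all names here are illustrative assumptions in the spirit of [15], not the paper's exact implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mmoe_forward(x, experts, gates, towers):
    """Minimal MMoE forward pass: each expert maps input x to an embedding,
    each task's gate produces one weight per expert, and the weighted sum
    of embeddings is fed to that task's tower network."""
    F = np.stack([f(x) for f in experts])      # (n_experts, d)
    outs = []
    for Wg, h in zip(gates, towers):
        g = softmax(x @ Wg)                    # per-task expert weights
        fused = (g[:, None] * F).sum(axis=0)   # g^k(x) ⊙ f(x), summed
        outs.append(h(fused))
    return outs
```

For example, with two toy experts `x*2` and `x+1`, a zero gate matrix (uniform weights) and a summing tower, the single task receives the average of the two expert embeddings.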

Illustration example of conflicts
An illustrative example of conflicts is presented to motivate our approach (Fig. 1). In a dataset sampled from a real-world software system [23], there are three metrics: metric 3, metric 14 and metric 18. During one particular period, a sudden baseline drift (highlighted in blue in Fig. 1) happens in metric 3, while all other metrics behave normally. Despite the drift, the distribution of data in this period is very similar to two previous normal regions (enclosed by the yellow boxes). Meanwhile, the baselines of all three regions are at different levels, showing the metric's inherent sensitivity to normal external factors, e.g., the load-balancing strategy. Thus we consider that no anomalies occurred in this period, which is confirmed by the labels (anomalies are highlighted in red in Fig. 1). While this change is reasonable, the baseline drift of metric 3 misleads the parameter updating of existing models. An obvious drift can be observed in the output anomaly score of a previous model (shown as USAD in Fig. 1), which eventually misjudges this period as an anomaly.
The detection result indicates that our method handles conflicts effectively, while previous works perform poorly as they are influenced by conflicts to some extent. More detailed discussion and evaluation of this case can be found in Sec. 5.

METHODOLOGY
In this section, we first give an overview of the system. Then we introduce the sub-components of the model individually, including the task-oriented feature selection, the expert network, the personalized & shared gate, and the tower network. Finally, we introduce the loss-balancing module.

Overall Structure
As illustrated in Fig. 2, we leverage a group of task-oriented isolated structures to tackle the conflict problems discussed in Section 1. Since previous works have no independent structures to determine whether inter-metric dependencies are helpful or harmful to a specific metric, the correlation may have a negative impact on describing the reasonable patterns of the current metric. Our approach solves the conflicts in an elegant manner, as each task has its own structure to autonomously select the expert-derived features that are highly correlated with it. Moreover, each isolated tower becomes even more focused on its own metric, since irrelevant information is separated out.
However, original frameworks like MMoE are ill-suited to MTS problems. Firstly, the misalignment of input and output spaces distracts the gates from focusing on the more relevant patterns. Secondly, well-designed structures are required to capture the complicated temporal-spatial dependency. Thirdly, massive numbers of metrics bring more interference to feature extraction, which calls for a way for experts to converge on specific traits.
To address the problems mentioned above, we design CAD, a hierarchical, unsupervised, forecasting-based model. First, a two-dimensional input window is fed into multiple expert networks, each including a convolution layer to enrich the representation of temporal and inter-metric dependency (Section 3.3). The hybrid gating mechanism (Section 3.4), designed for massive numbers of metrics, transforms the feature space selected for each metric (Section 3.2) into a set of weights to fuse the embeddings extracted by different experts. The fused embedding is then sent to its corresponding tower network (Section 3.5). All towers jointly finalize anomaly scores based on the gap between predictions and ground-truth values. For the multiple-loss balance problem, we preprocess the data to balance all metrics' losses (Section 3.6). These components are described in detail below.

Task-oriented Feature Selection
In a typical scenario where MMoE works, there is no obvious correlation between the downstream task and any particular dimension of the input data. In other words, each dimension of the input is equally significant for the current task, so each gate calculates the weight combination of experts based on all input data. This assumption, however, does not hold in MTS anomaly detection. Here, for a subtask determining whether a particular metric is abnormal, it is counterintuitive that historical information from the overall input feature space composed of other metrics is as important as that from the metric's own feature space. More metrics mean more opportunities to distract the limited-capacity structure, especially when conflicts and reasonable drifts exist among metrics (discussed in Section 5.2). We therefore split the input at the metric level, using each metric's own local time window instead of all metrics' windows, to prompt the gate structure to learn personalized mappings from time series to distributions over experts. Let w_t^k denote the time window of the k-th metric in w_t defined in Sec. 2.1. The embeddings of the raw input w_t at time t consist of all experts' outputs:

f(w_t) = { f_1(w_t), ..., f_n(w_t) }

where f_i(*) denotes the output of the i-th expert network given input w_t, which is described in detail in Sec. 3.3. Then the embedding e(k, t) sent into subtask k is combined according to the weights given by gate g^k:

e(k, t) = g^k(w_t^k) ⊙ f(w_t)

Finally, each subtask k corresponding to metric k calculates its prediction value ŷ^(k) from e(k, t).
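The selection-and-fusion step can be sketched as follows (a simplified NumPy sketch; the linear softmax gate and all names are our assumptions, and the hybrid shared/personalized refinement of the gate is deferred to Section 3.4):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def task_oriented_fusion(w_t, expert_embs, gate_mats):
    """For metric k, the gate sees only that metric's own window w_t[:, k],
    not the full (W, M) input, and turns it into weights over the shared
    expert embeddings expert_embs of shape (n_experts, d)."""
    W, M = w_t.shape
    fused = []
    for k in range(M):
        g = softmax(w_t[:, k] @ gate_mats[k])            # (n_experts,)
        fused.append((g[:, None] * expert_embs).sum(0))  # e(k, t)
    return np.stack(fused)                               # (M, d)
```

The key contrast with vanilla MMoE is the gate input: `w_t[:, k]` rather than the whole window, which is the task-oriented feature selection described above.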

Expert Network
Expert networks are the main structure for extracting rich traits from time series. Each expert consists of one convolution layer and two feed-forward layers. Previous works primarily use RNN-based units, e.g., LSTM or GRU, to capture temporal patterns [4,13,23]. However, their inherently sequential nature precludes parallel computation, dramatically increasing training time. By contrast, the convolution network, known for its superior computational parallelism, shows a powerful capability of extracting features in time series tasks [3,10,12]. In particular, kernels can sharpen changes within a successive region, which suits MTS anomaly detection tasks even better.
In our implementation, shown in Fig. 3, the convolution layer contains C kernels of width W (equal to the window size) and height 1. As a kernel sweeps through the input w_t covering all M metrics, each metric's window is convolved into a single value, so the layer produces a feature map of size M × 1 × C. In this way, we obtain temporal dependency through the metric-level convolution operation, while obtaining inter-metric dependency through the shared filters. The feature map is then flattened and fed into a two-layer fully-connected network, forming an embedding of the original input. All experts' embeddings constitute a candidate set.
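One expert's forward pass, under our reading of the description above, can be sketched as (the dense-layer shapes and ReLU activation are illustrative assumptions):

```python
import numpy as np

def conv_expert(w_t, kernels, fc1, fc2):
    """Sketch of one CAD expert: C kernels of width W and height 1 reduce
    each metric's length-W window to a single value, giving an (M, C)
    feature map that is flattened and passed through two dense layers.
    w_t: (W, M) window; kernels: (C, W); fc1: (d_h, M*C); fc2: (d_e, d_h)."""
    # (M, W) @ (W, C) -> (M, C): each kernel dot-products every metric's window
    feat = w_t.T @ kernels.T
    flat = feat.reshape(-1)           # flatten the M*C feature map
    h = np.maximum(fc1 @ flat, 0.0)   # first dense layer + ReLU
    return fc2 @ h                    # expert embedding
```

Because each kernel spans the full window width but only one metric's row at a time, temporal structure comes from the dot product and inter-metric structure from sharing the same C filters across all metrics.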

Personalized & Shared Gate
A hybrid gate structure is employed to map the selected input to the weights of the embeddings in the candidate set. If the patterns of metrics differ greatly from each other, instead of sharing the same expert, the backpropagated loss will induce them to exploit different experts, which is reflected by the diverse weight combinations given by the gates. This empowers the model to resolve conflicts and irrelevant drifts among metrics. Furthermore, we design a dual-gate mechanism. A shared gate g_s receives all selected windows from the input, while a personalized gate g_p^k belonging to the k-th metric only receives its own window. Since there are massive numbers of metrics, too many gates with equal weight make the backpropagated gradients more chaotic for the experts. A shared gate with a greater weight can learn robust mapping relationships between expert fusion and input from more data, and is more likely to induce the experts to converge on dominant characteristics. We combine the shared gate and the personalized gates to leverage the advantages of both, making each personalized gate act as an auxiliary that tunes the subtle differences between metrics. The hybrid gate is formulated as follows:

g^k(w_t) = α · softmax(w_t W_s) + (1 − α) · softmax(w_t^k W_k)    (5)

where W_s and W_k (k = 1, 2, ..., M) are trainable matrices, and α > 0.5 is the weight coefficient of the shared gate.
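A minimal sketch of this p&s gate (our reconstruction of Eq. 5; the softmax form and all names are assumptions consistent with standard MMoE gates, not the paper's verified code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ps_gate(w_t, k, W_s, W_k, alpha=0.7):
    """Hybrid p&s gate for metric k: a shared gate over the full flattened
    window plus a personalized gate over metric k's own window, mixed with
    coefficient alpha > 0.5 favouring the shared gate."""
    shared = softmax(w_t.reshape(-1) @ W_s)   # sees every metric's window
    personal = softmax(w_t[:, k] @ W_k)       # sees only metric k's window
    return alpha * shared + (1.0 - alpha) * personal
```

Since both terms are convex combinations over the experts, the mixed output still sums to 1 and can be used directly as fusion weights.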

Tower Network
In our framework, the purpose of the tower network is to condense the embeddings into one final predicted value. We utilize two dense layers, along with an activation function and a dropout layer, to reduce the dimensions. All towers' predictions are appended to the set ŷ_t. As the ground-truth observation at time t is y_t, the final training objective L aims to minimize the squared L2 norm:

L = || ŷ_t − y_t ||_2^2

Here, ||·||_2 denotes the L2 norm. L is also used in the inference phase as the anomaly score, which is further compared to a threshold set in line with previous detection performance in online detection scenarios.
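The loss/score duality above amounts to a few lines (a minimal sketch; function names are ours):

```python
import numpy as np

def anomaly_score(y_pred, y_true):
    """Squared L2 norm of the prediction error, used both as the training
    loss L and as the anomaly score at inference time."""
    return float(np.sum((y_pred - y_true) ** 2))

def detect(y_pred, y_true, threshold):
    """Flag the timestamp as anomalous when its score exceeds the threshold."""
    return anomaly_score(y_pred, y_true) > threshold
```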

Loss Balancing
Balancing the loss weight for each metric is a crucial topic, as improper weights could narrow the perspective of detection and neglect some important metrics. Thanks to the homogeneity among metrics, and even between the inputs and outputs in MTS anomaly detection, we are able to alleviate the imbalance problem simply by proper data preprocessing. For datasets whose metrics are not on the same order of magnitude, we normalize the original training data via a MinMax scaler, so that the losses of the metrics will not be too far from one another. More details about data preprocessing can be found in Section 4.1.
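The preprocessing step can be sketched as follows (per-metric MinMax scaling fitted on the training set only, as described in Section 4.1; the constant-metric guard is our addition):

```python
import numpy as np

def minmax_fit_transform(train, test):
    """Fit per-metric min/max on the training set and reuse them to
    normalize the test set, so all metrics' losses share a 0-1 scale."""
    lo = train.min(axis=0)
    rng = train.max(axis=0) - lo
    rng[rng == 0] = 1.0  # guard against constant metrics (our assumption)
    return (train - lo) / rng, (test - lo) / rng
```

Fitting only on training data avoids leaking test-set statistics, which matters when the test set contains anomalous extremes.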

EXPERIMENTAL SETUP
In this section, we first introduce the experimental datasets and evaluation metrics that are widely used in the MTS anomaly detection domain. Then we present the details of the hyperparameter settings.

Datasets and Evaluation Metrics
We conduct experiments on three real-world MTS datasets to evaluate the effectiveness of CAD: SMD [23], SWaT [18] and WADI [1]. All of these datasets are public and universally employed in previous works [14,21,23,24]. A summary of the datasets is listed in Table 1, including the number of entities, dataset size, the number of dimensions and the anomaly ratio in the test set. As the range of each metric in SMD is already limited to 0-1, we skip the data preprocessing of this dataset. For SWaT and WADI, whose readings range from 10^-2 to 10^3, we apply a MinMax scaler to limit the values of the training set to 0-1. The maximum and minimum values of the training set are further used as criteria to normalize the test set. Time series are then clipped to a proper scale. More details about the datasets and preprocessing can be found in our repository.
In a real-time system, anomalies generated by the system or external factors tend to persist for some time (e.g., bugs in programs cause sustained high CPU usage), forming a contiguous anomaly segment. Human operators rarely care about point-wise metrics in applications. Thus we apply the point-adjustment (PA) approach introduced by [27], based on the assumption that it is reasonable to consider an anomaly segment detected if at least one moment within the segment triggers an alert: if any point within a contiguous ground-truth anomaly segment is marked as an anomaly, the whole segment is considered correctly detected. Additionally, in practice, an alert after a long delay is futile, since the sooner an anomaly is identified, the less damage it causes to the system or service. Accordingly, we also adopt the k-th PA approach proposed by [19], which assumes that an anomaly segment is recognized correctly only if the delay of the detected point is less than k from the start of the segment.
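Both adjustment rules can be sketched together (a simplified implementation of our reading of [27] and [19]; the strict "delay < k" interpretation is our assumption):

```python
import numpy as np

def point_adjust(pred, label, k=None):
    """Point-adjustment: if any point inside a ground-truth anomaly segment
    is flagged, mark the whole segment as detected.  With k set, apply the
    k-th PA variant: the segment counts only if the first flagged point
    lies within k steps of the segment start (delay < k)."""
    pred = pred.copy()
    t, n = 0, len(label)
    while t < n:
        if label[t] == 1:
            end = t
            while end < n and label[end] == 1:
                end += 1                       # [t, end) is one segment
            hits = np.flatnonzero(pred[t:end]) # offsets of flagged points
            detected = len(hits) > 0 and (k is None or hits[0] < k)
            pred[t:end] = 1 if detected else 0
            t = end
        else:
            t += 1
    return pred
```

Precision/recall/F1 are then computed point-wise on the adjusted predictions.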
We employ Precision (P), Recall (R) and F1-score to evaluate the performance of our method and the baselines under the above two approaches. These metrics are calculated as follows:

P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2 · P · R / (P + R)

where TP denotes True Positives, FP denotes False Positives and FN denotes False Negatives. For each entity, we enumerate all possible anomaly thresholds and pick the one yielding the highest F1-score [2,13,14]; the result is denoted F1_best. We discard threshold-selection methods like POT [23] because they introduce parameters that need tuning, which is unfair to methods that do not provide these parameters. The purpose of F1_best is not to find a deployable threshold, but to directly measure a model's optimal performance without introducing any hyperparameter, ensuring fair evaluation across baselines. Several datasets contain more than one MTS entity. For instance, SMD comprises time series from 28 machines distributed across three clusters. In this case, the reported F1-score is the average of all machines' F1_best. In [23], the authors use the average precision (denoted P̄) and average recall (denoted R̄) to compute the F1-score; this measure is denoted F1* in our experiments. As P̄ and R̄ neutralize some severe deviations in the original precisions and recalls, F1* usually exceeds F1_best when the data distribution among entities is uneven.
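The F1_best enumeration can be sketched as follows (a minimal point-wise version; enumerating every distinct score as a candidate threshold is our simplification of "all possible thresholds"):

```python
import numpy as np

def best_f1(scores, labels):
    """Enumerate candidate thresholds (every distinct anomaly score) and
    return the highest point-wise F1, mirroring the F1_best protocol that
    avoids committing to one threshold-selection heuristic."""
    best = 0.0
    for th in np.unique(scores):
        pred = (scores >= th).astype(int)
        tp = int(((pred == 1) & (labels == 1)).sum())
        fp = int(((pred == 1) & (labels == 0)).sum())
        fn = int(((pred == 0) & (labels == 1)).sum())
        if tp == 0:
            continue
        p, r = tp / (tp + fp), tp / (tp + fn)
        best = max(best, 2 * p * r / (p + r))
    return best
```

In the paper's protocol this search would run on the point-adjusted predictions per entity, and the per-entity F1_best values would then be averaged.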

Hyperparameters Settings
We select hyperparameter combinations empirically on each dataset for CAD and its variants. Due to the differences in characteristics among the datasets, the time series window is set to 16 for SMD, and 32 for SWaT and WADI. The number of experts is 5 for SMD, 7 for WADI and 9 for SWaT. As a prediction-based method, we use the horizon as one hyperparameter, meaning the distance between the predicted timestamp and the last timestamp in the window. We set the horizon to 3 for SMD, which is more stable, and 1 for SWaT and WADI.
For all datasets, the number of kernels in one expert network is set to 16 and the weight coefficient of the shared gate α is set to 0.7 empirically. We apply the Adam optimizer and a cosine learning-rate scheduler with an initial learning rate of 0.001 to optimize the models. Batch sizes are set to 128 during training. An early-stopping strategy is adopted, with the maximum number of training epochs set to 10.
All variants in the ablation study use the same hyperparameters as CAD. The baseline models are trained with the parameters provided in their papers. In particular, for a fair comparison, we train and test Anomaly Transformer on each entity in SMD rather than on the concatenated series used in its source code.

EVALUATION
We conduct quantitative and qualitative experiments to evaluate the performance of our model. Four research questions need to be answered during the experiments:
RQ1: How does our model perform on public datasets compared to other state-of-the-art approaches?
RQ2: How much does each constituent of our design contribute to the overall performance?
RQ3: Is CAD robust enough under various parameter settings?
RQ4: Can different experts learn representations of the data from different perspectives?

Baseline Approaches
To demonstrate the merit of our proposed algorithm, we select nine recent state-of-the-art unsupervised methods for multivariate time series anomaly detection for comparison with CAD, including both prediction-based and reconstruction-based approaches. All methods' source codes are available on GitHub. We use the full-size datasets for all methods.
• LSTM-NDT [11] uses the LSTM-RNN model to attain high prediction accuracy and provides an unsupervised threshold selection method to dynamically evaluate residuals.
• DAGMM [29] learns a low-dimensional embedding of the original time series via a deep autoencoder (AE), then feeds the embedding and the reconstruction error of the AE into a Gaussian Mixture Model to estimate likelihood.
• MAD-GAN [13] employs an LSTM-RNN as the GAN's base model. In contrast with a conventional GAN, MAD-GAN considers the whole variable set concurrently in order to capture the inter-metric dependency between metrics.
• OmniAnomaly [23] and InterFusion [14] are methods based on variational autoencoders that denoise anomalies and capture dependencies via hierarchical stochastic latent variables.
• USAD [2] and TranAD [24] apply the idea of adversarial learning and design a two-stage training framework, combining the advantages of autoencoders/self-attention encoders and adversarial training.
• DVGCRN [4] adopts an adaptive variational graph convolutional recurrent network unit to capture fine-grained spatial and temporal correlations, further extended into a deep variational network.
• Anomaly Transformer [28] utilizes an attention mechanism to compute the association discrepancy and further amplifies it via a minimax strategy.

RQ1. Evaluation Results and Analysis
The anomaly detection performance of CAD and all baselines is listed in Table 2 under the point-adjustment (PA) approach and in Table 3 under the k-th point-adjustment (k-th PA) approach. As shown in the tables, most models work better on SMD, which is collected from real-time server clusters. Anomalies are relatively conspicuous on several machines (simultaneous large spikes on several metrics), raising the overall performance, especially the F1* mentioned above. For example, almost all methods obtain F1-scores up to 0.95 on machine-1-1. Nonetheless, the time series of certain machines are quite misleading, specifically through data fluctuations in the temporal dimension and intricate relationships in the inter-metric dimension, requiring a higher capability of discerning subtle differences between normal patterns and anomalies. The performance of models on these machines varies greatly, contributing to the major discrepancies in the final results. On machine-1-8, CAD earns a score of 0.9782, while the other methods' scores range from 0.5961 to 0.9780. On the whole SMD dataset, CAD outperforms the baselines by 1.9% (DVGCRN) to 26.5% (MAD-GAN), exhibiting a powerful capacity to deal with a variety of situations. In addition, most existing models obtain rather poor scores on high-noise datasets like SWaT and WADI. Comparatively, our method performs better in terms of F1 with a slight compromise in precision, as priority should be given to higher recall to some extent in anomaly detection tasks [9]. As most methods perform well on SMD, we further adopt the k-th point-adjustment evaluation, which is more discriminating and closer to real-world detection. As shown in Table 3, we assign k the values 10, 20 and 30 respectively. Although performance drops dramatically for some methods, CAD achieves the highest score in each case, maintaining a score of 0.7894 when k is 10 and 0.8822 when k is 30,
exhibiting the effectiveness of our method in practice.
Further analysis is conducted based on the baseline scores. LSTM-NDT and DAGMM are two unsupervised methods that use the observation at a single moment in time. Compared with methods taking a sequence of observations as input, both are weak at exploiting temporal relationships during highly correlated periods [2]. Moreover, LSTM-NDT predicts values for each metric separately, further losing the information embedded in inter-metric dependency [14]. As a typical generative model, OmniAnomaly adopts a handful of stochastic variables to model the data distribution over a sequence of observations, yet a limited number of stochastic variables cannot sufficiently extract the complicated characteristics of time series. InterFusion partially solves this problem with two-view stochastic variables, introducing an extra dimension to expand the representation space. However, these variables, along with the MCMC imputation in the inference phase, increase model instability: several runs yield results with noticeable deviations. Moreover, methods employing LSTM or GRU structures tend to consume extremely long training time, which is inapplicable to real-world situations.
Compared with the baselines, CAD maximizes the effectiveness of temporal and inter-metric interconnections with the assistance of its well-designed structures. We observe that considerable drifts exist in some metrics over prolonged periods of time, and these drifts are labeled as normal in both the training and test sets. Because certain metrics' trends are inherently unstable, even reasonable fluctuations in them have a substantial impact on other metrics under existing models, causing the model to misinterpret them as inter-metric anomalies. The expert selection mechanism in CAD effectively shields this irrelevant influence and smooths the anomaly score on these time slices. As a result, CAD is more sensitive to genuine anomalies. Meanwhile, this mechanism partly resolves the unpredictability of some metrics that has been criticized in traditional prediction-based methods [16].
In detail, we conduct several case studies to illustrate the efficacy mentioned above. We visualize a few metrics of machine-1-8 in SMD, on which the models' performance varies significantly, from 0.5761 to 0.9782 (Fig. 4a). We also present the anomaly scores of all models except MAD-GAN, which does not even converge on this dataset (Fig. 4b). Although both CAD and LSTM-NDT are forecasting-based methods, LSTM-NDT is too sensitive to fluctuation to distinguish anomalies from trivial noise. The anomaly scores of DAGMM and TranAD drift obviously on the time series circled in green. These series do not violate the data distribution but merely settle at a higher baseline within reasonable limits, e.g., due to a load-balancing strategy. Even though Interfusion handles this situation appropriately, it fails to assign proper scores to the segments circled in blue, which are even more "normal" than the previous one. We also plot the score of CAD-single, a variant of CAD whose metrics are trained and tested separately. Lacking inter-metric dependency, this variant works well on the segment mentioned above yet performs poorly on other data. In contrast, CAD works in both cases, since it learns jointly from intra-metric and inter-metric dependencies to find the intrinsic normal pattern of each metric; the drifts are almost removed from CAD's anomaly score by the techniques in our framework. Even on datasets where all methods achieve high scores, as shown in Fig. 5, CAD exhibits more evident spikes when encountering anomalies owing to its superior anti-interference capability.

RQ2. Ablation Study
We omit each relevant component of the framework to observe how much it affects the model's F1-score on various datasets. Since CAD leverages inter-metric dependency to jointly detect anomalies across all metrics, two issues in the framework design deserve attention. First, we want to know whether different experts can pick up specific traits of temporal patterns from distinct perspectives as expected, and whether the gate networks can automatically learn metric-specific combinations of the representations generated by the experts. Second, we want to know whether the dependency between metrics helps the model catch anomalies, that is, whether training jointly on all metrics outperforms the corresponding single-metric model. Regarding individual components, can the convolution units designed for the time series task effectively extract representations from raw data? In addition, how does the number of experts relate to the model's performance?

Effectiveness of the MoE framework. We choose a commonly used multi-task learning model, the Shared-Bottom structure [15], as our baseline. Compared with CAD, which has a group of bottom networks, it sends the shared representations to task-specific towers directly without going through gate networks; we therefore refer to this structure as w/o gate. To exclude effects induced by model complexity, for each number of experts we modify the scale of the bottom layers in the w/o gate structure such that the number of convolution kernels in the two
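The contrast between the gated structure and the w/o gate baseline can be sketched with a toy forward pass. The linear experts and gates below are hypothetical stand-ins for CAD's convolutional experts; the point is only the routing: every task sees all experts, but each task's gate learns its own softmax mixture over them.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mmoe_forward(x, expert_ws, gate_ws):
    """Minimal MMoE-style forward pass with linear experts and gates.
    x: (in_dim,) input; expert_ws: list of (in_dim, d) expert weights;
    gate_ws: one (in_dim, n_experts) weight matrix per task.
    Returns one d-dim mixture per task."""
    expert_out = np.stack([x @ W for W in expert_ws])        # (n, d)
    mixtures = []
    for G in gate_ws:                                        # one gate per task
        w = softmax(x @ G)                                   # (n,) mixture weights
        mixtures.append(np.tensordot(w, expert_out, axes=1))  # (d,)
    return mixtures
```

The w/o gate (Shared-Bottom) baseline corresponds to replacing each learned mixture `w` with the same fixed averaging weights for every task, so all towers receive an identical shared representation.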

RQ3. Hyperparameter Sensitivity
As an application-oriented approach, the feasibility of the model deserves adequate consideration. We conduct contrast experiments on key hyperparameters to test whether the model is parameter-sensitive (Fig. 6). One such parameter is the window size. A larger window size captures more long-term dependency within metrics, but introduces more remote time points that may dilute the significance of neighboring observations. In our experiments, the score of CAD shows a slight downward trend as the window size exceeds 40. In general, however, CAD obtains relatively smooth scores under a variety of settings, showing its robustness to window size. Another key parameter is the number of experts.
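The window-size trade-off discussed above is easiest to see in how inputs are built. The helper below is a generic sketch of sliding-window construction for forecasting-based detection, not CAD's exact data pipeline.

```python
import numpy as np

def sliding_windows(series, window):
    """Build (window,)-shaped historical inputs and next-point targets.
    A larger `window` adds long-term context at the cost of mixing in
    remote points that may dilute the most recent observations."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]          # the point each window must forecast
    return X, y
```

For multivariate data the same construction applies per time step, with each window entry holding all metrics' values.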

RQ4. Feature Extraction Capability of Experts
A key prerequisite for modeling inter-metric promotion as well as conflict is that the experts can learn the traits of the time series from diverse perspectives. Given a time window w_t as input, the expert networks generate f(w_t), whose shape is n × d, where n denotes the number of experts and d denotes the dimension of the hidden vector. To analyze the validity of the expert networks, we visualize the distribution of embeddings sampled from f(w_t) during the test phase on machine-1-8, with n = 5 and d = 128. Since there is no straightforward way to view a 128-dimensional space, we compress it to two dimensions through t-distributed Stochastic Neighbor Embedding (t-SNE) [25]. Each point in the scatter plot corresponds to a low-dimensional representation of the original hidden vector and is colored according to the expert it comes from (Fig. 7). In the t-SNE space, embeddings from the same expert aggregate together, showing certain similarities. On a global scale, the well-separated space further demonstrates the experts' ability to discern abundant yet distinct features.
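This visualization step can be reproduced in outline with scikit-learn's t-SNE. The random embeddings below are placeholders for the actual n × d expert outputs collected at test time; only the shapes and the compression step mirror the procedure described above.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for the expert outputs: n = 5 experts, each emitting a
# d = 128-dimensional hidden vector for several test windows.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5 * 6, 128))
expert_id = np.repeat(np.arange(5), 6)   # color label per point in the scatter plot

# Compress the 128-dimensional vectors to 2-D for plotting.
emb_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)
```

Each row of `emb_2d` is then scattered and colored by `expert_id`, as in Fig. 7.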

RELATED WORK
Deep neural networks have proven highly effective at modeling intricate dependencies in time series over the past few years and have become the method of choice in this field. Moreover, most proposed models are based on unsupervised learning due to the paucity of anomaly labels. Since anomalies are only small-probability events, this approach works quite well.
Taxonomically, these unsupervised solutions can be categorized as reconstruction-based, forecasting-based and hybrid models [6].
Reconstruction-based models. These models encode subsequences of the training data in a latent space to filter out rare outlier points. Usually, a sliding window is fed into the network and mapped onto a low-dimensional space; a decoder network then expands these data back to reconstruct the input. In this process, an anomaly is unlikely to be recovered because the model learns few abnormal patterns, resulting in a large gap between outputs and abnormal inputs, called the reconstruction error, from which the anomaly score is calculated. As an early attempt, EncDec-AD [16] embeds an encoder-decoder structure into an LSTM network to learn condensed temporal patterns. OmniAnomaly [23] and Interfusion [14] further employ stochastic variables to improve the robustness of the model. However, these methods are prone to error accumulation on long sequences [21].
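The generic scoring rule shared by this family can be sketched as follows; `reconstruct` is a placeholder for any trained autoencoder-style model, not a specific method from the works cited above.

```python
import numpy as np

def reconstruction_score(window, reconstruct):
    """Generic reconstruction-based anomaly score: rebuild the input
    window and use the point-wise reconstruction error as the score.
    A large error means the pattern was rarely seen in training and
    is therefore likely anomalous."""
    recon = reconstruct(window)                      # (T, n_metrics)
    return np.linalg.norm(window - recon, axis=-1)   # one score per time step
```

A model that reproduces its input perfectly yields a score of zero everywhere; anomalous windows, which the model cannot rebuild, receive large scores.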
Forecasting-based models. The premise of unsupervised forecasting-based models is that normal time series follow certain rules, and anomalies are those that violate the inherent patterns. These models predict forthcoming points from historical observations, then estimate whether an anomaly occurs according to the point-wise difference between predicted and ground-truth values. Compared with traditional long-sequence time-series forecasting, anomaly detection requires more precise prediction over a closer horizon and the ability to cover diverse metrics concurrently. LSTM-NDT [11] is a well-known effort to handle this problem in a forecasting-based manner, using a non-parametric dynamic error-thresholding strategy. THOC [21] uses a dilated skip-RNN structure to capture temporal dynamics and a hierarchical clustering process to fuse multi-scale features.

Hybrid models. These methods leverage composite errors, e.g., forecasting error and reconstruction error, to obtain the final anomaly score. USAD [2], as well as TranAD [24], adopts two-phase training to amplify reconstruction errors. The Generative Adversarial Network is another framework applied to reconstruct inputs: MAD-GAN [13] uses an RNN-based discriminator and generator to detect anomalies based on both reconstruction and discrimination losses. As a GNN-based method, FuSAGNet [9] jointly optimizes reconstruction and forecasting errors. DVGCRN [4] also computes both reconstruction and prediction scores via a graph convolutional recurrent network.
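A hybrid score in the spirit of these models can be sketched as a weighted combination of the two error types. The weight `alpha` and the simple convex fusion rule are illustrative assumptions; each cited model fuses its errors differently.

```python
import numpy as np

def hybrid_score(x_true, x_pred, x_recon, alpha=0.5):
    """Illustrative hybrid anomaly score: a convex combination of the
    forecasting error (prediction vs. ground truth) and the
    reconstruction error, one score per time step."""
    forecast_err = np.abs(x_true - x_pred).mean(axis=-1)
    recon_err = np.abs(x_true - x_recon).mean(axis=-1)
    return alpha * forecast_err + (1 - alpha) * recon_err
```

Setting `alpha` to 1 or 0 recovers a purely forecasting-based or purely reconstruction-based score, which makes the taxonomy above concrete.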

CONCLUSION
Today's software systems demand rapid responses to anomalies. Our analysis of MTS indicates that inappropriate use of inter-metric dependencies has negative effects in some cases. In this paper, we propose Conflict-aware multivariate KPI Anomaly Detection (CAD), a novel unsupervised framework that effectively weeds out harmful conflicts that may confuse detection for certain metrics. Informed by real-world data, we provide an exclusive structure for each metric to isolate possible conflicts to a certain degree. Moreover, the task-oriented feature-selection mechanism and a hybrid gate structure are carefully designed to handle conflicts among metrics, greatly enhancing the effectiveness of the model. Under the combined effect of these techniques, time series-oriented experts learn rich characteristics from diverse perspectives. We adopt a series of metrics for comprehensive evaluation: CAD outperforms all state-of-the-art baselines on three widely used datasets. Several experiments demonstrate that CAD is adept at modeling intrinsic normal patterns while remaining immune to the irrelevant interference that causes false alarms or omissions in baseline methods. Its concise structure and high computational efficiency allow it to be widely deployed in various scenarios and enable real-time detection. Moreover, the hyperparameter sensitivity study confirms its feasibility for a variety of detection tasks.

Figure 1 :
Figure 1: Illustration of one possible conflict among metrics. Anomalous segments at the metric level are highlighted in red, and the segment corresponding to a reasonable drift in metric 3 is highlighted in blue. The regions enclosed by the yellow boxes share a similar distribution. A previous method's detection result, as well as ours, is zoomed in and listed below. Red backgrounds in the results are the union of all metrics' anomalies (including the three metrics plotted).
(a) Metrics visualization. Representative metrics are plotted with their raw values. Anomalous segments at the metric level are highlighted in red. (b) Anomaly scores of baselines. The best score threshold is represented by the red dotted line. Anomalous segments of all metrics are highlighted in red.

Figure 5 :
Figure 5: Anomaly scores of baselines on machine-1-1 without the first 10,000 trivial (stable and normal) points. For all baselines, the best F1 is greater than 0.95.
When it is set to 1, CAD degrades to a shared-bottom model. As the number of experts increases, the model has a greater capacity to capture the temporal and inter-metric dependencies hidden in the time series, while the extra parameters sacrifice the simplicity of the model, spending additional training and testing time that is unacceptable for real-time anomaly detection. A value between 5 and 11 provides a reasonable trade-off between representation capacity and time efficiency.

Figure 7 :
Figure 7: Visualization of embedding distributions of different experts.The high-dimensional vectors are mapped to a 2-dimensional space through t-SNE.

Table 2 :
Performance comparison under the point-adjustment (PA) approach. Best scores are highlighted in bold; second-best scores are highlighted in bold and underlined.

Table 3 :
Performance comparison under the k-th PA approach on SMD. Delay denotes the detection deadline within each anomaly segment; a detection is counted as valid only when outliers are flagged within the delay.

Table 4 :
Performance of CAD and its variants in terms of best-F1 under point-adjustment.