Pre-trained KPI Anomaly Detection Model Through Disentangled Transformer

In large-scale online service systems, numerous Key Performance Indicators (KPIs), such as service response time and error rate, are gathered in a time-series format. KPI Anomaly Detection (KAD) is a critical data mining problem due to its widespread applications in real-world scenarios. However, KAD faces the challenges of KPI heterogeneity and noisy data. We propose KAD-Disformer, a KPI Anomaly Detection approach through Disentangled Transformer. KAD-Disformer pre-trains a model on existing accessible KPIs, and the pre-trained model can be effectively "fine-tuned" to an unseen KPI using only a handful of samples from that KPI. We propose a series of innovative designs, including disentangled projection for the Transformer, unsupervised few-shot fine-tuning (uTune), and denoising modules, each of which contributes significantly to the overall performance. Our extensive experiments demonstrate that KAD-Disformer surpasses the state-of-the-art universal anomaly detection model by 13% in F1-score and achieves comparable performance using only 1/8 of the fine-tuning samples, saving about 25 hours. KAD-Disformer has been deployed in a real-world online service system serving millions of users for months.


Introduction
Online service systems are playing essential and growing roles in our daily life. Example systems include social networks, online shopping, mobile payment, and search engines. To guarantee high-quality and uninterrupted services, businesses increasingly rely on Key Performance Indicator (KPI) time series data to pinpoint and tackle anomalies and other concerns [5,36,42]. Anomaly detection (AD) based on time series focuses on rapidly identifying and addressing irregularities, making it a hot topic within the data mining community [18,28,33-35,43]. Traditional KPI anomaly detection methods are mostly rule-based and consist of massive handcrafted thresholds. Although this type of anomaly detection is computationally efficient, the detection performance is far from satisfactory [19]. Besides, the huge human effort required to decide the considerable number of rule parameters makes these methods infeasible for large-scale online systems.
However, real-world time series data are complex and exhibit significant variations across different domains. Consequently, existing anomaly detection algorithms employ specialized models tailored for each individual KPI. However, for large online systems with hundreds of thousands of KPIs, training individual models leads to prohibitive overhead. Furthermore, the long initialization time [20] of existing algorithms, determined chiefly by the amount of data used for fine-tuning the model to achieve satisfactory accuracy, renders them impractical for rapidly changing services. We identify two criteria for practical anomaly detection algorithms to monitor large-scale online systems:
• Generalizable Pre-Training: the ability to detect anomalies across diverse KPI datasets using a universal model, which is pre-trained on the accessible datasets and can adapt to previously unseen datasets.
• Unsupervised Few-Shot Fine-Tuning: the ability to achieve strong performance even with limited data after fine-tuning, minimizing the initialization time. This allows models to rapidly adapt to unseen KPIs or services.
However, a naive pre-trained KPI anomaly detection model may suffer from obvious performance degradation when there is a large drift between the "KPIs for pre-training" and the "KPIs for fine-tuning". The KPIs for pre-training are the accessible KPIs used to pre-train the model. The KPIs for fine-tuning are the KPIs that we want the model to detect anomalies for, but they are not known or accessible during the pre-training process. To solve the performance degradation problem, some existing anomaly detection models [21,40,41] adopt a two-stage approach: classifying KPIs into different groups, then fine-tuning models for each group. However, the grouping stage introduces computation overhead, and performance can be compromised if a KPI is inappropriately clustered. Here we need a universal pre-trained model with the power to effectively and quickly adapt to incoming KPIs without explicit clustering. Yet facing real-world complex KPI data, we encounter three challenges:
• High KPI diversity: KPIs from thousands of applications and systems have diverse, non-stationary patterns. Pre-training a single model for such heterogeneous data is quite challenging.
• Tailored model adaptation: it is challenging to ensure that the model quickly adapts to incoming KPIs while maintaining good performance on historical KPIs. A general model with limited capacity may adapt quickly to incoming KPIs but at the cost of performance degradation on historical KPIs. On the other hand, a complex model may be more capable on historical KPIs but not flexible enough for incoming KPIs.
• Robustness to noisy data: rapidly adapting to incoming KPIs during fine-tuning demands high-quality data, yet KPI time series often contain noise. Efficiently achieving satisfactory performance on the limited and noisy data of the KPI for fine-tuning poses a considerable challenge.
In this paper, we propose KAD-Disformer, a KPI Anomaly Detection approach through Disentangled Transformer. Different from previous Transformer-based KAD models [29,34], we disentangle the projection matrices ($W$) of query, key, and value in the Transformer into a common part $W_c$ and a personalized part $W_p$. $W_c$ focuses on the common projection patterns across different types of KPIs, while $W_p$ learns the personalized projection patterns of individual KPIs from a very limited number of samples. To achieve this, we design the uTune mechanism for unsupervised few-shot fine-tuning. Assisted by the tailored two-stage gradient update mechanism of uTune, the personalized projection matrices are capable of rapidly adapting to the KPIs while effectively preventing over-fitting during the fine-tuning process. Our model improves the F1-score of anomaly detection by considerable margins. The detailed evaluation also confirms that our model can quickly achieve a high F1-score with only 1/8 of the fine-tuning samples required by other existing methods.
Our contributions are summarized as follows:
• To the best of our knowledge, KAD-Disformer is the first pre-trained time series-based KPI anomaly detection model. Through careful selection of the model structure and optimization techniques, we significantly enhance the effectiveness and efficiency (i.e., initialization time) of the anomaly detection algorithm.
• In KAD-Disformer, we disentangle the projection matrices in the Transformer into a common projection and a personalized projection to effectively trade off between maintaining model capacity and quickly adapting to an incoming KPI. The personalized projection is updated in our uTune mechanism to quickly fit the fine-tuning samples with little risk of over-fitting.
• In KAD-Disformer, we design the adapter layers and the denoising reconstruction mechanism to improve detection accuracy.
• We conduct a comprehensive evaluation not only to show the overall performance but also to measure the contribution of each part of KAD-Disformer to the overall performance (Section 5).
• We have deployed KAD-Disformer to a large-scale real-world online service system, helping it maintain high-quality service for months. The code of this paper is released at https://github.com/NetManAIOps/KAD-Disformer.
Related Work

Though the performance of deep-learning-based unsupervised methods is superior to traditional methods, the overhead brought by retraining on large numbers of different KPIs makes them infeasible in large-scale online service systems [27]. Therefore, several transferable methods have been proposed recently. These methods can learn from existing large-scale KPI data and efficiently transfer the model to fit an incoming KPI [40,41]. For example, ATAD [41], a cluster-based semi-supervised method, extracts features from KPIs with hand-crafted rules and groups the historical KPIs. When a new KPI comes, ATAD assigns this KPI to a group and then asks for partial labels to fine-tune the classifier. However, the performance of ATAD is unstable and highly dependent on the clustering algorithm and the quality of labels. All these transferable methods rely on extra information such as clustering to guarantee detection performance.

Preliminaries

KPI Anomaly Detection
Since time series data in online service systems are largely affected by service schedules or user behaviors, most of them show the property of seasonality [15]. Time series collected from real-world production environments inevitably contain noise. Therefore, the normal patterns of seasonal time series have two parts: 1) normal seasonal patterns with local variations, and 2) noise with some kind of distribution [40]. For univariate time series, people usually regard the data points that do not follow the normal patterns as anomaly points [40] (e.g., spikes or dips). For the sake of brevity, we use notation similar to [14]. Given a time series $X = [x_0, x_1, \dots, x_T]$, where $x_t$ denotes the observation at time $t$, we define KPI anomaly detection as follows: the goal is to predict the corresponding label series $Y = [y_0, y_1, \dots, y_T]$, where $y_t \in \{0, 1\}$ indicates whether the point at time $t$ is anomalous. When predicting $y_t$, only the observations up to time $t$ can be used, since the sub-sequence after time $t$ is unknown at time $t$.

Few-Shot in KAD
In the KAD domain, there is currently no clear definition of "few-shot". In real-world environments, KPI data is continuously generated in chronological order. Therefore, when integrating new KPIs and deploying KAD models, there is an initialization time. The time spent waiting for new KPI data to be generated (i.e., collecting data for fine-tuning) occupies the majority of this initialization time. This period usually spans several dozen or even hundreds of hours, because it is directly aligned with real-world time and the newly integrated KPIs do not have historical data [37].
Based on the above observation, we propose that the few-shot capability of KAD can be measured from two perspectives: first, whether KAD can achieve better anomaly detection performance with an equal amount of fine-tuning data; second, whether less fine-tuning data is needed to achieve competitive anomaly detection performance. Both perspectives imply that less fine-tuning data needs to be collected to achieve satisfactory KAD performance, allowing for faster model deployment and improved efficiency.

Methodology
In this section, we first give a brief overview of KAD-Disformer in terms of architecture and workflow. Then we give detailed introductions to the Disentangled Projection Matrices (denoted as DPM), uTune, and Denoising Reconstruction modules, respectively. The design of the Transformer is introduced together with DPM, and the design of the adapters, namely the Series Adapter and the Encoder Adapter, is introduced together with uTune.

KAD-Disformer Model Overview
KAD-Disformer follows an encoder-decoder architecture, and the overall architecture is shown in Figure 2. The Transformer [31] can effectively capture temporal dependencies like RNN-based models while being more parallelizable. Transformer-based models have excellent generalizability, and a well-trained model can be effectively transferred to many other datasets and tasks through fine-tuning [24]. Inspired by these advantages, we exploit the Transformer framework as the base to tackle universal anomaly detection tasks.
The encoder-decoder architecture is widely used in the KPI anomaly detection area [2,3,18,28,33,34,40]. A popular hypothesis for unsupervised encoder-decoder anomaly detection models was proposed in [33]: given training data, the model can learn normal patterns through dimensionality reduction. No label is needed in this process, and all the knowledge is learned from the raw data automatically.

Workflow of KAD-Disformer
Initially, we perform data preprocessing on the raw KPI data for our model, KAD-Disformer. In contrast to existing methods [3,11,18,33], our approach uses two sliding windows for the raw KPI data embedding. For each KPI input vector, we apply two sliding windows to convert the input vector into a matrix. The first acts as a context window with a stride of 1, and its purpose is to aggregate local data information. The second serves as a historical window, and its stride often bears some form of intrinsic significance, such as the KPI's period. The stride of this historical window can be decided by the users based on their domain-specific knowledge, with the default stride being the KPI period calculated by the Fast Fourier Transform (FFT) [23]. The primary goal of this historical window is to effectively recognize and retain the long-term dependencies inherent in the time series data.
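As an illustration, the following minimal sketch builds the two window matrices and estimates the default historical stride with the FFT. The function and variable names here are ours, not from the released code:

```python
import numpy as np

def estimate_period(x: np.ndarray) -> int:
    """Estimate the dominant period of a KPI via FFT (used as the default historical stride)."""
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    spectrum[0] = 0.0                                  # drop the DC component
    freqs = np.fft.rfftfreq(len(x), d=1.0)
    dominant = int(np.argmax(spectrum))
    return max(1, int(round(1.0 / freqs[dominant])))

def sliding_windows(x: np.ndarray, size: int, stride: int) -> np.ndarray:
    """Convert a 1-D KPI into a (num_windows, size) matrix with the given stride."""
    starts = range(0, len(x) - size + 1, stride)
    return np.stack([x[s:s + size] for s in starts])

# Context windows (stride 1, local information) and
# historical windows (stride = estimated period, long-term dependencies).
kpi = np.sin(np.linspace(0, 40 * np.pi, 2000)) + 0.05 * np.random.randn(2000)
period = estimate_period(kpi)
context_matrix = sliding_windows(kpi, size=64, stride=1)
history_matrix = sliding_windows(kpi, size=64, stride=period)
```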
The subsequent workflow of KAD-Disformer consists of pre-training, fine-tuning, and inference. During the pre-training phase, we randomly initialize all the parameters of KAD-Disformer, and each iteration updates all the parameters in the model. The detailed procedure is given in Appendix B.
For fine-tuning, we design the uTune mechanism, detailed in Section 4.4. uTune follows the adapter-based fine-tuning paradigm [10] and only updates part of the parameters of KAD-Disformer.

Disentangled Projection Matrices
The core component of the Transformer architecture is the attention mechanism. Specifically, the original multi-head attention, as described in [31], employs three learnable linear projections $W^Q$, $W^K$, $W^V$ to map the raw data $X$ to the query ($Q$), key ($K$), and value ($V$) matrices in distinct higher-dimensional spaces. This transformation is outlined in Equation (1).
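For reference, the standard scaled dot-product attention of [31], which Equation (1) instantiates, can be written as follows (the notation here is ours and may differ slightly from the paper's):
$$
Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V,
$$
$$
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
$$
where $d_k$ is the dimension of the key vectors.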
In the context of KAD-Disformer, we propose a novel approach aimed at decoupling the training of the base model from that of the task-specific model. To achieve this, we disentangle the linear projection $W$ into two separate components: a generalization (common) projection, denoted $W_c$, and a personalized projection, denoted $W_p$.
$W_c$ is only updated in the pre-training phase, delivering the knowledge acquired in the pre-training stage to the fine-tuning stage. Conversely, $W_p$ is iteratively updated during both the pre-training and fine-tuning phases. The primary function of $W_p$ is to assimilate knowledge specific to a particular KPI, thereby enhancing the "learning to learn" performance on the fine-tuning data. We delve into further detail regarding the updating of $W_p$ in Section 5.5. Our method, referred to as disentangled dot-product attention, is formalized in Equation (2).
In Equation (2), the query, key, and value matrices are obtained by applying both the common projections $W_c$ and the personalized projections $W_p$ to the input, and the scaling term is determined by the size of our sliding window.
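To make the idea concrete, here is a minimal single-head sketch in PyTorch. It assumes an additive composition of the common and personalized projections; the exact composition is given by Equation (2) in the paper, and all names here are illustrative:

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledAttention(nn.Module):
    """Single-head sketch of disentangled dot-product attention.

    Assumption: W_c and W_p are combined additively; the paper's
    Equation (2) defines the exact form.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Common projections W_c: frozen after pre-training.
        self.wc_q = nn.Linear(dim, dim, bias=False)
        self.wc_k = nn.Linear(dim, dim, bias=False)
        self.wc_v = nn.Linear(dim, dim, bias=False)
        # Personalized projections W_p: also updated during uTune fine-tuning.
        self.wp_q = nn.Linear(dim, dim, bias=False)
        self.wp_k = nn.Linear(dim, dim, bias=False)
        self.wp_v = nn.Linear(dim, dim, bias=False)

    def freeze_common(self) -> None:
        """Freeze W_c before fine-tuning so only W_p (and adapters) are updated."""
        for layer in (self.wc_q, self.wc_k, self.wc_v):
            for p in layer.parameters():
                p.requires_grad = False

    def forward(self, x: torch.Tensor, kv: Optional[torch.Tensor] = None) -> torch.Tensor:
        kv = x if kv is None else kv   # cross-attention path used by the denoising decoder
        q = self.wc_q(x) + self.wp_q(x)
        k = self.wc_k(kv) + self.wp_k(kv)
        v = self.wc_v(kv) + self.wp_v(kv)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ v
```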
Why disentangle? The main advantage of disentangling the projection matrices in universal AD is alleviating the overfitting problem during fine-tuning while keeping the knowledge learned from the pre-training data available for fine-tuning. If all the projection matrices were updated with fine-tuning data without disentangling, they would easily overfit, because the small-scale data usually cannot reflect all the patterns of the incoming KPIs. With disentangled projection matrices, $W_c$ still stores the knowledge from the pre-training data after fine-tuning, so the attention does not easily deviate from the original projections. Besides, our uTune can also alleviate the overfitting problem, which is discussed in Section 4.4.
Why projection matrices? The primary rationale behind disentangling the projection matrices is that they constitute the core structure of a Transformer; these matrices are highly information-dense and important. $W_c$ is primarily responsible for employing the knowledge obtained from pre-training to compute attention, whereas $W_p$'s role is to make fine adjustments based on the characteristics of the current KPI. This approach thus achieves a balance between retaining pre-trained knowledge and utilizing the intrinsic features of the KPI.

uTune
The model adaptation for unseen incoming KPI data is a great challenge in the universal AD area. To solve this problem, we propose a two-stage adapter-based fine-tuning mechanism called uTune. To better illustrate how uTune works, we center our discussion around two key questions: "What to update?" and "How to update?".

What to update?
$W_p$ and the adapters are the components updated during uTune. The function of $W_p$ has been thoroughly elucidated in Section 4.3. Our focus now shifts to the adapters within KAD-Disformer. The main purpose of the adapter modules is to maximize the utilization of the model's parameters from the pre-training phase when handling unseen KPI data. As illustrated in Figure 2, there are two adapters: the Series Adapter and the Encoder Adapter.
Series Adapter layers take the raw fine-tuning data as input to learn a time series data adapter, making the model fit the incoming new time series data. To better adapt to the data and improve anomaly detection performance, we incorporate a time series decomposition module [32] into the Series Adapter to eliminate noise and produce cleaner time series data. We decompose the time series into two parts: the seasonal part and the trend part. We apply an average pooling window to compute the seasonal part of the time series and keep the residual as the trend part, as shown in Equation (3).
Average pooling can be executed efficiently in the neural network with little overhead. The downstream modules of the time series decomposition are two feed-forward networks fed by the seasonal part and the trend part, respectively. The feed-forward networks provide the learnable parameters for fine-tuning the time series. After being transformed by the feed-forward networks, the seasonal and trend parts are added together and sent to the downstream modules. Encoder Adapters are located in the encoder module (Figure 2). These layers are simple and efficient fully connected layers designed to perform deeper-level adaptation when the encoder has multiple layers. In practice, the encoder is often a stack of multiple layers, and the Series Adapter only acts at the input stage. As the number of encoder layers increases, the effectiveness of the Series Adapter diminishes, so we introduce the Encoder Adapter to facilitate the effective utilization of the common knowledge in the deeper layers.
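The following sketch illustrates both adapters. It is a minimal, illustrative implementation: the kernel size, feed-forward widths, and the residual connection in the Encoder Adapter are our assumptions, not values from the paper; following the paper's wording, the pooled component is named the seasonal part and the residual the trend part.

```python
import torch
import torch.nn as nn

class SeriesAdapter(nn.Module):
    """Sketch of the Series Adapter: moving-average decomposition plus two feed-forward networks."""

    def __init__(self, window: int, kernel_size: int = 25):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2,
                                 count_include_pad=False)
        self.ffn_seasonal = nn.Sequential(nn.Linear(window, window), nn.ReLU(),
                                          nn.Linear(window, window))
        self.ffn_trend = nn.Sequential(nn.Linear(window, window), nn.ReLU(),
                                       nn.Linear(window, window))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window)
        seasonal = self.pool(x.unsqueeze(1)).squeeze(1)   # smoothed component ("seasonal" in the paper's wording)
        trend = x - seasonal                              # residual component ("trend" in the paper's wording)
        return self.ffn_seasonal(seasonal) + self.ffn_trend(trend)

class EncoderAdapter(nn.Module):
    """Sketch of an Encoder Adapter: a small fully connected layer between encoder layers."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc(x)   # residual connection is our assumption
```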

How to update?
Our design is inspired by the concept of First-Order Model-Agnostic Meta-Learning (FOMAML) [8]; we do not apply it directly but adapt it to the characteristics of our task. In each iteration, we sample two mini-batches of data of the same size: batch $B_1$ is from the unseen KPI prepared for fine-tuning, and batch $B_2$ is from the existing pre-training datasets. Given a loss function $\mathcal{L}$, KAD-Disformer performs a forward pass with $B_1$ and evaluates the loss $\mathcal{L}_1$ in the first stage. Then KAD-Disformer backpropagates to update the parameters of $W_p$ and the aforementioned adapter layers (Lines 5-6 in Algorithm 2). The first stage focuses on the unseen KPI and pushes the pre-trained base model to learn the personal information from the unseen KPI. We call this first stage the personalization stage.
In the second stage, we use $B_2$ for the forward pass and evaluate the loss $\mathcal{L}_2$. At this point, we do not backpropagate $\mathcal{L}_2$ directly but compute a final loss $\mathcal{L}$ as in Equation (4),
where $\alpha \in [0, 1]$ is a hyper-parameter balancing the weights of performance on new data ($\mathcal{L}_1$) and existing data ($\mathcal{L}_2$). Based on our experience, we set $\alpha$ to 0.5; the sensitivity analysis of $\alpha$ is in Appendix C. We then backpropagate $\mathcal{L}$. The second stage focuses on the performance on the overall data and prevents $W_p$ and the adapter layers from overfitting to the unseen KPI, making the model learn from new and existing data simultaneously.
We call this second stage the generalization stage. The pseudocode of uTune can be found in Appendix B. The two-stage update can effectively reduce the convergence time while achieving high performance.
The personalization stage makes KAD-Disformer fit the unseen KPI data quickly while guaranteeing performance. Recall that the fine-tuning procedure easily overfits due to the small scale of the fine-tuning data, which can lead to performance degradation after deployment. The generalization stage utilizes the knowledge from both new data and existing data for backpropagation in order to alleviate the overfitting problem. Moreover, different KPI data may share some common knowledge, i.e., the knowledge stored in $W_c$. Thus, even if the small amount of unseen KPI data cannot reflect all of its features, this shared common knowledge can help the model better understand the features of unseen KPIs.
The main difference between our method and FOMAML is that we do not update the initial parameters $\theta$ uniformly after calculating the gradients of all tasks. Instead, we directly update the parameters to obtain $\theta^*$ after computing $\mathcal{L}_1$, and then update again based on $\theta^*$. Our design aims to adapt to unseen KPIs better and faster. The original FOMAML typically deals with multiple tasks, and its concept is to obtain more common knowledge from multiple few-shot tasks to achieve few-shot learning. In our scenario, each fine-tuning targets a specific unseen KPI, i.e., the number of tasks is 1. Therefore, our focus is on quickly adapting to the current KPI. We directly update $\theta$ using the gradient of $\mathcal{L}_1$ to obtain $\theta^*$, and then continue updating based on $\theta^*$ to prevent overfitting to the small amount of unseen KPI data and to incorporate historical data.
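A minimal sketch of one uTune iteration is shown below. It assumes the final loss of Equation (4) is the convex combination $\alpha \mathcal{L}_1 + (1-\alpha)\mathcal{L}_2$ and that `recon_loss` is a hypothetical helper computing the full reconstruction loss of Section 4.5; the optimizer is built only over $W_p$ and the adapter parameters, and Algorithm 2 in Appendix B gives the exact procedure.

```python
import torch

def utune_step(model, optimizer, batch_new, batch_pretrain, recon_loss, alpha: float = 0.5):
    """One uTune iteration (sketch): personalization stage followed by generalization stage.

    `optimizer` must contain only the personalized projections W_p and the adapter
    layers; the common projections W_c stay frozen throughout fine-tuning.
    """
    # Stage 1 (personalization): fit the unseen-KPI batch B1, moving parameters to theta*.
    optimizer.zero_grad()
    recon_loss(model, batch_new).backward()
    optimizer.step()

    # Stage 2 (generalization): combine the losses on B1 and B2 at theta*,
    # assuming Equation (4) is a convex combination weighted by alpha.
    optimizer.zero_grad()
    l1 = recon_loss(model, batch_new)
    l2 = recon_loss(model, batch_pretrain)
    loss = alpha * l1 + (1.0 - alpha) * l2
    loss.backward()
    optimizer.step()
    return loss.item()
```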

Denoising Reconstruction
The core idea of denoising reconstruction is to reconstruct a denoised time series rather than the original one. In the KAD task, a denoised reconstruction is more effective for detecting anomalies, because unsupervised methods use the error between the original data and the model's output to detect anomalies [18,28,33,34].
The main goal of the model is to learn the normal time series patterns from the original data. However, in a real-world environment, large-scale training data inevitably contains some anomalies and noise, so it is impossible for the model to learn from perfectly normal data. To tackle this problem, we design the denoising reconstruction mechanism, which reconstructs a denoised time series to better distinguish normal data from anomalies.
In the denoising reconstruction mechanism, there are two data flows. The first is the context data flow, which captures the context information of the data point; the stride of the context sliding window is 1. The second is the history data flow, which captures the long-term dependency of the time series. The stride of the historical window can be the period of the time series or a value with physical meaning determined by the user. The goal of the historical window is to provide the denoising decoder with the historical information of the time series.
The input of the context decoder comes entirely from the context encoder, without historical information. We want the output of the context decoder to be as close as possible to the original time series. The input of the denoising decoder consists of two parts: one part comes from the history encoder, and the other from the context encoder. The key ($K$) and value ($V$) matrices come from the history encoder, and the query matrix ($Q$) comes from the context encoder. The goal is to use the same context query matrix $Q$ to query the historical information and let the historical knowledge help reconstruct a denoised KPI.
The loss of the denoising reconstruction has two parts, from the denoising decoder and the context decoder, respectively. $X_1$ denotes the output of the context decoder, and $X_2$ denotes the output of the denoising decoder. The metric used to evaluate the reconstruction performance is the Mean Squared Error (MSE), which is widely used in time series anomaly detection [11,29,34].
The final output used for calculating anomaly scores is the average of $X_1$ and $X_2$.
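As a small illustration, the sketch below shows how the two decoder outputs could be combined into a training loss and a point-wise anomaly score; the equal weighting of the two MSE terms and the squared-error score are our assumptions.

```python
import torch
import torch.nn.functional as F

def denoising_loss(x: torch.Tensor, x1_context: torch.Tensor, x2_denoised: torch.Tensor) -> torch.Tensor:
    """Training loss (sketch): MSE of each decoder output against the input window,
    summed with equal weight (an assumption; the paper may weight the terms differently)."""
    return F.mse_loss(x1_context, x) + F.mse_loss(x2_denoised, x)

def anomaly_score(x: torch.Tensor, x1_context: torch.Tensor, x2_denoised: torch.Tensor) -> torch.Tensor:
    """Point-wise anomaly score: reconstruction error against the averaged decoder outputs."""
    x_hat = 0.5 * (x1_context + x2_denoised)
    return (x - x_hat) ** 2
```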

Experiments
In this section, we evaluate KAD-Disformer using various KPI anomaly detection datasets collected from the real-world environment to answer the following questions.
• RQ1: How does KAD-Disformer perform overall compared with state-of-the-art anomaly detection methods?
• RQ2: Can KAD-Disformer quickly achieve a desirable performance after being tuned with a small number of samples?
• RQ3: Is the design of disentangled projection helpful for improving the effectiveness of KAD-Disformer?
• RQ4: Is the uTune mechanism helpful for improving the effectiveness of KAD-Disformer?
As for the baseline models, we select one classic statistical method and five deep learning-based models, two of which employ their own tailored fine-tuning techniques. First, the ARIMA model [22] is a conventional statistical anomaly detection approach that enjoys widespread industrial use. Within the deep learning spectrum, we choose LSTM-NDT (RNN-based) [11] and Donut (VAE-based) [33], both of which have achieved significant popularity and extensive application in various industrial scenarios. AnomalyTrans (Transformer-based) [34] represents the state-of-the-art unsupervised method in the KAD field. Lastly, we select ATAD [41] and AnoTransfer [40], two exemplary transfer learning-based anomaly detection models, which are capable of addressing universal AD challenges. AnoTransfer is also the state-of-the-art universal AD model.

Dataset and Evaluation Metric
We conduct experiments on four datasets collected from different real-world online service systems. The overall dataset statistics can be found in Appendix D. Dataset A is a public dataset from the 2018 International Artificial Intelligence for IT Operations (AIOps) algorithm competition [16]. Dataset B is Yahoo Webscope, collected from Yahoo online service systems [12]. Dataset C is NAB, a benchmark for evaluating time-series anomaly detection algorithms in real-time applications [13]. Dataset D is collected from a real-world cloud service system serving millions of users. Due to space limitations, more information about the datasets can be found in Appendix D.
Precision, recall, and F1-score (denoted as $P$, $R$, and $F1$) are common KPI anomaly detection metrics, but their traditional forms are not ideal for interval anomalies in KAD. Improved metrics have emerged to address this gap and have been widely used in the KAD area [14,18,28,33,40] (denoted as $P^*$, $R^*$, and $F1^*$). In this evaluation approach, a labeled anomalous segment is deemed correctly detected if any part of it is identified, marking the whole segment as true positives or, if overlooked, as false negatives. To assess methods holistically, we use both the traditional and the enhanced metrics, alongside the Area Under the Curve (AUC). Efficiency is gauged by measuring each model's total processing time, from pre-training to inference.

Overall Performance (RQ1)
To evaluate the overall performance of KAD-Disformer, we successively choose one of the four datasets as the target dataset for fine-tuning and testing and the remaining three datasets as the source datasets for pre-training. We split the selected target dataset into two parts: the training part (50%) and the test part (50%). As shown in Table 2, the data is sufficient, and 50% is enough for each model to converge. For the transferable methods (ATAD, AnoTransfer, and KAD-Disformer), we pre-train the model on the source datasets, tune it on the training part of the target dataset, and evaluate it on the test part of the target dataset. To test whether KAD-Disformer can keep high performance with fewer fine-tuning samples, we use the first 10% and 50% of the training part of the target dataset to tune KAD-Disformer and test it on the test part of the target dataset. For the non-transferable methods, we train the models on the source datasets and the training part of the target dataset and test them on the test part of the target dataset. It is noteworthy that we also train the non-transferable methods from scratch using only the training part of the target dataset; the results are the same and are therefore omitted from Table 1.
From Table 1, we observe that on all four target datasets, KAD-Disformer achieves the best F1-score and AUC among all comparative methods, including the classic statistical method and the deep learning-based methods. KAD-Disformer also achieves the lowest total time consumption (including pre-training, fine-tuning, and inference) among deep learning-based methods. The improvements in detection accuracy mainly come from two aspects. The first is the Transformer architecture; this is confirmed by the fact that Transformer-based methods such as AnomalyTrans and KAD-Disformer achieve improved detection accuracy compared with the RNN-based (LSTM-NDT) and VAE-based (Donut) models. Second, denoising reconstruction alleviates the negative influence of noise and anomalies in the training data. The contribution of the denoising reconstruction is analyzed in Appendix F.2. Remarkably, with only 10% of the fine-tuning data, KAD-Disformer achieves comparable, even better performance than the other methods. This is the contribution of our uTune mechanism; further analysis of uTune is in Section 5.5.
There is no doubt that the classic statistical method ARIMA is the most efficient method; however, its accuracy is the lowest. LSTM-NDT is almost the worst method with respect to time efficiency due to its sequential LSTM module. Besides, because of the early-stop mechanism, the convergence speed of LSTM also limits its efficiency: the faster a model converges, the less time it spends. It is surprising that Donut is quite efficient and spends comparable or even less time than the transferable ATAD. This is because of the simple architecture of the model and the notable convergence speed of VAE. Another important observation is that AnomalyTrans and ATAD consume a lot of time on each dataset. For AnomalyTrans, the high time consumption is caused by its computation-intensive Transformer-based architecture; for ATAD, it is caused by the random forest architecture.

Few-Shot Learning Ability (RQ2)
Thanks to our novel disentangled projection matrices and uTune mechanism, KAD-Disformer can be well tuned with a handful of samples, which means it can save a lot of time otherwise spent collecting fine-tuning data. To evaluate this property, we measure how the accuracy of our model varies with the number of fine-tuning samples. The results are shown in Figure 4. We gradually give more fine-tuning samples to the model (adding 2 hours of data each time) and record the performance metrics.
Here we use the B, C, D → A scenario and take ATAD and AnoTransfer as the baseline models. There are two reasons why we select dataset A as the fine-tuning dataset. The first is that dataset A is a public dataset, which is beneficial for reproducing our results. The second is that dataset A has the largest number of data points and the longest mean curve length among the three public datasets A, B, and C, which allows it to cover a wide range of fine-tuning data sizes.
From the results, we find that KAD-Disformer consistently outperforms the two comparative methods given the same quantity of data, indicating its effectiveness in quickly adapting from source to target. Another observation from Figure 4 is that AnoTransfer achieves better performance than ATAD, which is consistent with the results of [40]. More importantly, we find that the performance of KAD-Disformer grows much faster with the growth of fine-tuning data. With only 1/8 of the fine-tuning samples, KAD-Disformer achieves performance competitive with AnoTransfer, saving about 25 hours. This confirms that KAD-Disformer has few-shot learning ability: it quickly adapts to an extremely small number of new fine-tuning samples and maintains high-level generalization due to the "learn to learn" mechanism. As a result, KAD-Disformer can significantly reduce the time spent collecting sufficient fine-tuning data and enables fast deployment in real-world environments.

Ablation Study of Disentangled Projection Matrices (RQ3)
We conduct an ablation study to test the effectiveness of the disentangled projection matrices (denoted as DPM). We compare the performance with and without DPM (replaced with the original self-attention) under experiment settings similar to Section 5.2. Without DPM, all the projection matrices are updated during fine-tuning.
From Table 1, we observe that the model with DPM outperforms the one without DPM. Remarkably, given 10% of the fine-tuning data, the model without DPM suffers severe performance degradation. We conclude that the improvement brought by DPM comes from three parts. The first is that DPM saves the knowledge learned from pre-training in $W_c$ and keeps it during the fine-tuning stage. Without DPM, all the parameters of the projection matrices are updated during fine-tuning, which may lose some common knowledge and lead to performance degradation. The second is that DPM gives KAD-Disformer more learnable parameters during fine-tuning. Projection matrices are the most important parameters of the Transformer; by disentangling them, we double the number of learnable parameters, and more learnable parameters generally give the model more potential to achieve better performance. The third is that DPM alleviates the overfitting problem during fine-tuning thanks to the uTune mechanism illustrated in Section 4.4.

Contribution of uTune (RQ4)
To evaluate the contribution of our uTune mechanism, we conduct an ablation study under a setting similar to Section 5.3. We apply the traditional fine-tuning technique and our uTune to the same pre-trained KAD-Disformer model, respectively, and compare the performance tuned with different quantities of fine-tuning data. The result is shown in Figure 5.
From Figure 5, we find that without uTune, the performance suffers noticeable degradation when we use small-scale (10%) fine-tuning data. Given 5 hours of fine-tuning data, the F1-score decreases by nearly 30% compared with the model with uTune. Besides, even given the full fine-tuning data, the performance with uTune is still 14% higher than with the traditional fine-tuning technique.
Visualization. To further understand what uTune does, we collect the outputs of KAD-Disformer's encoder and compare the differences between the outputs after pre-training and after fine-tuning. We aggregate datasets A, B, and C as the pre-training dataset and regard dataset D as the fine-tuning dataset. We randomly sample 300 data points from the pre-training data and 80 from the fine-tuning data. Then we feed them to the models before and after uTune. We collect the outputs of the encoder, apply t-SNE [30] to reduce the dimension of the concatenated outputs, and visualize them in a 2D scatter plot, shown in Figure 6.
The first observation from Figure 6 is that the distribution of the fine-tuning data points (red) in Figure 6a is distinct from the distribution of the pre-training data (green). This is reasonable: the pre-trained model (i.e., before uTune) has never seen the unseen KPIs from the fine-tuning data, so the encoder cannot encode these data to the proper positions. However, after uTune, the distribution of the data points from the fine-tuning KPIs becomes very similar to the distribution of the pre-training data, as shown in Figure 6b. The result demonstrates that uTune can effectively adapt the distribution of the fine-tuning data to that of the pre-training data. After uTune, the decoders are familiar with the inputs, leading to satisfactory performance.
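For reproducibility, a visualization of this kind can be produced with a few lines of scikit-learn and matplotlib; the helper below is illustrative and not taken from the released code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_encoder_space(z_pretrain: np.ndarray, z_finetune: np.ndarray, title: str) -> None:
    """Project concatenated encoder outputs to 2-D with t-SNE and scatter-plot the two groups."""
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
        np.concatenate([z_pretrain, z_finetune], axis=0))
    n = len(z_pretrain)
    plt.scatter(embedded[:n, 0], embedded[:n, 1], c="green", s=8, label="pre-train data")
    plt.scatter(embedded[n:, 0], embedded[n:, 1], c="red", s=8, label="fine-tuning data")
    plt.title(title)
    plt.legend()
    plt.show()
```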

Lessons Learned From Deployment
Deploying KAD-Disformer in a real-world cloud service system serving millions of users and integrating Microsoft's online KPI data has offered significant insights [9,35,39]. We highlight key takeaways:
Pre-trained KPI Dataset. Our experience has led us to conclude that the pre-training dataset should encompass a broad spectrum of KPI types. This diversity ensures that the dataset's coverage extends across various KPIs while maintaining a balanced ratio in terms of the volume of data per KPI type. Additionally, it is paramount that the dataset spans an extensive period, ranging from several days to multiple months. Such temporal breadth is crucial for fostering a robust generalization capability during the pre-training phase.
Fine-tuning Phase. Our findings underscore the importance of the length of the data provided for fine-tuning. Ideally, this data should cover an entire cycle to ensure that the fine-tuning process is as effective as possible. We observed that when the fine-tuning data exceeded the span of one cycle, there was a notable enhancement in performance after fine-tuning.

Conclusion
Universal KPI anomaly detection is a crucial but challenging task for large-scale online service systems with hundreds of millions of KPIs. In this paper, we propose a disentangled Transformer model named KAD-Disformer to efficiently and effectively detect anomalies for an enormous number of KPIs. We design the novel disentangled projection matrices and uTune mechanism to help the model quickly fit an incoming KPI with limited fine-tuning samples and without the risk of over-fitting. Besides, the denoising reconstruction technique alleviates the influence of noise and makes KAD-Disformer more robust. We conduct experiments on four different real-world datasets, and the results show that KAD-Disformer outperforms the current state-of-the-art universal anomaly detection model by 13% in F1-score and achieves comparable performance with only 1/8 of the fine-tuning samples, saving about 25 hours. KAD-Disformer has been deployed in a real-world online service system serving millions of people for months. Besides, we are glad to share the source code of KAD-Disformer with researchers and engineers in this area. Our code is available at https://github.com/NetManAIOps/KAD-Disformer.

B Pseudo code for workflows
The workflow of fine-tuning (uTune) is shown in Algorithm 2, and the workflow of pre-training is shown in Algorithm 1.
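As a rough companion to Algorithm 1, the sketch below shows the pre-training loop described in the main text: all parameters are randomly initialized and every iteration updates all of them with the reconstruction loss. The two-output model interface and the optimizer choice are our assumptions.

```python
import torch

def pretrain(model: torch.nn.Module, loader, epochs: int = 10, lr: float = 1e-3) -> torch.nn.Module:
    """Sketch of the pre-training workflow: update all parameters each iteration."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for batch in loader:          # windows drawn from the accessible pre-training KPIs
            optimizer.zero_grad()
            x1, x2 = model(batch)     # context-decoder and denoising-decoder outputs (assumed interface)
            loss = mse(x1, batch) + mse(x2, batch)
            loss.backward()
            optimizer.step()
    return model
```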

C Sensitivity analysis of Hyper-Parameters
We use the B, C, D → A setting to analyze the influence of two hyper-parameters: $\alpha$ in uTune and the number of stacked encoder and decoder layers $N$ in KAD-Disformer.
From the results, we find that when $\alpha \in [0.2, 0.7]$ and $N \geq 3$, the performance of KAD-Disformer is stable and satisfactory. Thus, in our experiments, we choose $\alpha = 0.5$ and $N = 3$.

D Dataset Description
Dataset C consists of sub-datasets from companies like Twitter and AWS, containing time series of different lengths and metrics such as CPU utilization and cost-per-click. Dataset D features 67 time series from a cloud service system in Microsoft, spanning three months with minute-level granularity, labeled by experienced operators to reflect system health.

E Evaluation Metric
Improved anomaly detection metrics have recently been introduced and applied to contemporary research [14,26,33].Consider a labeled, continuous anomaly segment: we categorize the segment as accurately detected if the algorithm identifies any anomaly within that segment.Thus, every point within this anomalous segment is designated as a true positive (TP).Conversely, if the model fails to identify an anomaly, every point within the segment is designated a false negative (FN).Points that lie outside these abnormal segments are not adjusted.
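A compact sketch of this point-adjustment step (applied before computing $P^*$, $R^*$, and $F1^*$) is given below; the function name and array conventions are ours.

```python
import numpy as np

def point_adjust(labels: np.ndarray, preds: np.ndarray) -> np.ndarray:
    """If any point inside a labeled anomalous segment is detected, mark the whole
    segment as detected; points outside anomalous segments are left unchanged."""
    adjusted = preds.copy()
    in_segment, start = False, 0
    for i, lab in enumerate(labels):
        if lab == 1 and not in_segment:
            in_segment, start = True, i
        if in_segment and (lab == 0 or i == len(labels) - 1):
            end = i if lab == 0 else i + 1
            if adjusted[start:end].any():
                adjusted[start:end] = 1      # whole segment counted as true positives
            in_segment = False
    return adjusted

# Precision*, Recall*, and F1* are then the standard metrics computed on `adjusted` vs. `labels`.
```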

F More Ablation Study

F.1 Ablation Study of Disentangled Projection Matrices
The full performance comparison of our approach with and without the disentangled projection matrices (DPM) is shown in Table 3.

F.2 Ablation Study of Adapter Layers and Denoising Reconstruction Module
To verify the contributions of the adapter layers and the denoising reconstruction module, we conduct an ablation study by removing the adapter layers and the denoising decoder, respectively. The comparison results are shown in Table 3.
The results show that the performance decreases after removing either the adapter layers or the denoising reconstruction module. Without the adapter layers, the F1-score decreases by up to 32%, and without denoising reconstruction, it decreases by up to 20%, while the corresponding reduction in time consumption is marginal. Similar to DPM, the adapter layers are tunable parameters of KAD-Disformer, which directly determine the fine-tuning capacity. The denoising reconstruction alleviates the influence of noise and anomalies in the training data, preventing KAD-Disformer from learning abnormal patterns, which is also confirmed by the previous work Donut [33].

Figure 1: Examples of time series from a global Internet company. The red points mark the anomalies.

Figure 3: Demonstration of how the uTune works.

Figure 4: The performance of KAD-Disformer, AnoTransfer, and ATAD tuned with different percentages of data.

Figure 5: The performance comparison of KAD-Disformer with and without the uTune mechanism given different percentages of fine-tuning data.
Figure 6: Visualization (via t-SNE) of the encoder outputs before and after uTune.

Figure 7: The performance of KAD-Disformer with different values of $\alpha$.

Table 1: Overall performance of comparative methods. B, C, D → A indicates that dataset A is selected as the fine-tuning dataset and B, C, D are selected as the pre-training datasets, and so forth. KAD-Disformer-10%, KAD-Disformer-50%, and KAD-Disformer-100% indicate that we use the first 10%, 50%, and 100% of the training part of the fine-tuning dataset to tune KAD-Disformer. w/o DPM indicates KAD-Disformer without the Disentangled Projection Matrices, w/o Adap indicates without the Adapter Layers, and w/o Denoise indicates without the Denoising Reconstruction module. A method marked with * has its own tailored fine-tuning mechanism.

Table 2: The statistics of each dataset. A, B, and C are public datasets, and D is collected from a real-world web service provider.

Table 3: Performance comparison with different configurations.