SLIM: A Scalable Light-weight Root Cause Analysis Method for Imbalanced Data in Microservices

A newly deployed service, one kind of change service, can lead to a new type of minority fault. Existing state-of-the-art methods for fault localization rarely consider the imbalanced fault classification caused by change services. This paper proposes a novel method that utilizes decision rule sets to deal with highly imbalanced data by optimizing the F1 score subject to cardinality constraints. The proposed method greedily generates the rule with maximal marginal gain and uses an efficient minorize-maximization (MM) approach to select rules iteratively, maximizing a non-monotone submodular lower bound. Compared with existing fault localization algorithms, our algorithm adapts to the imbalanced fault scenario of change services and provides interpretable fault causes that are easy to understand and verify. Our method can also be deployed in the online training setting, with only about 15% of the training overhead of current SOTA methods. Empirical studies show that our algorithm outperforms existing fault localization algorithms in both accuracy and model interpretability.


INTRODUCTION
Monolithic services have been progressively restructured into more refined modules, comprising hundreds (or even thousands) of loosely coupled microservices [7,16,17,38]. Leading companies like Netflix, eBay and Alibaba have adopted this application model. Microservices offer several benefits that make them a powerful architecture, including simplified application development, resource provisioning efficiency and flexibility. Despite these promising advantages, microservices introduce complex interactions among modular services, which can make on-demand resource provisioning challenging and potentially lead to performance degradation.
To enable engineers to resolve failures efficiently, fault localization is at the core of software maintenance for online service systems. With the advancement of monitoring and collection tools, metrics, traces, and logs have become the three fundamental elements of fault localization. Metrics are numeric measurements taken at regular time intervals, which help explain how an application is functioning. Each trace records the process of a request being called through service instances and their operations [33,36]. Logs provide detailed information on the system's running status and user behavior.
Though tremendous efforts have been devoted to software service maintenance and observability, in practice failures are inevitable due to the increasing size and complexity of systems, and they can result in significant economic losses and user dissatisfaction [7,30,34]. Analyzing the root causes of such performance issues is non-trivial and often error-prone, as hundreds of services may exhibit anomalies (e.g., network congestion and limited available cores) that propagate to dependent services. Moreover, large microservice systems are highly active and dynamic [7], with numerous service changes.
It is important to note that faults in a service often occur during service changes, e.g., the initial period after deploying a new service or a code change. This is due to changes in system architecture, service versions (with backward or forward compatibility), and insufficient testing of the new service itself [3,27]. The Google SRE book points out that about 70% of failures are caused by changes in services [6]. Additionally, the lack of adequate fault data for newly deployed services hinders the learning process, making it hard for root cause analysis algorithms to localize such faults, even leading to a chain of incidents. However, existing fault localization algorithms mainly focus on utilizing new deep-learning and machine-learning models to fully exploit the information in multi-source data (logs, metrics and traces), while ignoring the highly dynamic runtime status with numerous services in change and the limited training data. Thus, they are unable to quickly learn and respond to new faults caused by service changes with a small amount of fault data [9,10].
Motivations. Existing fault localization methods for imbalanced classification generally rely on re-sampling the training data [28,29]. In fact, simply over-sampling the minority class or down-sampling the majority class may cause model overfitting and cannot generate extra insights from the data. We show that those SOTA algorithms cannot achieve significant performance improvement through re-sampling in Section 4.4.2. Besides, many deep learning-based fault localization methods [4,15,17,32], which claim to have good fault localization performance, generally require several hours for a single training session. Thus, they fail to handle services experiencing frequent failures. In addition, those algorithms only offer simple binary classification results with a lack of interpretability, making it difficult for engineers to understand, diagnose, and further prevent faults. Therefore, how to build an interpretable fault localization model that can quickly and accurately respond to newly deployed services' fault patterns from a limited number of imbalanced failures is an appealing but challenging problem [11,23].
We summarize three main technical challenges to build effective fault localization models for imbalanced datasets of newly deployed services as follows.
1) The first challenge arises from the imbalanced data of service changes (e.g., newly deployed services). We also note that simply applying re-sampling does not lead to performance improvement in our experiments.
2) The second challenge is model interpretability for operating engineers. Good interpretability can help engineers better understand the fault cause and possibly identify related risks. Unfortunately, most existing works on fault localization lack interpretability.
3) The third challenge arises from the unbearable training overhead of existing models, primarily attributed to the dynamic runtime environment in the microservice scenario.
In this paper, we aim to design an interpretable classifier for highly imbalanced data by learning decision rule sets [13] for fault localization. The classifier can be expressed in disjunctive normal form (DNF, OR-of-ANDs), which enjoys good interpretability due to its logical clauses. An example of a DNF model with two conditions is "IF (cpu_usage > 80 AND file_disk_read > 180) OR (file_disk_write > 70 AND memory_usage > 170) THEN ŷ = 1". It helps engineers understand which key metrics are affected by the fault. As we only build rules for the minority class (further called the positive class), it is an ideal interpretable model for imbalanced classification. Moreover, our model can achieve incremental (or online) training within minutes for every newly deployed service, which makes it deployable to a wide range of services at low cost.
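To make the DNF form concrete, here is a minimal sketch of how such a rule set classifies a sample; the thresholds and metric names mirror the example above and are illustrative, not learned SLIM rules.

```python
# A minimal sketch of a DNF (OR-of-ANDs) rule set classifier. A rule is a
# conjunction of predicates; the rule set fires if any rule fully matches.
def predict_dnf(sample, rule_set):
    """Return 1 if any rule (a list of predicates) is fully satisfied."""
    return int(any(all(pred(sample) for pred in rule) for rule in rule_set))

rule_set = [
    [lambda s: s["cpu_usage"] > 80, lambda s: s["file_disk_read"] > 180],
    [lambda s: s["file_disk_write"] > 70, lambda s: s["memory_usage"] > 170],
]

faulty = {"cpu_usage": 95, "file_disk_read": 200, "file_disk_write": 10, "memory_usage": 50}
normal = {"cpu_usage": 20, "file_disk_read": 50, "file_disk_write": 10, "memory_usage": 50}
print(predict_dnf(faulty, rule_set))  # 1: the first conjunction fires
print(predict_dnf(normal, rule_set))  # 0: no rule covers the sample
```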
In this paper, we propose a fault localization algorithm called SLIM (Scalable and interpretable Light-weight algorithm for Imbalanced data in Microservices) to address the aforementioned challenges. Here we summarize the main contributions as follows: 1) To the best of our knowledge, SLIM is the first fault localization algorithm to address the issue of imbalanced fault data in service change (e.g., newly deployed services) at the algorithm level within the microservices environment. 2) Our fault localization algorithm generates an interpretable rule set that assists engineers in understanding the root causes of failures. 3) SLIM is efficient and can be deployed easily at low cost. Compared to other algorithms, our model's training time is only around 15% of theirs in the most complex scenarios. 4) We apply our model's interpretable rule set to two use cases, replacing human experts in building prior knowledge. The results show that our interpretable rule set is comparable to expert knowledge and reduces the time required by 80%. Besides, our rule set knowledge base beats the precision of other interpretable methods' knowledge bases. 5) We have conducted extensive experiments on 359 failures from three systems, including three open-source benchmark datasets [29]. The results show SLIM's effectiveness, efficiency and interpretability. We also apply our algorithm in the real runtime environment of the largest cloud service provider in China and give a detailed case study in the experiment section. We also provide a demo on GitHub¹.

THE SLIM ALGORITHM

2.1 The Pipeline of SLIM
SLIM is an interpretable and scalable fault localization algorithm for microservice systems. It identifies the (faulty) services that cause performance degradation in a microservice system and provides explanations for operators to understand why. Figure ?? shows an overview of SLIM's pipeline, which primarily involves 4 modules. First, it processes logs and converts them into metrics as features. Then, the features are transformed into binary encodings. With these binary features as inputs, rules can be learned. Finally, the learned rules¹ are used to vote out faults. We will now introduce these 4 modules in detail.

¹ https://anonymous.4open.science/r/SLIM-5B7F/

Log Extraction Module.
This module extracts key log information and converts it to metric data. Operational log data is unstructured and not suitable for direct training, so we leverage log extraction methods [40] to deal with it. As Figure 1 shows, the procedure consists of log template parsing and analysis of template variation. We first parse the normal historical log messages to construct a standard template base offline using Drain [20]. These normal log templates serve as a historical reference to identify whether any new template exists in online log messages. Then we parse the online log messages and construct streaming time-series log templates. We compare the streaming time-series log templates with the template base and record the unmatched log templates, along with their quantity, types, and other features. We aggregate these features by time interval, aligning them with the metrics' sampling interval.
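The template-variation step above can be sketched as follows. The template strings and the 60-second interval are illustrative assumptions, and template parsing itself (e.g., by Drain) is assumed to have already produced the template strings.

```python
# A sketch of the template-variation step: compare streaming log templates
# against an offline template base and aggregate unmatched counts per time
# interval, aligning them with the metric sampling interval.
from collections import Counter

def count_unmatched(stream, template_base, interval=60):
    """stream: iterable of (timestamp_seconds, template_string).
    Returns Counter mapping interval index -> number of unmatched templates."""
    counts = Counter()
    for ts, template in stream:
        if template not in template_base:
            counts[ts // interval] += 1
    return counts

base = {"Connected to <*>", "Request <*> served in <*> ms"}
stream = [
    (10, "Connected to <*>"),           # known template
    (70, "OOM killer invoked on <*>"),  # new template -> anomaly signal
    (75, "OOM killer invoked on <*>"),
]
print(count_unmatched(stream, base))  # Counter({1: 2})
```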
2.1.2 Feature Binarization Module. This module generates binarized features from the metrics obtained in the feature extraction module. We employ a bucketing strategy using a sequence of thresholds to cut the numerical features into discrete, binarized features. For example, in Figure ??, the network latency has a continuous distribution between 100-500ms. The data is discretized into multiple values, such as network_latency ≤ 100, 100 < network_latency < 200 and network_latency ≥ 200. For categorical features, we use one-hot encoding to generate binarized features. We apply this discretization process uniformly across the data distribution. These discrete values can be combined to form rules.
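The bucketing strategy can be sketched as a small helper that turns one numeric value into indicator features; the threshold sequence mirrors the latency example (in ms) and is illustrative.

```python
# A minimal sketch of the bucketing strategy: cut a numeric feature into
# binary indicator features using a threshold sequence.
def binarize(value, thresholds):
    """Return indicators: value <= t0, t0 < value < t1, ..., value >= t_last."""
    feats = [int(value <= thresholds[0])]
    for lo, hi in zip(thresholds, thresholds[1:]):
        feats.append(int(lo < value < hi))
    feats.append(int(value >= thresholds[-1]))
    return feats

print(binarize(150, [100, 200]))  # [0, 1, 0]: 100 < latency < 200
print(binarize(350, [100, 200]))  # [0, 0, 1]: latency >= 200
```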

Efficient Rule Learning Method.

This module learns the rule set via our proposed novel classifier. The algorithm first identifies rule sets for each fault type; namely, for each fault type we train a sub-model and obtain a rule set. Then, samples that hit the rules in a rule set are classified as the corresponding fault samples. The detailed rule learning method is discussed in Section 2.2.

Fault Localization.
This module provides the final results, including the localization of fault types and the localization of faulty services. We design fault ranking methods by adopting a voting mechanism that takes into account both the hit counts of each rule set and the rules' confidence.

Detailed Procedure of Rule Set Learning
In this section, we introduce our rule learning methods, including Rule Set Selection in Section 2.2.2 and Efficient Rule Generation in Section 2.2.3. Before diving into the details, we first define some key notations.

Rule and rule set. Our classifier is a rule set that consists of rules. We now give the definitions of these two ingredients and the related notations.

Definition 1 (Rule).
A rule r is a set of feature indices, i.e., r is a subset of Γ. A sample (x_i, y_i) is covered by a rule r if and only if r ⊆ {j ∈ Γ | x_{i,j} = 1}.

Definition 2 (Rule Set). A rule set R consists of multiple rules and serves as a classifier, which classifies a sample as positive if the sample is covered by at least one rule in R, and as negative if no rule in R covers it. Therefore, the predicted label can be calculated via ŷ_i = max_{r ∈ R} min_{j ∈ r} x_{i,j}.
Let X_j, X_r and X_R denote the sets of samples covered by the j-th feature, the rule r and the rule set R, respectively. In the following we use these different subscripts to distinguish sets of samples covered by different features, rules and rule sets.
According to the relationships among features, rules and rule sets, we get X_r = ∩_{j ∈ r} X_j and X_R = ∪_{r ∈ R} X_r. We define a set operator + as a positive sample filter, meaning that X′⁺ returns the set containing all positive samples in an arbitrary given set X′.
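These coverage sets translate directly into set operations; a minimal sketch, with hypothetical feature-coverage data, is:

```python
# X_j is the set of sample indices where feature j is 1; X_r intersects over
# a rule's features, X_R unions over rules, and the "+" operator keeps
# positive samples only (here: intersection with a positives set).
def cover_rule(X_feature, rule):            # X_r = intersection of X_j, j in r
    sets = [X_feature[j] for j in rule]
    return set.intersection(*sets) if sets else set()

def cover_set(X_feature, rule_set):         # X_R = union of X_r, r in R
    return set().union(*(cover_rule(X_feature, r) for r in rule_set))

X_feature = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {5}}
positives = {2, 5}
X_R = cover_set(X_feature, [{0, 1}, {2}])
print(X_R)              # {2, 3, 5}
print(X_R & positives)  # X_R^+ = {2, 5}
```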
Problem Formulation. The common evaluation metrics, accuracy and error rate, are no longer applicable in imbalanced classification as they are prone to be dominated by the majority class [24]. To address this issue, we choose the F1 score instead, which combines both precision and recall and focuses on the performance on the minority positive samples. Since n_right is the number of correctly classified positive samples and n_hit is the number of positively predicted samples, n_hit, n_right and the number of positive samples n_pos can be rewritten as |X_R|, |X_R⁺| and |X⁺|, respectively. Formally, we formulate the F1 score as

F1(R) = 2|X_R⁺| / (|X_R| + |X⁺|).

To maximize the F1 score, we have the following optimization problem:

max_R 2|X_R⁺| / (|X_R| + |X⁺|) s.t. |r| ≤ L for every r ∈ R, |R| ≤ K,

where we set a predefined parameter L to be the maximum feature length of a rule, and K to be the maximum number of rules in a rule set. These constraints ensure the interpretability of rules. By taking the logarithm, we can rewrite the objective in the following form:

log |X_R⁺| − log(|X_R| + |X⁺|).   (4)

2.2.2 Rule Set Selection. We now present the details of our efficient rule selection method. Let us rewrite the two logarithm components of (4) as

f(R) = log |X_R⁺| and g(R) = log(|X_R| + |X⁺|).

As the logarithm function is nondecreasing and concave, both f(R) and g(R) are non-negative monotone submodular functions [5]. Consequently, the objective function can be viewed as the difference between two submodular functions. Our proposed method, referred to as SLIM, is based on DistortedGreedy [19], a method for maximizing the difference between a non-negative monotone submodular function and a modular function. We will show that, by introducing the notion of curvature, DistortedGreedy is applicable to our problem. We first define the curvature α of g(R) to measure the closeness of g(R) to a modular function; α is unknown a priori. The complete procedure for rule set selection is summarized in Algorithm 1. Given the training dataset, the maximal number of rules and the limit on the length of rules, and initializing the rule set as an empty set, we iteratively add the rule r* that maximizes a distorted marginal gain of (4) with a parameter γ_i, i.e.,

r* = arg max_r γ_i · f(r | R) − g(r | R),   (6)

where f(r | R) = f(R ∪ {r}) − f(R) and g(r | R) = g(R ∪ {r}) − g(R) denote the marginal gains. Similar to DistortedGreedy, we adaptively update the tradeoff between f(r | R) and g(r | R) using γ_i in line 5 of Algorithm 1. In early stages, a small value of γ_i is adopted to select rules with higher precision; the value of γ_i is gradually increased to improve the recall of the rule set. In other words, SLIM tends to select the rules with higher precision in the early iterations and focuses on the rules with higher recall later. The rationale behind the γ_i updating strategy is that, by first focusing on the rules with high precision and low recall, SLIM can achieve high precision and later improve recall by including more rules. However, if rules with high recall but low precision are given priority in the early stages, it is difficult to eliminate the effect of the false positive samples.
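A hedged sketch of this selection loop is given below. The functions f and g follow the definitions above (with a +1 inside f's logarithm so it stays finite on the empty set), while the linear γ schedule and the candidate rules are illustrative assumptions, not the paper's exact DistortedGreedy update.

```python
# Sketch of greedy rule selection with a distorted marginal gain
# gamma_i * f(r|R) - g(r|R), where f(R) = log(|X_R^+| + 1) and
# g(R) = log(|X_R| + |X^+|). Candidate rules are given as coverage sets.
import math

def select_rules(candidates, positives, K):
    def f(cov): return math.log(len(cov & positives) + 1)
    def g(cov): return math.log(len(cov) + len(positives))
    chosen, covered = [], set()
    for i in range(K):
        gamma = (i + 1) / K  # small early (favor precision), larger later (favor recall)
        def gain(cov):
            new = covered | cov
            return gamma * (f(new) - f(covered)) - (g(new) - g(covered))
        best = max(candidates, key=gain)
        if gain(best) <= 0:
            break
        chosen.append(best)
        covered |= best
    return chosen

# Toy data echoing the example in the text: positives 0..19, negatives >= 100.
A = set(range(10)) | {100}                    # 10 positives, 1 negative
B = set(range(10, 20)) | {101}                # the other 10 positives, 1 negative
C = set(range(18)) | {100, 101, 102, 103, 104}  # 18 positives, 5 negatives
rules = select_rules([A, B, C], set(range(20)), K=2)
covered = set().union(*rules)
print(len(covered & set(range(20))), len(covered - set(range(20))))  # 20 2
```

With the growing γ schedule the precise rules A and B are preferred over the broader but noisier rule C, matching the intuition described above.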
To better illustrate this, we show a toy example in Figure 2. Given a dataset with 20 positive and 100 negative samples, our goal is to select 2 rules from 3 candidate rules, namely rules A, B and C, where rule A covers 10 positive and 1 negative samples, rule B covers the remaining 10 positive samples which are not covered by rule A plus 1 additional negative sample, and rule C covers 18 positive samples and 5 negative samples. We first discuss the scenario where we replace the γ update strategy in line 3 of Algorithm 1 with γ_i = 1. At the first iteration, the marginal gain of rule A is log(10/(20 + 11)) ≈ log(0.32). Similarly, the marginal gains of rule B and rule C are log(10/(20 + 11)) ≈ log(0.32) and log(18/(20 + 18 + 5)) ≈ log(0.42), respectively. In this scenario, SLIM will select rule C in the first iteration and rule B (or A) in the second iteration. Finally, SLIM constructs a rule set which covers 20 positive samples and 6 negative samples. However, with the proposed γ update strategy, the distorted marginal gain penalizes rule C more heavily for its 5 false positives (its gain drops to approximately log(0.099)), so SLIM returns a better rule set that consists of rule A and rule B, which covers only 2 negative samples. We also give the theoretical guarantee for the proposed method in the appendix.

2.2.3 Efficient Rule Generation. Algorithm 1 involves finding a rule r to maximize a distorted marginal gain of (4), i.e.,
solving problem (6). As the number of possible rules is exponential in the number of features, solving problem (6) is NP-hard. To address this issue, we propose an efficient rule generation method, which solves (6) approximately. Because the terms f(R) and g(R) are independent of r, (6) can be reduced to

max_r γ_i · f(R ∪ {r}) − g(R ∪ {r}).   (7)

Notice that a rule r is a set of feature indices, so finding an optimal rule is equivalent to finding a set of features. The expression in (7) can be further rewritten as Q(r) = log(p(r)) − log(q(r)), where p(r) and q(r) denote the two covered-sample counts entering (7). Directly maximizing Q(r) is difficult: although p(r) is a supermodular function, the presence of the logarithm makes the property of log(p(r)) non-trivial. In our algorithm, Q(r) is maximized using the MM algorithm [21], which iteratively increases the value of the objective function by maximizing a tight lower bound. We propose a proper lower bound of Q(r) by finding a lower bound of log(p(r)) and an upper bound of log(q(r)) separately.
Lemma 1. Both f_1(· | r^(t)) and f_2(· | r^(t)) are non-monotone submodular functions.

Maximizing a non-monotone submodular function subject to cardinality constraints has been extensively studied in the literature. Specifically, SLIM maximizes f_1(· | r^(t)) and f_2(· | r^(t)) using a simple local search method. As shown in [26], by identifying the cardinality constraint as a matroid constraint, the local search method provides at least a 1/4-approximation to the optimum.

Fig. 3 shows the overall framework of our rule generation method. At the t-th iteration, we first generate surrogate functions f_1(· | r^(t)) and f_2(· | r^(t)) of the rule-generation objective according to the current estimate r^(t). Then we maximize f_1(· | r^(t)) and f_2(· | r^(t)) using the local search technique and arrive at a new estimate r^(t+1). Our method only involves set operations and is hence computationally efficient. The computational complexity can be further improved by permitting early stopping, i.e., terminating the local search if no significant improvement is achieved by replacing features.
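The local search with early stopping can be sketched as below. The coverage-based toy score stands in for the surrogate functions, and the greedy seeding is an assumption of this sketch.

```python
# A sketch of local search under a cardinality constraint: start from a
# greedy seed of at most L features, then repeatedly swap a chosen feature
# for an outside one while the (assumed submodular) objective improves.
def local_search(features, score, L, max_iters=100):
    current = set(sorted(features, key=lambda j: -score({j}))[:L])
    for _ in range(max_iters):
        improved = False
        for j in list(current):
            for k in features - current:
                swapped = (current - {j}) | {k}
                if score(swapped) > score(current):
                    current, improved = swapped, True
                    break
            if improved:
                break
        if not improved:
            break  # early stopping: no swap improves the objective
    return current

# Toy objective: number of samples covered by the union of selected features.
X = {0: {1}, 1: {1, 2}, 2: {3, 4}, 3: {2}}
score = lambda S: len(set().union(*(X[j] for j in S))) if S else 0
best = local_search(set(X), score, L=2)
print(best)  # {1, 2}: covers samples {1, 2, 3, 4}
```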

Detailed Procedure of Fault Localization
In this section, we introduce the detailed procedure of fault localization. For the localization of the fault type, let c* denote the selected fault type, R the collection of all fault types' rule sets, s_i the i-th sample in the current time period, c_j the j-th fault type and S the set of all samples. I_{R_j}(s_i) is an indicator function that takes the value 1 if s_i is hit by the rule set R_j and 0 otherwise. If the i-th sample is hit by multiple rules in the j-th rule set, we select the rule with the highest precision (measured on the training set) to hit the sample. So we calculate the probability of the i-th sample belonging to the j-th fault type as Equation (12) writes. Then, in Equation (13), we sum up the sample-level probabilities for each fault-type rule set R_j and pick the fault type with the highest value as the root cause:

c* = arg max_j Σ_{s_i ∈ S} I_{R_j}(s_i) · P(s_i | R_j).   (13)

For the localization of the service, let v* denote the selected service and S_k the set of all samples in the k-th service; s_i ∈ S_k is the i-th sample in the k-th service, and I_{R_j}(s_i) again indicates whether s_i is hit by rule set R_j. We first group the samples by service. As Equation (14) shows, similar to fault type localization, the probability for each sample is the precision of the highest-precision rule in R_j that hits it, and 0 otherwise. Then we sum up, for each service, the probabilities of all its samples hit by the rules, weighted by the corresponding rule precision scores, according to Equation (15). We sort the resulting scores of every service and choose the service with the highest value as the root cause.

P(s_i | R_j) = max{prec(r) : r ∈ R_j, r hits s_i}, and 0 if no rule hits s_i,   (14)

v* = arg max_k Σ_{s_i ∈ S_k} P(s_i | R_j).   (15)

The fault localization module helps us confirm the failing service and the fault type of a failure. We give a detailed evaluation of the effect of our model in the experiment section.
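A hedged sketch of the voting mechanism described above, with hypothetical rule predicates and precision values:

```python
# Each sample hit by a fault type's rule set contributes the precision of the
# most precise rule that hits it; the fault type (or, analogously, the
# service) with the largest summed score is returned as the root cause.
def localize(samples, rule_sets):
    """rule_sets: {fault_type: [(rule_predicate, precision), ...]}."""
    scores = {}
    for fault, rules in rule_sets.items():
        total = 0.0
        for s in samples:
            hits = [prec for rule, prec in rules if rule(s)]
            total += max(hits) if hits else 0.0  # best-precision hitting rule, else 0
        scores[fault] = total
    return max(scores, key=scores.get), scores

rule_sets = {
    "cpu_exhaustion": [(lambda s: s["cpu"] > 80, 0.9)],
    "network_delay":  [(lambda s: s["latency"] > 200, 0.7)],
}
samples = [{"cpu": 95, "latency": 50}, {"cpu": 90, "latency": 250}]
fault, scores = localize(samples, rule_sets)
print(fault)  # cpu_exhaustion (score 1.8 vs 0.7)
```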

THE APPLICATION OF INTERPRETABLE RULESET
Our interpretable rule set can assist engineers in confirming the root cause and finding the metrics most relevant to an anomaly. This allows the model to cooperate with expert knowledge for troubleshooting and debugging system errors. On this basis, we go a step further and apply our model to generate the knowledge base for existing fault localization algorithms. We introduce these applications in the following part.

Overview of the Knowledge Base Generator
Usually, an anomaly knowledge base is constructed manually to store expert experience and quickly localize historical fault types. When an anomaly first occurs, experts analyze its key features and generate fault fingerprints for the knowledge base. The fingerprint helps engineers solve the problem quickly when it recurs. Therefore, expert knowledge plays an important role in many algorithms [14,39]. These algorithms leverage expert knowledge to describe every fault type and construct prior knowledge from metrics and trace data. This is called the knowledge base or case base. Although the knowledge base is necessary for real systems, it is expensive for a company to recruit experts and build it manually. Thus, our interpretable rule set can help construct the knowledge base and reduce costs. Fig. 4 shows an overview of knowledge base generation. We replace the expert knowledge component in these algorithms with our rule set in order to build the knowledge base required for diagnosing faults in their models.

CloudRCA
CloudRCA [39] leverages RobustSTL to extract abnormal metrics and identifies important system metrics using an expert knowledge base. The selected metrics are learnt via a Knowledge-informed Hierarchical Bayesian Network (KHBN) to perform root cause analysis.
In our implementation, we replace RobustSTL and the expert knowledge with our interpreter to find the most important metric sequences. We first pass labeled training data through SLIM, where the model analyzes the fault cases to give the key rule set. Then, we construct the feature matrix according to the rule set, including the key metric and log information. Finally, the KHBN completes the root cause analysis. We run this experiment on dataset A and evaluate CloudRCA's performance.

MicroCBR
Similarly, we also integrate our rule set into MicroCBR [14], which leverages labeled anomaly cases to construct a knowledge base and performs root cause analysis through case-based reasoning. Each case records a specific root cause and its solution, along with a set of anomalies detected from resource metrics, logs and other operating information. The precision of case-based reasoning is dominated by these abnormal metrics.
We leverage our rule set to automatically construct anomaly knowledge for its knowledge base. First, SLIM learns from the labeled data. Then, our rule set reveals the important metrics for every fault type. MicroCBR aims to record time-series metric fingerprints, which include every key metric's spike or dip. We leverage our rule set to help MicroCBR select the key metrics and construct fingerprints.

EXPERIMENTS
In this section, we perform experiments on benchmark datasets to show the performance of SLIM in comparison with state-of-the-art fault localization algorithms. Datasets A and B are obtained from two different production service systems, into both of which 18 types of faults are injected. They can be summarized as: (i) CPU exhaustion on containers, physical servers and middleware; (ii) packet loss and delay on services and physical nodes; (iii) database connection limit and close (dataset A only); (iv) low free memory in JVM/Tomcat (dataset B only); (v) disk I/O exhaustion (dataset B only) [29]. Dataset C is generated by the TrainTicket booking microservice system [28,41], where the fault classes include network delay and CPU consumption. Dataset D comes from a real-world system of one of the biggest cloud service providers (we refer to it as Company ALC for brevity). It consists of 35 incidents that occurred on our cloud platform. These incidents are collected, and the root-cause services are verified by our SRE team. For every fault, we have about 12,000 metrics collected during 3,600 seconds (half an hour before and half an hour after the anomaly was reported) from 2,173 microservices. We summarize the number of features, samples, services, and fault classes in Table 1.

Baseline Algorithms.

We compare our proposed SLIM with several state-of-the-art fault localization algorithms: Dejavu, Seer, MEPFL-RandomForest (MEPFL-RF), MEPFL-Multilayer Perceptron (MEPFL-MLP), decision tree, Eadro, Murphy, Sage and AutoMap. Dejavu is an actionable and interpretable fault localization method for recurring failures, where graph attention networks are used to localize the fault [29]. Murphy [18] is based on a Markov Random Field (MRF) that can take advantage of loose associations to reason about how entities affect each other in the context of a specific incident. Sage [15] is based on a conditional VAE that simulates the service's status and performs counterfactual analysis by restoring the service's abnormal metrics to confirm the root cause. Eadro [25] is similar to Dejavu in that it leverages graph attention networks to learn from logs, metrics and traces, embedding logs into the node features. Seer [17] captures the RPC-level graph dependency and metrics by training a hybrid deep learning network that combines a CNN (Convolutional Neural Network) with an LSTM (Long Short-Term Memory). MEPFL-RF and MEPFL-MLP [42] treat fault localization as a classification problem and solve it using the traditional machine learning methods RandomForest and Multilayer Perceptron, respectively. We also compare our method with a decision tree due to its interpretability. AutoMap leverages multi-dimensional metrics to dynamically generate a service relationship graph and then uses a random walk algorithm to localize the fault in the graph [31].

Experiment Environment and Parameter Tuning.

We implement SLIM using Python 3.7 and the Go language. All experiments are conducted on a personal computer with a 3070 Ti GPU, 32GB RAM and a 5800X processor with 6 cores. We tune parameters for all methods by 5-fold cross-validation. Specifically, for SLIM, we choose the number of rules, i.e., parameter K, from {2, 4, 8, 12}, and limit the length of a rule to no more than 6 to ensure interpretability, i.e., L = 6. Table 3 shows a detailed analysis of the Feature Binarization module, where we test the performance impact of our rule set algorithm under different numbers of interval partitions.
For DecisionTree and MEPFL-RF, the number of samples at each leaf node is tuned from 1 to 100 and the number of trees is tuned over {1000, 2000, 3000}. For Dejavu, we set the parameters according to the suggestions in [29].

Performance on Fault Localization
4.2.1 Evaluation Metrics. Top-k Accuracy, referred to as A@k, is used to measure the performance of each method. Top-k Accuracy computes the probability that the root causes can be located within the top k service instances among all candidates. Higher A@k indicates more accurate root cause localization. Here we measure the performance of each method using A@1, A@2, and A@3.
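A@k can be computed as follows; the service names are illustrative:

```python
# A minimal sketch of A@k: the fraction of cases whose ground-truth root
# cause appears among the top k ranked candidates.
def top_k_accuracy(rankings, truths, k):
    hits = sum(truth in ranking[:k] for ranking, truth in zip(rankings, truths))
    return hits / len(truths)

rankings = [["svc-a", "svc-b", "svc-c"], ["svc-b", "svc-c", "svc-a"]]
truths = ["svc-a", "svc-c"]
print(top_k_accuracy(rankings, truths, 1))  # 0.5
print(top_k_accuracy(rankings, truths, 2))  # 1.0
```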
Kappa Analysis, i.e., Cohen's kappa analysis [35], is a statistical method used to measure inter-rater reliability. It is generally considered a more robust measure than a simple percent-agreement calculation. Because kappa analysis requires a single categorical outcome per case, while our model provides root cause rankings, we select the top-1 (A@1) result from the ranking as the final localization outcome to carry out the kappa test.
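For reference, Cohen's kappa on binary localization outcomes can be computed from the observed agreement p_o and the chance agreement p_e; the sample labels are illustrative:

```python
# Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
# agreement and p_e the agreement expected by chance from the marginal
# label frequencies of the two raters.
from collections import Counter

def cohen_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

pred  = [1, 1, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 0]
print(round(cohen_kappa(pred, truth), 3))  # 0.667
```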

Performance.
We present the fault localization results in Table 2, where the results are averaged over 5 independent trials. From Table 2, we see that the proposed SLIM achieves the highest Top-1 accuracy and Cohen's kappa value on datasets A, B and C. This is because we employ the F1 score as the objective function, which is robust to data imbalance. In contrast, Dejavu simply resamples the data to balance the number of samples in each class and thus performs slightly worse than SLIM [29]. Eadro and Dejavu share similar performance outcomes because they employ the same methodology. Sage and Murphy leverage counterfactual methods to restore the system and localize the root cause. Because Sage and Murphy are semi-supervised counterfactual inference methods, their performance is slightly lower than that of the supervised algorithms; however, they are more suitable for fault recovery and exploration under service change. Seer, Decision Tree, MEPFL-RF, and MEPFL-MLP have no specific designs to address imbalanced data, leading to inferior performance. AutoMap performs poorly on three datasets. This is because such approaches do not make use of any historical fault information until the ground truth of similar historical faults is identified. Some critical intermediate steps in these methods, such as anomaly detection and similarity evaluation, despite being carefully designed, are entirely unsupervised. As a result, they may be susceptible to confusion from irrelevant abnormal changes in other metrics, caused by noise or fluctuations, particularly when the number of metrics or fault units is high. On the contrary, SLIM focuses on the key metrics from the rule set generated from historical failures. Seer and Dejavu require long training times, so their computational costs are much higher than those of the remaining methods. We also note that Seer and Dejavu rely heavily on the current service topology diagram: the models need to be retrained once the service topology changes (such as when a new service is deployed). Overall, the high training overhead makes Seer and Dejavu unsuitable for scenarios that require fault localization methods to adapt quickly.

Evaluation of a Real-World System (Dataset D)
We present the fault localization results in Table 4. Due to the large number of services in the real system and the strict requirements on algorithmic overhead, the comparison algorithms we previously used, such as Dejavu and Seer, incur excessive computational costs as deep learning models and cannot adapt well to the system requirements. Therefore, we only compare against methods like Decision Tree and RandomForest. As shown in the table, our model still maintains better performance than the other methods. In addition, we also compare against Ripper [12], another rule-based algorithm, which however does not optimize for imbalanced data. Compared with Ripper, thanks to the optimizations we made for the imbalanced setting, our model has fewer false positives, resulting in a more accurate fault ranking. To verify the performance advantages of our method in the data-imbalance scenario, we extracted services to simulate newly deployed services for the imbalance test. We selected the fault "network_delay" in docker005 (dataset A), the fault "OS Network" in apache02 (dataset B) and the fault "OS Network" in Tomcat01 (dataset B). These faults occur more frequently than others in the entire dataset, making it easier to characterize the trend of fault localization performance with the frequency of occurrence. We set the test set for each service to include two faults, and the training set gradually increases from one occurrence to n − 2, where n is the total number of occurrences of the fault.

Numerical Results
We show the results of each fault localization algorithm on the three faults in Figure 6. Note that in this experiment we removed some underperforming algorithms from the previous performance comparison, because we could not determine whether their poor performance on newly deployed services was due to inherent shortcomings of the models or a result of imbalanced data. From Figure 6, we find that our model correctly identifies all the testing faults once they occur twice in the training set, while the remaining algorithms need more training samples to produce reliable fault localization. This means that, given a newly deployed service, the proposed SLIM needs only a small amount of historical data to train, and can thus significantly reduce the number of faults the system needs to experience. We further balance the number of faults by upsampling the data of minority faults using SMOTE [8]. We report the results of each method with SMOTE in Figure 7. Compared with the results in Figure 6, little improvement is achieved for most algorithms when SMOTE is used. For apache02 in dataset B, Seer improves slightly; however, SMOTE has a negative impact on Seer for the fault "docker005" in dataset A. This is because SMOTE may introduce noisy data into the model, which can affect the precision of Seer.
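SMOTE [8] balances classes by synthesizing new minority samples along line segments between pairs of existing minority samples. The sketch below shows only this core interpolation idea and is not the full algorithm (real SMOTE interpolates toward one of a sample's k nearest minority neighbors; here a partner is chosen at random for brevity):

```python
import random

def smote_like_upsample(minority, n_new, seed=0):
    """Synthesize n_new samples by interpolating between random minority pairs."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)          # pick two distinct minority samples
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

# Three hypothetical minority-fault feature vectors.
minority = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.95]]
new_samples = smote_like_upsample(minority, n_new=5)
print(len(new_samples))  # 5 synthetic minority samples
```

Because every synthetic point lies between two real minority samples, noise in those samples is interpolated into the new data as well, which is consistent with the precision drop we observe for Seer.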

Performance of Knowledge Base Generator
We evaluate the generator using precision and time consumption. Two AIOps experts were recruited to construct the knowledge base manually using their expert knowledge. The experts have more than two years of operating experience and have publications at national conferences. As baselines, we also construct the knowledge base using Decision Tree and Dejavu (local interpreter part), following a procedure similar to that used for SLIM. Table 5 compares our interpreter, expert knowledge and the other baseline methods for CloudRCA and MicroCBR on dataset A. Compared with expert knowledge, our interpreter finishes the root cause analysis automatically while losing only 5.1%-7.8% precision. Compared with the other baseline methods, our interpreter's knowledge base improves precision by 7.8%-19.4%. Table 5 also compares the time consumption of all methods on dataset A. In the experiment, the experts require 5 hours to construct the knowledge base, whereas our interpreter reduces the required time by 82% with nearly the same precision.

LIMITATION AND FUTURE WORK
Due to the complexity of the F1 score in multi-class settings, our model is optimized only for binary classification in order to trade off computational cost against performance. This design choice means our model is not strictly end-to-end, which may reduce performance in the final fault localization.
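The complexity mentioned above comes from the fact that a multi-class F1 score aggregates several per-class binary F1 scores rather than reducing to a single confusion matrix. A sketch of macro-F1, one common multi-class aggregation, computed one-vs-rest from per-class confusion counts (the counts below are hypothetical):

```python
def binary_f1(tp, fp, fn):
    """Binary F1 from confusion counts: 2*P*R / (P + R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(per_class_counts):
    """Macro-F1: unweighted mean of one-vs-rest binary F1 scores,
    so a minority fault type weighs as much as a majority one."""
    scores = [binary_f1(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)

# (tp, fp, fn) per fault type; the minority class drags macro-F1 down.
counts = [(90, 5, 10),   # a majority fault type
          (2, 1, 8)]     # a minority fault type
print(macro_f1(counts))
```

Because each class contributes its own non-decomposable ratio, jointly optimizing this objective over all classes is substantially harder than the binary case our model currently handles.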
In future work, we aim to propose a multi-class F1-score optimization method for learning rule sets, achieving a fully end-to-end model and addressing the performance trade-off of the current implementation.

CONCLUSION
In this paper, we propose SLIM, an interpretable, effective and fast fault localization algorithm that directly optimizes the F1 score and is particularly applicable to highly imbalanced classification. Our experimental results demonstrate the superior performance and interpretability of SLIM in comparison with existing fault localization methods. In addition, the good adaptability of SLIM makes it an ideal tool for handling large-scale microservice systems in many real-world scenarios involving frequent service changes.

Figure 1: Log Extraction Module: Log Parsing, Matching and Analyzing.

2.2.1 Notations and Preliminaries. Given a dataset X = {(x_i, y_i)}_{i=1}^{n}, x_i ∈ {0, 1}^d is the binary feature vector obtained from feature binarization and y_i ∈ {0, 1} is the true label indicating the belongingness of a given fault type. Here, d is the size of the feature index set Γ. A sample (x_i, y_i) is positive if y_i = 1 and negative if y_i = 0. We say the i-th sample is covered by feature j if x_{i,j} = 1. For example, given j ∈ Γ, if the j-th feature is the predicate that a metric exceeds 200, then x_{i,j} = 1 means the metric of the i-th sample exceeds 200 and x_{i,j} = 0 means it does not. Denote h_i ∈ {0, 1} as the prediction of the i-th sample. For a certain fault type, we aim to predict y from the dataset X using an interpretable rule set.
Rule and rule set. Our classifier is a rule set that consists of rules. We now give the definitions of these two ingredients and the related notations.
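The feature binarization and rule-set prediction described above can be sketched as follows (the metric names and threshold values are illustrative, not from the paper): each rule is a conjunction of binary features, and a sample is predicted positive if any rule in the set covers it.

```python
# Threshold predicates that binarize raw metrics into 0/1 features
# (playing the role of the feature index set Γ; values are illustrative).
thresholds = {"latency_ms": 200, "cpu_pct": 80}

def binarize(sample):
    """x_{i,j} = 1 iff the j-th threshold predicate holds for the sample."""
    return {metric: int(sample[metric] > t) for metric, t in thresholds.items()}

def rule_set_predict(sample, rule_set):
    """h_i = 1 if any rule (a conjunction of binary features) covers the sample."""
    x = binarize(sample)
    return int(any(all(x[f] for f in rule) for rule in rule_set))

# A rule set with one rule: latency_ms > 200 AND cpu_pct > 80.
rules = [("latency_ms", "cpu_pct")]
print(rule_set_predict({"latency_ms": 250, "cpu_pct": 90}, rules))  # 1: covered
print(rule_set_predict({"latency_ms": 250, "cpu_pct": 50}, rules))  # 0: cpu predicate fails
```

The prediction is directly readable as "which thresholds were crossed", which is what makes the learned rule set easy for engineers to verify against the faulty service.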

Figure 2: Example of the proposed rule selection strategy.

Figure 5: The Overhead of all Algorithms on Benchmark Datasets.

Table 2: Accuracy comparison of different root cause localization algorithms.

Table 3: Accuracy comparison of different NumThresh values.

Table 4: Comparison of different root cause localization algorithms on the real-world system.

Table 5: Performance comparison of the generator and the experts (Pre: Precision; TC: Time Consumption).