Dynamic Alert Suppression Policy for Noise Reduction in AIOps

As IT environments evolve in both size and complexity, observability tools are needed to monitor their health. As anomalous events are detected, alerts are generated, leading to alert notifications to Site Reliability Engineers (SREs). However, most of these notifications turn out to be false alarms, leading to alert fatigue and inefficiencies. Existing approaches for reducing alert noise rely on static policies that can quickly become outdated in dynamic IT environments and are therefore difficult to maintain. In this work, we propose a novel unsupervised approach, Dynamic-X-Y, guided by the well-known moving average envelope statistical method, to learn custom-tailored alert suppression policies from historical alert and event data. At run-time, these learned policies are applied to incoming events/alerts to reduce false alert notifications. We validate our approach on two different datasets, log anomaly and metric anomaly events/alerts, showing accuracy improvements of 7.39% and 35.7%, respectively, over state-of-the-art methods.


INTRODUCTION
Cloud computing's affordability comes with the complexity of loosely coupled micro-services, posing challenges for SREs in managing availability, performance, and capacity planning amidst vast amounts of machine-generated data [4,13,21]. Artificial Intelligence for IT Operations (AIOps) provides an AI-infused platform that empowers SREs with collaborative development and operations practices in an open, hybrid cloud environment [4] to manage applications/services running on the cloud. An AIOps platform mines voluminous amounts of information from disparate data sources such as logs, metrics, traces, etc. to identify events [23] (log anomalies, metric anomalies), which are correlated and grouped together based on predefined or machine-learned patterns [11], followed by fault localization [1], and uses these to find similar historical incidents for action recommendation [16]. Events are de-duplicated and correlated by event correlation algorithms [11]. These algorithms, encompassing temporal, topological, and scope-based approaches, are employed to group related events into alerts. Subsequently, each generated alert triggers a notification to an SRE, serving as an alerting mechanism to capture their attention and enable timely resolution of the underlying issue.
It has been widely observed that some alerts are short-lived in nature [3,8,28], i.e., events associated with an alert may resolve automatically on their own. Hence, some of these alert notifications may not be worthy of SREs' attention, leading to false alarms. In SRE parlance, alert notifications that are false alarms are referred to as noise. As a result, SREs may ignore initial alert notifications to avoid frequent false alert notifications. However, this practice may lead to overlooking crucial alert notifications and have serious consequences. It is conceivable that the deluge of false alert notifications could impede SREs' workflow, leading to decreased productivity. Prior attempts [30,38] have focused on manually defining static policies applicable across all metrics or services. These policies act upon incoming events and keep the alert notifications suppressed until X events are observed over a duration Y. In this paper, the policy consisting of X and Y is called an Alert Suppression Policy (ASP). Below we share a few challenges with manually defined static ASPs:
• Dependency on Know-How: Manual policy creation is limited by SREs' experience and knowledge acquired through system observation. Such ASPs may therefore fail to consider all relevant dimensions during formulation, resulting in inadequate coverage and an increase in noise or false positive alert notifications. Moreover, one policy may not fit all metrics and services, because each metric or service may generate events with a different frequency and velocity.
• Dynamically changing environments: Manually defined ASPs may fail to adapt to the dynamic nature of the IT environment. For example, the same application stack running in a client's environment may exhibit different behavior on different days due to changing workload patterns. Continuous updates are required to keep such policies aligned with the evolving system.
• Non-Scalable & High Maintenance: It is impractical to manually create multiple ASPs for a large volume of metric names and service names. Statistical approaches for data analytics are well suited to analyzing historical data and dynamic environments. Thus, it is imperative to leverage them for the generation, updating, and maintenance of ASPs, thereby relieving SREs from the burden of manual creation, activation, and deactivation of policies.
To address the aforementioned challenges, this work proposes a novel, scalable, and unsupervised method called Dynamic-X-Y for learning a tailored ASP for each micro-service or metric. The method includes a Persistent Region Detection (PRD) module that utilizes historical alerts and their constituent event counts to identify peaks (high event counts deviating from normal behavior) forming dense regions, which are then used to compute the X and Y values. These X, Y values are used to define ASPs, which are employed at runtime to suppress alerts that do not warrant an SRE's attention, thereby reducing noise. The PRD module is based on the widely used statistical approach known as the moving average envelope [2,15], which has been applied in various domains, including finance [33,34], economics [35], and medicine [26,27]. The key contributions of this paper are: (a) Propose a method that learns tailored ASPs, one for each metric or service, based on its characteristics such as velocity, peak, and density. In the historical event and alert data, it identifies dense persistent regions where events occurred in large quantities and uses them to compute the values of X and Y for defining the dynamic ASP. (b) Evaluate the proposed method on different data modalities, including log anomaly events/alerts and metric anomaly events/alerts. We compare the proposed method with the baselines Static-X-Y ASP and No-Suppression (the default approach) to show its superior performance. An important observation worth pointing out is that the dynamic ASPs learned through the proposed unsupervised method perform equivalently to the ASPs learned if labeled data were available, i.e., in a supervised setting. (c) We perform a deeper analysis of the proposed method across several dimensions such as width-cutoff, window size, and density to provide key insights about how these features affect the quality of persistent regions. (d) We show the application of dynamic ASPs to demonstrate their usefulness in a real industrial setting.

METHOD
Before describing the method for learning ASPs, we first define the required AIOps preliminaries for ease of understanding.

Preliminaries
The cloud is designed to host multiple applications, each consisting of one or more micro-services that consume resources like CPU, memory, and network bandwidth. An application generates logs, which are an essential piece of information required for issue diagnosis. Additionally, the metrics associated with a resource are another essential piece of information that is very helpful for issue diagnosis. Metric data associated with a resource is represented as a time series of numerical values. A metric continuously emits values to communicate the state of the associated resource. For instance, an application running on the cloud may have two resources, host-server1 and host-server2, with host-server1 associated with two metrics, cpu.utilization and memory.utilization, while host-server2 is associated with three metrics, cpu.utilization, memory.utilization, and request.time. This example produces two metric time series for cpu.utilization, two for memory.utilization, and one for request.time.
To formalize, we define a set of n resources as R = {r_1, r_2, ..., r_n} and a set of k items, each of which can be either a metric or a service, as M = {m_1, m_2, ..., m_k}. Since each resource is associated with one or more items in M, the set of item-resource pairs can be partitioned into equivalence classes guided by the item, [m_j] = R_j, where R_j ⊂ R. That is, each resource r_i ∈ R_j in an equivalence class [m_j] is associated with the item name m_j. Each item-resource pair (m_j, r_i) is a time series of real values of length T, represented as a real-valued vector. Also, each (m_j, r_i) has an associated time series of anomaly events (Ω_{m_j, r_i}) of the same length T, where Ω(t) ∈ {0, 1}; 1 represents an anomaly event and 0 represents no anomaly. For this work, any state-of-the-art anomaly detection algorithm [10] can be applied to the (m_j, r_i) time series to generate the anomaly event time series Ω_{m_j, r_i}. For the sake of brevity, the phrase anomaly event will be referred to as event, and the event time series Ω_{m_j, r_i} is referred to as Ω. The proposed approach, known as Dynamic-X-Y for alert suppression, comprises the following three modules (a minimal illustration of this data model is sketched after the list):
(1) Persistent Regions Detection (PRD): Identify regions with a high frequency of events that are dense in the input historical event time series Ω.
(2) X-Y Computation: From the detected persistent regions, dynamically learn the number of events (X) and duration (Y) to automatically define a tailored ASP per item.
(3) Inference: At run-time, apply the ASPs to suppress alerts that are not persistent and do not require SREs' attention, hence reducing the noise.
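To make these preliminaries concrete, the following minimal Python sketch (with purely illustrative names and sizes, not drawn from the datasets used in this paper) shows one way the item-resource pairs and their event time series Ω could be represented:

```python
import numpy as np

# Illustrative only: each (item, resource) pair carries a binary event
# time series Omega of length T, where 1 marks an anomaly event.
T = 288  # e.g., one day of 5-minute samples (assumed granularity)

omega = {
    ("cpu.utilization", "host-server1"): np.zeros(T, dtype=int),
    ("cpu.utilization", "host-server2"): np.zeros(T, dtype=int),
    ("memory.utilization", "host-server1"): np.zeros(T, dtype=int),
}

def equivalence_class(item, pairs):
    """Resources forming the equivalence class [m_j] of an item m_j."""
    return [r for (m, r) in pairs if m == item]

print(equivalence_class("cpu.utilization", omega.keys()))
# -> ['host-server1', 'host-server2']
```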
Figure 1: Three charts: blue for raw metrics, red for anomalous events, and green for persistent regions detected.

Persistent Regions Detection
A region in Ω is identified as persistent if it contains a statistically significant number of events that are contiguous and dense over an extended period of time. Because each item (a service or a metric) behaves differently and has its own characteristics, i.e., is dynamic in nature, it may be the case that for one item a persistent region consists of 10 events in 50 minutes, while for another item a persistent region consists of 5 events in 50 minutes. The goal of PRD is to statistically learn one or more persistent regions for an item across all its resources. Each persistent region is characterized by four values: Start, End, EventFrequency, and Duration.
To understand this through an example, refer to Fig. 1, where the x-axis represents the time index. The first chart, in blue, illustrates the raw metric values emitted from a resource. The second chart, in red, shows the metric event time series (Ω) detected by an anomaly detection algorithm for the raw metric values in the chart above it. The input to the PRD module is an event time series (Ω), and the output is a list of persistent regions consisting of one or more metric anomalous events, shown in the third chart (in green). For example, Figure 1 illustrates that while there were multiple metric anomalous events, the PRD module identified a single persistent region. Figure 2 demonstrates the functioning of the PRD module on a dummy event time series, comprising three steps: • Peak Detection: identifies high-frequency event windows in the time series Ω. • Dense Peak Detection: identifies candidate persistent regions that consist of dense and contiguous peaks. • Merging and Selection: identifies the final persistent regions by merging the candidates from the previous step.

Peak Detection.
The objective is to identify time windows in the event time series Ω that deviate significantly from its moving average. We use a sliding window (W_1) to capture the time-varying frequency of events in Ω. While traversing the time series Ω using the sliding window (W_1) of window size W_1 and window stride S_1 (both measured in minutes), we accumulate the count of events within each sliding window, represented as H, refer to Equation 1.
To ensure that the size of H is the same as Ω, suitable-length padding is used. To gain a better understanding of the application of W_1 (orange window), we direct the reader's attention to Figure 2, which vividly illustrates how W_1 is employed to aggregate the occurrence frequency of events within the time series Ω. Overall, the vector H captures the time-varying frequency of events in the time series Ω, with each entry representing the number of events observed within a fixed time window of duration W_1 with a window stride of S_1, i.e., H(i) ∈ {0, 1, ..., W_1}. In this paper, the values W_1 = 20 and S_1 = 5 (both in minutes) are used to calculate the events' frequency. In a cloud environment, an application failure triggers a surge of events, leading to higher event frequencies, which are captured in H. The higher frequencies in H may deviate significantly from the mean frequency observed during normal system behavior. To identify higher frequencies in H that deviate from their mean, we use Bollinger Bands [19], a widely used moving average envelope method for analysis in the financial domain. Bollinger Bands consist of three lines: a moving average Bollinger Band (MABB) (typically a 20-period simple moving average), an upper Bollinger Band (UBB) (usually 2 standard deviations above the moving average), and a lower Bollinger Band (LBB) (usually 2 standard deviations below the moving average). These bands provide a measure of the volatility and fluctuation of an underlying time series. In this work, the UBB (Equation 2) identifies high-frequency peaks in H that deviate from their moving averages.
In Equation 2, at each index i in H, the moving average over a period of length N is represented as μ_i, the moving standard deviation is represented as σ_i, and m is the multiplier. Bollinger Bands are well-suited for detecting high-frequency patterns or peaks in H because they provide a flexible and dynamic way of capturing volatility in the event frequency over a period of time.
Equation 3 checks whether the value of H(i) exceeds the moving average by a certain number of standard deviations (determined by the multiplier m). This identifies data points in H that deviate significantly from the overall trend and are therefore likely to indicate peaks. The final output is a vector F of length T, where F(i) ∈ {0, 1}.
A value of 1 in F indicates that the event frequency within a period of length N is greater than or equal to the UBB of the corresponding time window in H, and a value of 0 in F indicates that the event frequency is below the UBB of the corresponding time window in H. This output helps identify time windows where event frequency peaked, exceeding the expected variability of the event time series.
For an exemplary demonstration of the transformation of H into F, see Figure 2. The values N = 40 and m = 3.5 were empirically determined for detecting high-frequency peaks in H.
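A minimal sketch of the peak-detection step (Equations 1-3) under the description above; the helper names and the padding strategy are our assumptions, not the authors' implementation, and the trailing-window Bollinger computation is one common variant:

```python
import numpy as np

def event_frequency(omega, w1=20, s1=5):
    """Eq. 1 (sketch): count events in sliding windows of size w1 minutes
    with stride s1 minutes; the result is upsampled/padded to len(omega)."""
    counts = [int(omega[i:i + w1].sum()) for i in range(0, len(omega), s1)]
    return np.repeat(counts, s1)[:len(omega)]

def detect_peaks(H, n=40, m=3.5):
    """Eqs. 2-3 (sketch): flag indices of H at or above the upper Bollinger
    band computed over a trailing period of length n with multiplier m."""
    F = np.zeros(len(H), dtype=int)
    for i in range(len(H)):
        window = H[max(0, i - n + 1): i + 1]
        mu, sigma = window.mean(), window.std()
        if H[i] > 0 and H[i] >= mu + m * sigma:   # UBB condition
            F[i] = 1
    return F
```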

Dense Peak Detection.
Bollinger Bands identify high-frequency peaks that deviate from their moving averages; next, we want to verify that the peaks detected in vector F are also contiguous. By traversing the time series F using a sliding window (W_2) of size W_2 and stride S_2, we accumulate the peak signals present within each sliding window, represented as D, refer to Equation 4.
If the accumulated count of peak signals for an index i in D is greater than the threshold τ, then it is a dense region indicated by a value of 1; otherwise the value is 0, indicating a non-dense region, refer to Equation 5.
These dense peak signals are candidate persistent regions, represented as a vector K, where K(i) ∈ {0, 1}. The vector K captures the dense peaks present in vector F while filtering out isolated transient events. To gain a better understanding of the application of W_2 (red window) in detecting dense peaks within F, see Figure 2. The values W_2 = 3, S_2 = 1, and τ = 2 were empirically determined for the purpose of detecting dense peaks in F.
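A corresponding sketch of the dense-peak step (Equations 4-5), again with assumed helper names; boundary handling and padding are omitted for brevity:

```python
import numpy as np

def dense_peaks(F, w2=3, s2=1, tau=2):
    """Eq. 4 (sketch): accumulate peak signals of F in sliding windows of
    size w2 with stride s2; Eq. 5: threshold the counts with tau to mark
    candidate persistent regions K (1 = dense, 0 = non-dense)."""
    D = np.array([F[i:i + w2].sum() for i in range(0, len(F), s2)])
    return (D > tau).astype(int)
```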

Merging and Selection.
To ensure accurate identification of persistent regions, a two-step process is performed. Step 1 involves merging candidate persistent regions in K that are in the neighboring vicinity of each other, and Step 2 involves duration-based selection of persistent regions. Before performing the two steps, the start and end indexes of each candidate persistent region in K are mapped to the corresponding start and end timestamps of events in Ω. This mapping is achieved by backtracking the operations of the two sliding windows, W_1 and W_2. The result is a list of candidate persistent regions P, where each element has the structure (t_s, t_e, c, d): t_s and t_e represent the start and end timestamps, and c is the event count within the duration d of the candidate persistent region. Each candidate persistent region in P is merged with its neighbouring candidate persistent regions guided by the following heuristic: regions p and q are considered neighbors if they are within δ minutes of each other. This step examines whether a persistent region is surrounded by other persistent regions in a nearby neighborhood and merges such neighboring regions into a single persistent region, creating a combined longer persistent region instead of several small and isolated persistent regions. In the second step, persistent regions are selected: if the duration of a merged candidate persistent region is longer than γ (width-cutoff) minutes, it is classified as a persistent region. This step helps filter out false positive candidate persistent regions from P. After merging and selecting candidate persistent regions, the final set of persistent regions in P is used to compute the number of events (X) and the duration (Y) over which the events must be observed to unsuppress them. The parameters δ and γ were both set to 15 minutes.
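A sketch of the merge-and-select heuristic, assuming each candidate region has already been backtracked to a (start, end, count, duration) tuple with start/end expressed in minutes; the tuple layout and function name are our assumptions:

```python
def merge_and_select(candidates, delta=15, gamma=15):
    """Merge candidate regions whose gaps are within delta minutes, then keep
    only merged regions whose duration exceeds gamma minutes (width-cutoff).
    candidates: iterable of (start, end, count, duration) tuples, in minutes."""
    merged = []
    for start, end, count, _ in sorted(candidates):
        if merged and start - merged[-1][1] <= delta:      # neighbouring region
            ps, pe, pc, _ = merged[-1]
            new_end = max(pe, end)
            merged[-1] = (ps, new_end, pc + count, new_end - ps)
        else:
            merged.append((start, end, count, end - start))
    return [r for r in merged if r[3] > gamma]
```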

X-Y Computation
The main objective of this module is to learn a value of X and Y for each equivalence class [m_j] = R_j, ∀ m_j ∈ M and R_j ⊂ R. The X and Y values for each item m_j are used to define an ASP that is applied at run-time for alert suppression. The X and Y values are central representatives of the list of persistent regions in P_{m_j} for an item m_j, defined as follows in Equation 6.
where n_j is the number of resources in R_j and α is the minimum number of detected persistent regions required for each item-resource pair in [m_j] = R_j. That is, we concatenate the lists of persistent regions across all n_j resources, where the number of persistent regions in the list P_{m_j, r_i} associated with each resource r_i ∈ R_j is greater than α. For an item m_j, computing a central tendency value of X and Y from P_{m_j} is not a straightforward task because the persistent regions in P_{m_j} are not necessarily all of the same Duration. Hence, directly computing a mean or median of the EventFrequency values present in P_{m_j} to obtain a central value of X may not be accurate. To address this, we first find the quantum persistent region Q in P_{m_j}, which is defined as follows in Equation 7,
where Q is the persistent region in P_{m_j} with the smallest duration.
Based on the duration in Q, the EventFrequency of each persistent region in P_{m_j} is scaled down by the same ratio and stored in a vector X, as shown in Equation 8. The final values of X and Y are computed as Median(X) and Q.Duration, respectively. The X and Y values for each item m_j are used to define the ASP, which is applied at run-time to suppress alerts, as explained below.
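A sketch of the X-Y computation (Equations 6-8) under the same assumed (start, end, count, duration) tuple layout; it pools the persistent regions of one item across its resources and applies the quantum-region scaling described above:

```python
import statistics

def compute_x_y(regions, alpha=1):
    """regions: persistent regions pooled across all resources of one item.
    Returns (X, Y), or None when fewer than alpha regions exist (no ASP)."""
    if len(regions) < alpha:
        return None
    q = min(regions, key=lambda r: r[3])              # quantum region Q (Eq. 7)
    y = q[3]                                          # Y = Q.Duration
    scaled = [r[2] * y / r[3] for r in regions]       # scale counts to Y (Eq. 8)
    return statistics.median(scaled), y               # X = median of scaled counts
```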

Inference
The ultimate goal of this work is to accurately determine when to suppress or unsuppress alerts using the learned ASPs. At runtime, the observability tools continuously observe and collect the log data coming from each service and the metric values coming from each metric. The anomaly detection algorithm acts upon these values to detect anomalous events, which are then de-duplicated and grouped together into alerts. As new events arrive, if they are de-duplicated into an existing alert, the event count of that alert is updated until the issue is resolved. If the event count of an alert at any point in time exceeds the learned X value within the Y duration defined by the corresponding ASP, the alert is unsuppressed, leading to an alert notification to the SRE. Conversely, if the aforementioned condition is not met, the alert remains suppressed. The application of ASPs minimizes unnecessary noise, thereby enabling timely and accurate responses to critical alerts and facilitating the creation of actionable alert notifications for SREs.
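As a rough illustration of the run-time behaviour described above, the sketch below keeps a sliding window of event timestamps per alert and unsuppresses the alert once X events fall within the trailing Y minutes; the class name and interface are ours, not the authors' tool:

```python
from collections import deque

class AlertSuppressor:
    """Apply one learned ASP (X events within Y minutes) to one alert."""

    def __init__(self, x, y_minutes):
        self.x, self.y = x, y_minutes
        self.event_times = deque()      # timestamps (minutes) of recent events

    def on_event(self, t_minutes):
        """Record a de-duplicated incoming event; True means unsuppress."""
        self.event_times.append(t_minutes)
        while self.event_times and t_minutes - self.event_times[0] > self.y:
            self.event_times.popleft()  # forget events older than Y minutes
        return len(self.event_times) >= self.x

# Example with the 'MemoryUsed' policy reported later in the paper (X = 11, Y = 55 min):
asp = AlertSuppressor(x=11, y_minutes=55)
```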

EXPERIMENTAL SETUP
This section covers various aspects of the datasets used in the study, including their characteristics, the creation of annotated test datasets, the baseline methods, and the evaluation metrics.

Datasets
In order to validate the quality of persistent region detection across different modalities of data, i.e., logs and metrics, we collected log data for a conversation application deployed on the cloud, and metric data for a set of virtual machines (VMs) hosting hundreds of applications deployed and provisioned on the cloud. The metric dataset was prepared from 14 days of raw metric data, consisting of 52 metrics across 1836 resources. The total number of active metric-resource combinations was 3576. The metric anomaly detection algorithm detected 56,442 anomalous events. In contrast, two years' worth of log data was collected from a conversation application that consists of 15 micro-services, continuously generating log data as requests are made. The log anomaly detection algorithm identified 137,269 log anomaly events. Not only is the duration of the log data longer, it also contains far more anomalous events than the metric data. An interesting characteristic of the log data that we would like to bring to the reader's attention is the inter-arrival frequency of events, i.e., their sporadic nature. For the metric data, the average duration of contiguous metric anomaly events is approximately 53 minutes (a little less than an hour), containing 11 events. For the log data, the average duration of contiguous log anomaly events is approximately 87 hours, containing 2691 events. This suggests that the log events are less sporadic in nature compared to the metric events; whenever log events happen, they happen for a longer duration of time, whereas metric events happen for shorter durations.

Test Dataset Preparation
In AIOps, alerts are unsuppressed by default, leading to an alert notification. As mentioned in the introduction section, SREs often complain that some of the alert notifications are false, and they unnecessarily spend considerable time sifting through numerous events to determine the underlying problem. Subsequently, false alert notifications may lead to a lack of trust in the system. The effectiveness of an alert suppression system correlates directly with its efficacy at detecting persistent and non-persistent regions. Note that events in a persistent region will be unsuppressed, leading to an alert notification, whereas events in a non-persistent region will be suppressed, and hence no alert notification is generated. Therefore, to compare the baselines (outlined in subsection 3.3) with our approach and to ensure accuracy and reliability, we need ground truth data of persistent and non-persistent regions. In order to generate accurate ground truth data, an expert in the field assigned labels to regions where anomalous events were observed, keeping in mind the persistence characteristics, velocity, and density. Only those time series that contain at least one anomalous event were identified for human labeling, resulting in 38 and 12 time series for the metric and log data, respectively. The plot in Figure 3 shows 5 regions for an event time series. To receive ground truth labels from the annotator, we prepared a corresponding table (refer to Table 1) comprising the region number, event count per region, and start and end time of the region. The last column in Table 1 shows the human annotations, where a TRUE value indicates a persistent region and a FALSE value indicates a non-persistent region. Table 2 presents details and relevant statistics of the labeled test datasets for both metrics and logs. Note that the mere presence of anomalous events in a time series does not guarantee persistent behavior, i.e., there could be an event time series that exhibits sporadic behavior. In such a scenario, no persistent regions are observed, and therefore no ASP is generated. For an item (either metric or micro-service) where no policy is generated because persistent regions were not observed in the historical data, this indicates that the particular item was never affected when faults happened in the system. For this reason, our algorithm learns 16 ASPs, one for each of 16 metrics out of the total 52 metrics in the metric dataset, and similarly learns 12 ASPs, one for each of 12 micro-services out of the total 15 micro-services in the log dataset. For example, for the metric 'MemoryUsed', the Dynamic-X-Y method learned X = 11 and Y = 55 minutes. Similarly, in the log dataset, for the 'NLU' service, the proposed method learned X = 4512 and Y = 12.5 hours. These results align well with the less sporadic nature of the log data mentioned in Section 3.1.

Baseline Methods and Proposed Method
To evaluate the performance of our proposed method for alert suppression, we compare it with two baseline methods. (1) No-Suppression: This method creates an alert notification whenever an anomalous event is detected. That is, any region in an event time series Ω having at least one anomalous event is a persistent region. This is equivalent to X = 1 in an X-Y-Policy, which is also the default behavior. (2) Static-X-Y: This method uses predefined static X and Y values, suggested by a domain expert, for detecting persistent regions. While one could determine the best X and Y values for a dataset using ground truth (if available) and a brute-force search, this is usually not feasible due to the unavailability of ground truth data. Domain experts are consulted instead, but their suggested values may not suffice for dynamically changing scenarios (discussed earlier). In this setting, if 6 or more events are reported in 30 minutes, the corresponding region is persistent and the alert is unsuppressed; the static X and Y values are therefore 6 and 30, respectively. (3) Dynamic-X-Y: Our proposed method dynamically learns X and Y values from historical data for each item in an unsupervised setting, i.e., no ground truth is needed to learn the values of X and Y. One clear benefit is that it observes and recommends a tailored X-Y-Policy for each item, unlike the aforementioned baseline methods that apply a static X-Y-Policy across all metrics or services.

Evaluation Metrics
During evaluation, each of the aforementioned methods detects persistent regions in event time series to unsuppress alerts. To evaluate the accuracy of the methods used for detecting persistent and non-persistent regions, the standard counts True-Positive, False-Positive, True-Negative, and False-Negative are utilized to calculate the F1 score [31] and Accuracy [31]. A True-Positive occurs when a method detects a ground-truth persistent region as a persistent region, a False-Positive when a method detects a ground-truth non-persistent region as a persistent region, a True-Negative when a method detects a ground-truth non-persistent region as a non-persistent region, and a False-Negative when a method detects a ground-truth persistent region as a non-persistent region.
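For reference, these counts feed the standard definitions [31]: Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall), and Accuracy = (TP + TN) / (TP + TN + FP + FN).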

EXPERIMENTS
We address the following questions through our experiments: (1) How does the performance of the proposed method compare to the baseline methods? (2) How does the proposed unsupervised method Dynamic-X-Y compare to the X and Y values computed under a supervised setting? (3) What are the effects of the hyper-parameters on the performance of the proposed method?

Performance Analysis
Table 3 compares the performance of our proposed method with the baselines. The results show that our method Dynamic-X-Y outperforms both baselines by a significant margin. On the metric dataset, the percentage improvement in accuracy of persistent region detection is 45.8% (64.41 to 93.93) and 7.4% (87.46 to 93.93) compared to the No-Suppression and Static-X-Y methods, respectively. Similarly, on the log dataset, the percentage improvement in accuracy is 37.5% (50.23 to 69.09) and 35.7% (50.90 to 69.09) compared to the same methods. The reason both baseline methods exhibit lower F1 and Accuracy scores than the Dynamic-X-Y approach on the two test datasets is that they are designed with greater sensitivity for detecting persistent regions. Consequently, these methods prioritize a wider coverage of alert notifications, including both persistent and non-persistent regions, with the trade-off of potentially generating a higher number of false positive persistent regions. For example, the No-Suppression method identifies a region as persistent if at least one anomalous event is observed, thus detecting numerous false positive persistent regions and hence resulting in lower F1 and Accuracy scores. It generates alerts for nearly every detected region, leading to an excessive number of noisy alerts that can hinder the SREs' ability to focus on alerts that actually require their attention, ultimately resulting in inefficiencies. The Static-X-Y method, on the other hand, is more stringent than the No-Suppression method, calling a region persistent only when it observes at least X = 6 anomalous events within a Y = 30-minute interval. As a result, Static-X-Y detects fewer persistent regions, also leading to a reduction in the number of false positive persistent regions. This improvement is evident in both F1 and Accuracy scores when compared to the No-Suppression method on the metric dataset. It is worth noting that in the case of the log dataset, the performance of the Static-X-Y method is very similar to that of the No-Suppression method. This can be attributed to the fact that the average number of events in each persistent region in the log dataset is 2691, which is greater than the X = 6 of the Static-X-Y method and also greater than the X = 1 of the No-Suppression method. Consequently, in the log dataset, the Static-X-Y method detects a similar number of persistent regions as the No-Suppression method. This suggests that the Static-X-Y method's parameters, X and Y, are not domain agnostic, and its efficacy may vary with new data from different domains having different characteristics in terms of velocity and density of anomalous events. In contrast, the proposed Dynamic-X-Y method adapts by learning optimal X and Y values for each dataset. For example, in the log dataset, the average X value learned across all micro-services is 2376. This value is comparable to the average number of events (2691) in each persistent region within the log dataset. This adaptation seeks to strike a balance between detecting true positive persistent regions and minimizing false positive persistent regions, as reflected in the state-of-the-art F1 and Accuracy scores presented in Table 3.

Unsupervised vs Supervised setting
For this subsection and the following subsection, we use the metric data for our study. Because events in metric data exhibit sporadic behavior, greater benefits are expected when tailored alert suppression policies are applied. In a scenario with labeled data available, we perform a brute-force search to find the optimal values of X and Y that maximize Accuracy in detecting persistent regions. We then compare the results of this search to our unsupervised approach, which does not require labeled data. Comparing the accuracy of the optimal values to the Dynamic-X-Y method assesses its capability and robustness. The experiment utilizes the metric dataset mentioned in Section 3. Since events arrive every 5 minutes, the arrival of X events takes a minimum of X × 5 minutes. Thus, for X = k, the potential values of Y fall within the range {k × 5, (k + 1) × 5, ..., 11 × 5, 12 × 5}, where 2 ≤ k ≤ 12. For instance, if X = 11, then the possible range of Y would be {55, 60}. This results in a total of 66 X-Y pairs. Figure 5 shows a three-dimensional plot of these 66 combinations, where the x-axis represents the number of events (X), the y-axis represents the time interval (Y) in steps of 5 minutes, and the z-axis represents the accuracy score for each X-Y pair. The accuracy keeps increasing as X increases, reaching a maximum accuracy of 94.42 at (X, Y) = {(9, 45), (9, 50), (9, 55), (9, 60)}, after which accuracy drops as X keeps increasing. This means the optimal values of X and Y using the brute-force approach are 9 and 50, respectively. The Dynamic-X-Y approach learns an optimal X and Y value for each metric name, refer to Table 4. To compare with the single optimal X and Y values computed using the brute-force search under the supervised setting, we take the mean of the X and Y values across all metric names in Table 4 learned by Dynamic-X-Y. This gives a single value of X and Y that is valid across all metric names for Dynamic-X-Y. The calculated means of the X and Y values over all metrics are 9.06 and 54.06, respectively. It is worth noting that the mean X and Y values learned by the unsupervised approach are very close to the optimal X and Y values found through the brute-force search under the supervised setting. Determining X and Y values under a supervised setting is a hypothetical approach, since labeled data is hard to obtain. The X and Y values under the supervised setting are the most idealistic values and therefore serve as an upper bound on how well a model can perform. The ability of the Dynamic-X-Y method to determine the best values of X and Y in an unsupervised setting shows the overall strength of the method and demonstrates that the Dynamic-X-Y approach can learn optimal X and Y values in the absence of labeled data.
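A sketch of the brute-force search described above; accuracy_of(x, y) is a hypothetical helper that scores a static X-Y pair against the labeled regions, and only the X-Y grid itself follows the paper's description:

```python
def brute_force_search(accuracy_of, max_events=12, step=5):
    """Enumerate the 66 (X, Y) pairs with 2 <= X <= 12 and
    X*5 <= Y <= 60 minutes, returning the highest-accuracy pair."""
    best = None
    for x in range(2, max_events + 1):
        for y in range(x * step, max_events * step + 1, step):
            score = accuracy_of(x, y)   # hypothetical ground-truth evaluation
            if best is None or score > best[0]:
                best = (score, x, y)
    return best
```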

Ablation Study
Through ablation studies, this subsection assesses the impact of three hyper-parameters on the effectiveness of the Dynamic-X-Y method: Width-Cutoff (γ), Window-Size (W_1), and Count of Persistent Regions (α). For each ablation study, the parameter under investigation is varied while all other parameters are kept constant.
Effect of Width-Cutoff (γ).
As mentioned in subsection 2.2.3, the Width-Cutoff parameter is used to select persistent regions from the pool of candidate persistent regions whose duration exceeds γ minutes. Table 5 illustrates the effect of varying γ on the performance of the Dynamic-X-Y approach. A higher value of γ implies a stricter selection of candidate persistent regions by the algorithm, while a lower value of γ has the opposite effect. A lower value of γ results in the filtering of relatively fewer candidate persistent regions, leading to the selection of more candidate persistent regions as final persistent regions. In Table 5, when γ = 10 there are 21 policies, whereas when γ = 15 there are 16 policies. With a lower γ, persistent regions are detected for a greater number of metrics, and relatively more persistent regions are detected for each metric. These regions will likely have a smaller size, resulting in smaller X values. Policies with smaller X values will lead to a larger number of irrelevant alerts, thereby defeating the objective of noise reduction. In summary, a low value of γ results in learning more policies, each with low values of X and Y, triggering false alert notifications. Exactly the opposite occurs when we increase the value of γ: fewer persistent regions are detected for a reduced set of metrics, hence fewer policies, with each persistent region being of relatively larger size with a higher value of X. Learning fewer policies with higher values of X and Y is also not desirable, because for metrics where no policies are available, the No-Suppression default method is used, which also leads to false alert notifications and hence low precision. This is the reason precision drops as the value of the width-cutoff parameter (γ) increases. These results indicate that there is an optimal value of γ that should be used for identifying persistent regions to learn the values of X and Y: both high and low values of γ lead to more false positives, which in turn reduces precision, leading to a drop in accuracy. For this dataset, the most optimal value of the Width-Cutoff parameter (γ) is 15, leading to 16 policies.
Effect of Window-Size (W_1).
The parameter W_1 is defined in the peak detection step (subsection 2.2.1) to control the length of the sliding window over which events are accumulated, resulting in the time series H. Table 6 presents the effect of varying the parameter W_1 on the performance of the Dynamic-X-Y approach. One observation is that there is a direct proportionality between the Window-Size (W_1) and the number of policies learned; in other words, the Dynamic-X-Y method learns more policies when the value of W_1 is increased. Another key observation is that while the method learns more X-Y-Policies, the corresponding X values in those policies get smaller with an increase in W_1. For instance, the metric Mem.Resident has the values X = {11, 10, 8, 8, 8} when W_1 = {10, 15, 20, 25, 30}. To gain insight into this behavior, consider the example depicted in Figure 6. Here, we observe two sliding windows of different sizes (3 and 9) traversing the same input event series Ω to generate the corresponding H. In Ω, there exist two regions containing events, where the first region contains only one event while the second region contains six events. Irrespective of the window size, the second region is always detected as a persistent region, whereas the first region is detected as a persistent region only when the window size is 9. The reason for this is that when
the window size is 3, the signal at index 2 in H fails to meet the UBB condition (1 ≥ μ + m × σ, where μ = 0.25 and σ = 0.43; for instance, with the multiplier m = 3.5 used earlier, the band is approximately 1.76). Therefore, it is not detected as a peak, so a persistent region is not detected, which is a true negative. However, when the window size is increased to 9, the event frequency increases from 1 to 2, and the signal at index 2 is detected as a peak because it satisfies the UBB condition (2 ≥ μ + m × σ), leading to the detection of the persistent region, which is a false positive. Consequently, with W_1 = 3, the X value is equal to the EventFrequency of the second region, which is detected as a persistent region. Whereas, with W_1 = 9, the X value is equal to the median of the EventFrequency of the first and second regions, which is obviously less than the X value calculated with W_1 = 3. In summary, when the window size W_1 is increased, the events in Ω tend to interfere with each other during windowing for computing the vector H. As a result, candidate persistent regions are detected with a smaller EventFrequency, thereby lowering the overall median (the final X) and leading to a larger number of policies. During inference, these policies may generate more false alert notifications, especially when the learned value of X is small. Based on the above discussion, it is evident that setting W_1 = 30 results in more policies with smaller values of X, leading to false alert notifications, and hence the lower precision and accuracy of 33.22 and 67.37, respectively (Table 6). In conclusion, the Window-Size (W_1) has a direct relationship with the number of false alert notifications, and therefore the accuracy decreases as W_1 increases.

Effect of Number of Persistent Regions (𝛼).
The parameter α is defined in subsection 2.2.3. It specifies the minimum number of persistent regions that must exist in the list of final persistent regions P_{m_j} for calculating the values of X and Y. Table 7 illustrates the effect of varying the value of α on the performance of the Dynamic-X-Y approach. For this experiment, all other hyper-parameters were kept constant while varying the value of the parameter α. As α increases, the majority of metric names with zero or one persistent region are filtered out by the threshold imposed on P_{m_j}, refer to Equation 6. Consequently, no X-Y-Policy is learned for most of these metric names. For instance, when α = 1, only 16 out of the 52 total metric names have X-Y-Policies learned, whereas when α = 12 the number of X-Y-Policies is 0. If an X-Y-Policy is not available for a metric name at run-time, we apply the No-Suppression approach, which has higher false positives, leading to lower precision. As α approaches 12, Dynamic-X-Y is not able to learn an X-Y-Policy for any metric name and the results become exactly the same as the No-Suppression baseline result, as shown in Table 7. For this dataset, when α = 1, the number of policies is 16. The value α = 0 is not meaningful in this context, since X and Y values cannot be learned if there are no persistent regions. These results suggest that, while designing an alert suppression system, the algorithm can learn X and Y values whenever at least one persistent region is detected.

INDUSTRIAL CASE STUDY
Next, we present key insights and findings from the industrial case study, derived from applying the proposed methodology to a real-world production dataset.

Learn and Apply Dynamic-X-Y Method
We present a tool that utilizes the Dynamic-X-Y method to learn and apply X-Y-Policies.

Demonstrate Efficacy of X and Y Policies
The focus of this sub-section is to demonstrate the practical application and effectiveness of the ASPs learned using the novel Dynamic-X-Y method. Specifically, we examine its efficacy in noise reduction by suppressing alerts for a specific metric called "TcpRetrans". The data used in this study spans 14 days and consists of 206 anomalous events. As illustrated in the red plot of Figure 9, these events are plotted against their respective timestamps, offering a clear visual representation of their distribution. Figure 9 shows the 206 anomalous events grouped temporally as described in Section 1, leading to 13 distinct alerts; the number adjacent to each alert represents its event count, and each peak in the red plot represents an alert. Using the No-Suppression baseline, each alert leads to an alert notification.

Figure 2: Running example to illustrate the Persistent Regions Detection module on an input event time series: bold-outlined 0's are padding blocks, bold arrows show the first two sub-modules computing K from Ω, and dotted arrows depict sliding window backtracking and candidate persistent region detection.
Figure 2 also illustrates the process by which the indices of the dense peaks present within K are backtracked to the corresponding timestamps of the candidate persistent regions residing in Ω; each element in P has the structure (t_s, t_e, c, d) described in the Merging and Selection step.

Figure 3: Time series for a metric-resource pair showing persistent and non-persistent regions.

Figure 4 shows two plots for a time series: the top plot is the manually curated ground truth, and the bottom plot is the output of the PRD module showing detected persistent (P) and non-persistent (NP) regions. The module detected two persistent regions and three non-persistent regions, whereas the ground truth indicates only one persistent region and three non-persistent regions; therefore TP = 1, FP = 1, TN = 3, and FN = 0.

Figure 4: Blue and red graphs: ground truth and prediction.

Figure 5: Performance of the 66 possible Static-X-Y policies on the test dataset.

Figure 7 displays the user interface of the training tool. By clicking the hyperlink "Train models" on the top left, the Dynamic-X-Y algorithm retrieves historical alert data from the alert database to learn the tailored ASPs. Once the policies are learned, they are sent to an automation dashboard, as shown in Figure 8. In the automation dashboard, a user can enable or disable suppression policies in real time. As an exemplar, only 3 out of the 16 policies are shown. When a policy is enabled, it acts upon an incoming alert to suppress or unsuppress it. Clicking on a policy name in the Automations Dashboard UI provides detailed information about the corresponding policy.

Figure 7: Tool UI to learn X-Y Policies using Dynamic-X-Y.

Figure 8: Automations Dashboard UI with X-Y Policies.

Table 1: Table corresponding to the regions in Figure 3 for labeling the last column as persistent (T) or not persistent (F).

Table 2: Statistics of the annotated test datasets.

Table 3: Model comparison on the test datasets.

Table 4: ASPs learned by the Dynamic-X-Y method.


Table 5: Effect of the Width-Cutoff parameter (γ) on the performance of the Dynamic-X-Y approach.

Table 6: Effect of the Window-Size parameter (W_1) on the performance of the Dynamic-X-Y approach.

Table 7: Effect of the count of persistent regions (α) on the performance of the Dynamic-X-Y approach.