STIGS: Spatio-Temporal Interference Graph Simulator for Self-Configurable Multi-Tenant Cloud Systems

The finer granularity of microservices facilitates their evolution and deployment on shared resources. However, resource concurrency creates elusive interdependencies, which can cause complex interference patterns to propagate as performance anomalies across distinct applications. Meanwhile, existing methods for Anomaly Detection (AD) and Root-Cause Analysis (RCA) are confounded by interference because they operate within single call-graphs. To bridge this gap, we develop a graph formalism, the Spatio-Temporal Interference Graph (STIG), to express interference patterns, and an artifact to simulate their dynamics. Our simulator contributes to the study and mitigation of interference patterns as a performance phenomenon that emerges from regular resource consumption anomalies.


INTRODUCTION
In the ever-evolving landscape of cloud computing, microservices have emerged as a dominant architectural style, enabling more flexible and scalable applications. This style relies on a finer granularity of functions and more radical resource sharing among different applications. However, this strategy increases overall system complexity by adding elusive interdependencies among microservices [6] from distinct applications.
Definition 1.1. Interference happens when two services that have no logical dependency (caller-callee relation) compete for the same resource (compute, memory, I/O) to the extent that they affect each other's performance (e.g., throughput, latency) [9].
Contrary to the caller-callee relations [5] in application call-graphs and abstract syntax trees, these new interference-enabling interdependencies are more elusive because their presence and direction of flow are not deterministic. Instead, interdependencies might appear and disappear according to the non-stationary patterns of the applications' usage and the work of load balancing or self-configurable service placement mechanisms. Therefore, cross-application service interference confounds the outcome of traditional microservice diagnostic methods like Anomaly Detection (AD) and Root-Cause Analysis (RCA) [4,11], as these methods rely on stable and predictable call-graph dependencies [5].
While self-configuration solutions can dynamically adapt to changes in the application usage [3], multi-tenant systems require more involved approaches [10]. For that, various interference mitigation (IM) methods have been developed, originally for virtualized cloud environments [9] and, lately, for microservices [1,7]. Nonetheless, there are still at least two obstacles that prevent existing IM methods from reducing confounding in AD and RCA approaches: (1) a limited number of covered services (four, as in [1,12]), and (2) reliance on metrics that are agnostic to the interdependencies across applications. These methods measure interference w.r.t. sensitivity (the susceptibility of a service to be influenced by other services) and contention (the service consumption demand on a resource, e.g., CPU) between service pairs, but they are oblivious to the many-to-many nature of interference.
In contrast, our approach overcomes these limitations by formulating the interference phenomenon as a spatio-temporal graph. Our corresponding simulation helps mitigate the probability and impact of the interference phenomenon by de-confounding the diagnostics from the AD, RCA, and IM methods, hence rendering these methods more effective for complex multi-tenant cloud systems [1,12]. We contribute (1) a formalism to capture interference patterns as spatio-temporal graphs (STIG), (2) a simulator called STIGS (Figure 2) for generating interference patterns, and (3) a practical evaluation with three popular microservice benchmarks (Bookinfo, TeaStore, and SockShop).
Definition 1.2. A Spatio-Temporal Interference Graph (STIG) is denoted as $G = (V, E, X_V(t), X_E(t))$, where $V$ are nodes representing services, $E$ are directed edges representing interference between services across applications, $X_V(t)$ are the time-varying node features (e.g., resource per service), and $X_E(t)$ are the time-varying edge features (e.g., interference probability).
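To make Definition 1.2 concrete, the following minimal Python sketch stores a STIG as a set of service nodes, a set of directed interference edges, and time-indexed node and edge features. The class name `STIG`, its methods, the service labels, and the numeric values are illustrative assumptions and are not taken from the STIGS artifact.

```python
from dataclasses import dataclass, field


@dataclass
class STIG:
    """Hypothetical container: nodes, directed edges, time-varying features."""
    nodes: set = field(default_factory=set)            # services
    edges: set = field(default_factory=set)            # (source, target) pairs
    node_features: dict = field(default_factory=dict)  # {(t, service): usage}
    edge_features: dict = field(default_factory=dict)  # {(t, source, target): prob}

    def set_usage(self, t, service, cpu, memory_gb):
        """Record the resource usage of a service at time step t."""
        self.nodes.add(service)
        self.node_features[(t, service)] = {"cpu": cpu, "memory_gb": memory_gb}

    def add_interference(self, t, source, target, probability):
        """Record that `source` may interfere with `target` at time step t."""
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must lie in [0, 1]")
        self.nodes.update((source, target))
        self.edges.add((source, target))
        self.edge_features[(t, source, target)] = probability

    def snapshot(self, t):
        """Return the interference edges and probabilities at time step t."""
        return {(s, d): p for (ts, s, d), p in self.edge_features.items() if ts == t}


# Made-up example values, for illustration only.
g = STIG()
g.set_usage(0, "shop1:front-end", cpu=0.6, memory_gb=5.0)
g.set_usage(0, "shop2:front-end", cpu=0.9, memory_gb=10.0)
g.add_interference(0, "shop2:front-end", "shop1:front-end", 1.0)
g.add_interference(1, "shop2:front-end", "shop1:front-end", 0.4)
print(g.snapshot(1))
```

Keying the features by time step keeps the edge set stable while the probabilities vary, which mirrors the separation between structure and time-varying features in the definition.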

INTERFERENCE ANOMALY SCENARIO
As an example, assume three e-commerce applications comprising 14 microservices (shown in Figure 1) deployed on the same server (either host1 or host2), each with a 4-core CPU and 10 GB of memory. A sudden 100% surge in users during a flash-sale event could increase the resource demand of these applications, e.g., CPU usage from 60% to 90% and memory usage from 5 GB to 10 GB. As the services compete for shared resources, the increased load could induce a higher response time among the resource-sharing services, e.g., 1000 ms, up from the original 100 ms. This, in turn, could evolve into more severe problems like intermittent or permanent failures. Because anomalies jump across the applications' borders, one cannot rely on the individual call-graphs and performance metrics. To address this situation, the STIG model captures the dependencies originating both from the call-graph and the deployment graph (e.g., service placement configuration).

STIG SIMULATOR

Design and Architecture
The workflow of STIGS, depicted in Figure 2, represents a structured approach to modeling and analyzing interference in multi-node applications, which we detail next. The task Define Multi-Node Application generates the dependency graphs from the system architecture (System Archi.xml) and the deployment configuration (Deployment config.yaml). Based on that, we Instantiate the Semantic Model Template to extract distinct interference-enabling paths. The Graph Generator combines the set of distinct paths and the multi-tenant setup (Deployment config.yaml) to generate (1) a knowledge deployment graph (e.g., Figure 1) and (2) the time-annotated call-graphs, which serve as ground truth for the STIG generation process. The Impacted Pair Generator task identifies the candidate pairs of service nodes with the potential for mutual interference. The Interference Probability Calculator estimates the likelihood of interference by taking into account both the execution timings and the history of service anomalies. Finally, one or multiple instances of the STIG (e.g., Figure 3) are generated to represent distinct likelihood scenarios of anomalies induced by interference between services across applications.
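The Impacted Pair Generator step can be illustrated with a small sketch. Given a service placement and the caller-callee edges, it lists the ordered pairs of services that share a host, belong to different applications, and have no call relation, following Definition 1.1. The input layout, the function name `candidate_pairs`, and the sample placement are assumptions for illustration; the actual STIGS inputs are the XML and YAML files named above.

```python
from itertools import permutations


def candidate_pairs(placement, calls):
    """List ordered (source, target) pairs that could interfere.

    placement: {service: (application, host)}
    calls: set of (caller, callee) edges from the call-graphs
    """
    pairs = []
    for src, dst in permutations(sorted(placement), 2):
        app_s, host_s = placement[src]
        app_d, host_d = placement[dst]
        same_host = host_s == host_d
        cross_app = app_s != app_d
        unrelated = (src, dst) not in calls and (dst, src) not in calls
        if same_host and cross_app and unrelated:
            pairs.append((src, dst))
    return pairs


# Hypothetical placement: two shops spread over two hosts.
placement = {
    "productpage": ("shop1", "host1"),
    "reviews": ("shop1", "host2"),
    "teastore-webui": ("shop2", "host1"),
    "teastore-auth": ("shop2", "host2"),
}
calls = {("productpage", "reviews"), ("teastore-webui", "teastore-auth")}
result = candidate_pairs(placement, calls)
for pair in result:
    print(pair)
```

Both orderings of a co-located pair are kept because interference can flow in either direction; a later step can then weight each direction separately.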

Algorithms
To investigate the interference phenomenon, we identify the source and the corresponding impact of the interference through the proposed algorithms. In Algorithm 1, we compute the query-predicate stack, whose query and predicate entries act as sources and targets of interference, respectively, from the Knowledge Deployment Graph (kgraph) and a particular host node (Host1 in Figure 1). These stack computations depend on the execution order of calls at the specific host (lines 5 and 10). The effect of a source of interference on one or more target services is captured as a probability measure proportionate to the magnitude of shared resources within a time window. Consequently, longer time intervals and higher resource utilization entail a higher probability of interference (computed by Algorithm 2). This involves generating a list of the impacted node pairs (from the sourceStack and the targetStack) based on their overlapping execution times. The algorithm first sorts these stacks by their execution start time (line 2) and matches the current source node with the list of target nodes given their execution time conditions (lines 3-10). The probability of interference is derived for each source node (curSource) and its respective overlapping target nodes (curTargetList), also factoring in their levels of shared resource usage. With that, we can estimate the interference probabilities for the STIG (line 12, by calling Algorithm 3). This involves computing, for each source node (curSource), the list of target nodes (curTargetList) and their corresponding execution time overlap, as well as the magnitude of the resource usage shared between each source and target node (lines 3-6). The resulting list of impacted pairs is then returned by Algorithm 2 (line 15).
Using this information, we can construct Spatio-Temporal Interference Graphs (STIGs) as described in Algorithms 1, 2, and 3. The STIG, as seen in Figure 3, consists of nodes as services, solid edges as service calls within the same application, and dotted edges standing for interference paths. The weights on the interference edges can be initialized with prior probabilities based on temporal execution overlap across application services sharing the same resource (worker-node).
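The matching of sources to overlapping targets can be sketched as follows. The sketch sorts both stacks by start time, pairs each source with the targets whose execution overlaps it, and scores each pair from the overlap length and the joint resource demand. The scoring rule (overlap share of the target span multiplied by the capped joint CPU demand), the `Span` type, and the sample spans are assumptions made for illustration; they are not the formulas of Algorithms 2 and 3.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    start: float  # execution start time (ms)
    end: float    # execution end time (ms)
    cpu: float    # fraction of the host CPU used during the span


def overlap(a, b):
    """Length of the time window in which both spans execute."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def interference_probability(src, tgt, capacity=1.0):
    """Assumed rule: overlap share of the target span times joint CPU demand."""
    duration = tgt.end - tgt.start
    if duration <= 0:
        return 0.0
    share = overlap(src, tgt) / duration
    contention = min(1.0, (src.cpu + tgt.cpu) / capacity)
    return share * contention


def impacted_pairs(source_stack, target_stack):
    """Match each source with the targets whose execution overlaps it."""
    pairs = []
    targets = sorted(target_stack, key=lambda s: s.start)
    for src in sorted(source_stack, key=lambda s: s.start):
        for tgt in targets:
            if tgt.start >= src.end:
                break  # targets are sorted, so no later target can overlap
            p = interference_probability(src, tgt)
            if p > 0.0:
                pairs.append((src.service, tgt.service, p))
    return pairs


# Made-up spans on one shared host.
sources = [Span("shop2:front-end", 0, 100, 0.5), Span("shop2:catalogue", 100, 200, 0.25)]
targets = [Span("shop1:front-end", 0, 100, 0.5), Span("shop1:orders", 150, 250, 0.25)]
pairs = impacted_pairs(sources, targets)
print(pairs)
```

Under this assumed rule, two services that start together and jointly saturate the host receive probability 1.0, while a partial overlap with lower demand yields a proportionally smaller value.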

EVALUATION CASE STUDY
We deploy three popular benchmarks (BookShop, TeaShop, and SockShop) on a Kubernetes cluster and generate traces by injecting requests (10 to 1000) to their front webpages. Traces are collected based on the configurations in Table 1. The "Requests" column shows how many requests are made in each configuration. This starts at 10 requests in config1 and increases progressively, reaching up to 1000 requests in config11. The "Rate" of requests is every 1 min. These generated traces support our analysis in combination with the STIGs. The trace dataset is available in the simulator's GitHub repository.

STIG Analysis
To visualize the cause-effect phenomenon on the generated STIGs, we extracted only the source and target pairs of the front-end service based on the maximum interference effect and obtained all associated source and target pairs. As a reference, Figure 4 shows a structural dependency matrix (SDM [2]) representing the interference probabilities (STIG edges) between source and target services (STIG nodes) of SockShop and TeaShop, where darker colors represent higher probability. In the SDM, front-end-M1:shop1 shows the highest probability (1.0) of being interfered with by frontend-M2:shop2, which stems from the assumed determinism of these services starting simultaneously. Conversely, as the effect of interference propagates, the interference probability becomes lower, which reflects the smaller execution overlap between downstream services.
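A matrix of this kind can be assembled by averaging the edge probabilities over a set of STIG instances. The sketch below assumes that each instance is given as a dictionary from (source, target) pairs to probabilities and that an edge missing from an instance counts as probability 0; both choices, the function name `dependency_matrix`, and the sample values are illustrative assumptions.

```python
def dependency_matrix(stigs):
    """Average interference probabilities across a set of STIG instances.

    stigs: list of {(source, target): probability}
    Returns (sources, targets, matrix) with matrix[i][j] for sources[i], targets[j].
    """
    sources = sorted({s for g in stigs for s, _ in g})
    targets = sorted({t for g in stigs for _, t in g})
    matrix = [
        [sum(g.get((s, t), 0.0) for g in stigs) / len(stigs) for t in targets]
        for s in sources
    ]
    return sources, targets, matrix


# Two made-up STIG instances with the same edges but different probabilities.
stigs = [
    {("M2:front-end", "M1:front-end"): 1.0, ("M2:catalogue", "M1:orders"): 0.5},
    {("M2:front-end", "M1:front-end"): 1.0, ("M2:catalogue", "M1:orders"): 0.25},
]
sources, targets, matrix = dependency_matrix(stigs)
for name, row in zip(sources, matrix):
    print(name, row)
```

The rows and columns are sorted by service name so that matrices built from different STIG sets stay comparable cell by cell.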

Reconfiguration Plan
The reconfiguration plan involves ranking the services with respect to the highest probability of necessity and sufficiency of being the culprit of the anomaly induced by interference. Because interference happens both ways, the plan can attribute source and target to anomalous services on either side of an interference association. For this, we monitored and collected traces from the shops (Table 1) and performed a probabilistic analysis on them.
The Probability of Necessity (PN) is the chance that an effect (anomaly) on a target node ($Y = 1$, i.e., $Y$) is caused (interfered) by an anomaly on a source node ($X = 1$, i.e., $X$), given that there is a history of absence of anomaly on the target node ($Y = 0$, i.e., $Y'$) and an absence of anomaly on the source node ($X = 0$, i.e., $X'$). Formally, from Pearl [8], $PN(Y,X) = P(Y,X \mid Y',X')$. The Probability of Sufficiency (PS) is the reverse case, $PS(Y,X) = P(Y',X' \mid Y,X)$, while the Probability of Necessity and Sufficiency (PNS) is the weighted average $PNS(Y,X) = P(X,Y)\,PN(Y,X) + P(X',Y')\,PS(Y,X)$. Among the various approaches to compute these probabilities, we adopted the formulations in [8] (Section 19.3.3) that assume causal exogeneity and monotonicity. The formulations are the following: $PNS = P(Y \mid X) - P(Y \mid X')$, $PN = PNS / P(Y \mid X)$, and $PS = PNS / [1 - P(Y \mid X')]$.
The results in Table 2 show that PN is more than two orders of magnitude higher than PS and PNS. This means that one can focus primarily on tackling the necessary sources of the induced anomaly, i.e., product-page and reviews. To mitigate the interference-induced anomalies on teastore-webui, one could reconfigure the deployment graph so that this microservice is placed on a worker-node that hosts no instances of the product-page and reviews microservices. Meanwhile, the STIG simulated data also informs us that the other anomalies (e.g., on the teastore-auth and teastore-image services) are not induced by interference. For these cases, the solution is to add more resources (compute, memory) to their corresponding worker-nodes. For more details, an analysis is available in the data/traces folder of the artifact GitHub repository.
The only services with anomalies in BookShop are product-page and reviews, whereas in TeaShop only teastore-webui, teastore-auth, and teastore-image have anomalies. However, there were only two pairs of services with joint probabilities $P(X, Y) > 0$ (shown in Table 2).
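As a worked illustration of the three formulations under exogeneity and monotonicity, the sketch below estimates P(Y|X) and P(Y|X') from paired binary anomaly indicators and derives PNS, PN, and PS from them. The function name `causal_probabilities` and the sample observations are assumptions for illustration; they are not the traces or the results reported in Table 2.

```python
def causal_probabilities(samples):
    """Estimate (PN, PS, PNS) from (x, y) pairs with x, y in {0, 1}.

    x = 1 marks an anomaly on the source service, y = 1 on the target service.
    Uses PNS = P(Y|X) - P(Y|X'), PN = PNS / P(Y|X), PS = PNS / (1 - P(Y|X')).
    """
    samples = list(samples)
    n_x = sum(1 for x, _ in samples if x == 1)
    n_not_x = len(samples) - n_x
    if n_x == 0 or n_not_x == 0:
        raise ValueError("need observations with and without a source anomaly")
    p_y_x = sum(1 for x, y in samples if x == 1 and y == 1) / n_x
    p_y_not_x = sum(1 for x, y in samples if x == 0 and y == 1) / n_not_x
    # Monotonicity implies PNS >= 0; a negative value signals that the
    # assumption does not hold for the observed data.
    pns = p_y_x - p_y_not_x
    pn = pns / p_y_x if p_y_x > 0 else 0.0
    ps = pns / (1.0 - p_y_not_x) if p_y_not_x < 1 else 0.0
    return pn, ps, pns


# Made-up observations: 4 windows with a source anomaly, 4 without.
samples = [(1, 1), (1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
pn, ps, pns = causal_probabilities(samples)
print(f"PN={pn:.3f} PS={ps:.3f} PNS={pns:.3f}")
```

In this made-up sample, P(Y|X) = 0.75 and P(Y|X') = 0.25, so PNS = 0.5 and both PN and PS equal 0.5 / 0.75. Ranking candidate source services by these values gives one possible ordering for a reconfiguration plan.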

CONCLUSION AND FUTURE WORK
We presented a novel approach to the problem of service interference in multi-tenant microservice architectures, where concurrency over shared resources induces the propagation of elusive anomaly patterns. Our formalism and simulator contribute to the study of interference anomalies and to the mitigation of this complex emergent phenomenon. The artifact components and the interference simulation can be easily extended to new performance anomaly scenarios. In future work, we plan to study the scalability and latency of the simulator within larger and more heterogeneous deployments.

Figure 1: Knowledge Deployment Graph. Node colors distinguish the applications (shops), and maroon/red marks the host nodes. Dashed arrows denote hosting relationships and solid arrows denote caller-callee relationships.

Algorithm 1: Compute Query Predicate Stack
 1: procedure generateQPstack(kgraph, host, filepath)
    …
10:   ….host = List all nodes deployed on host
11:   exe.orders = exe orders in nodes at host
12:   Initialize Query.Predicate.stack as an empty list
13:   for each … in … do
14:     Get index of first service path in …
15:   end for
16:   for each … in … do
17:     Create … with service details
18:     if … on same path of service in … then
    …
        … start time for each entry in query and predicate lists
25:   Add a dictionary with query and predicate to …
26:   return …
27: end procedure

Figure 3: Spatio-Temporal Interference Graph (STIG). Nodes are services, solid edges are calls within one application, and dotted edges are interference paths.

Figure 4: Structural Dependency Matrix: consolidates the averages of interference across a STIG set.

Table 1: Configuration of Trace Generation

Table 2: Results for two interfering service pairs