VAMP: Visual Analytics for Microservices Performance

Analysis of microservices' performance is a considerably challenging task due to the multifaceted nature of these systems. Each request to a microservices system might raise several Remote Procedure Calls (RPCs) to services deployed on different servers and/or containers. Existing distributed tracing tools leverage swimlane visualizations as the primary means to support performance analysis of microservices. These visualizations are particularly effective for investigating the performance behavior of individual end-to-end requests. Still, they are substantially limited when more complex analyses are required, e.g., when system-wide performance trends must be understood. To overcome this limitation, we introduce vamp, an innovative visual analytics tool that enables, at once, the performance analysis of multiple end-to-end requests of a microservices system. vamp was built around the idea that a wide set of interactive visualizations facilitates the analysis of the recurrent characteristics of requests and their relation to end-to-end performance behavior. Through an evaluation of 33 datasets from an established open-source microservices system, we demonstrate how vamp aids in identifying RPC execution time deviations with significant impact on end-to-end performance. Additionally, we show that vamp supports the pinpointing of meaningful structural patterns in end-to-end requests and their relationship with microservice performance behaviors.


INTRODUCTION
Microservices have emerged as a pivotal change in the software industry, paving the way to a novel paradigm for structuring the software development process. This novel approach entails multiple independent teams responsible "from development to deploy" [31] of loosely coupled, independently deployable services [30,31]. Due to their modular nature, microservices are particularly well-suited for the modern software industry, where rapidly releasing software updates and enhancements is a critical competitive advantage [34].
Although beneficial in many aspects, microservices also introduce new challenges, especially when it comes to maintaining consistent software performance. This complexity arises from various elements. Firstly, the inherent complexity of these systems often hinders the adoption of proactive measures for performance assurance [38,45], such as pre-production performance testing [17,21,42]. Secondly, these proactive measures are often hampered by time and resource constraints due to the substantial pressure for fast time-to-market [34,40]. Thirdly, microservices systems typically exhibit an emergent performance behavior in the field that is hard to predict in advance [45]. Finally, these systems undergo continuous software changes, with multiple releases occurring on a daily basis, and handle highly variable workloads [3], which makes them more vulnerable to unforeseen performance regressions [43,45].
These challenges have led to an increased interest in the concept of observability [29], i.e., the ability to have a holistic understanding of the system's performance by analyzing its logs, traces, and metrics. Distributed tracing tools [32] are today widely used in practice to enhance the observability of microservices systems [28]. These tools track and record the propagation of requests as they flow through different RPCs and services of a microservices system [35], and provide visual aids to support performance analysis of end-to-end requests, e.g., swimlane visualizations [10,37,39].
Despite their utility, distributed tracing tools have recently been criticized for their limited support for performance analysis [10]. A common use case for these tools is the analysis of system-wide performance behavior [32], such as understanding the response time distributions of end-to-end requests [10]. However, current distributed tracing tools often fall short in this area, necessitating a switch between various visualization tools, which can make the process cumbersome and time-consuming [10]. Indeed, they primarily focus on the analysis of individual requests, which has limited value unless it is compared with the performance behavior of the entire corpus of requests [2,10,32].
In this paper, we introduce vamp, an innovative visual analytics tool designed to enhance the performance analysis of microservices systems. vamp extends the conceptual proposal of Leone and Traini [22]. The fundamental idea underpinning vamp is to simplify the understanding of the relationship between request characteristics and end-to-end response time behavior through interactive charts and color-encoding techniques. vamp comprises two main visualization components: an interactive tree that illustrates the workflow in terms of RPCs for multiple end-to-end requests, and an interactive histogram representing the end-to-end performance behavior of the requests under analysis. Interaction with these visual components aids in identifying the unique characteristics of certain RPC execution paths with respect to specific system performance behaviors.
We evaluate vamp using 33 datasets derived from TrainTicket [47], an open-source microservices system widely utilized in previous software engineering research [23,41,46]. Our findings demonstrate that vamp enables the identification of notable and recurrent request characteristics associated with specific end-to-end response time behaviors.
A video of vamp in action can be accessed at https://youtu.be/qMVOMt06EJE.


MOTIVATION
The swimlane visualization is the canonical way to visualize individual requests within distributed tracing tools [9,19,37,39]. Fig. 1 shows a representative example of a swimlane visualization. The visualization depicts a timeline of a single request, with RPCs depicted horizontally and sorted vertically to highlight their relationships. This type of visualization proves highly beneficial for the performance analysis of individual requests, allowing a detailed investigation into how each RPC affects the overall end-to-end response time.
However, these visualizations exhibit certain limitations when it comes to conducting more complex performance analyses. For instance, observing the performance of individual end-to-end requests might yield misleading insights if not contextualized appropriately [10]. Indeed, a request's response time can only be deemed anomalous when compared with other requests of the same type [2]. Additionally, engineers are often more inclined to investigate recurrent response time trends rather than focusing on the performance of individual requests [32]. Diverse end-to-end response time behaviors may be associated with specific request characteristics, such as particular RPC execution paths or RPC performance behaviors. Consequently, engineers may wish to identify these characteristics to uncover potential performance issues and gather a more comprehensive picture of the system performance [8,20,32].
Distributed tracing tools currently lack sufficient support for this type of analysis, which often necessitates the concurrent use of multiple visualizations and tools, such as Jaeger [39] and Kibana [13] [10]. A naive strategy involves initially recognizing repetitive performance behaviors for further investigation, followed by the examination of individual requests to characterize relevant performance behaviors. This can be accomplished by detecting "modes" within the end-to-end response time distribution (for instance, using Kibana), which represent meaningful recurring performance behaviors. Following this, samples of requests associated with each mode can be extracted and examined individually (for instance, using Jaeger's swimlanes) to identify distinct characteristics that contribute to specific performance behaviors or modes.
However, this method can be particularly laborious, as it requires manual inspection and comparison of multiple requests across diverse visualizations and tools. Moreover, even when the method is successful, it may not provide a satisfactory level of confidence. Indeed, determining the specific characteristics associated with a particular distribution mode necessitates verifying that these characteristics appear exclusively in requests that exhibit this particular end-to-end response time behavior. This task can be challenging with current tools. Consider the scenario illustrated in Fig. 2, which represents the distribution of end-to-end response times for a specific type of request, such as loading a website homepage. As can be observed in the figure, requests demonstrate four distinct response time behaviors, i.e., modes. Suppose that the rightmost mode is characterized by a unique request characteristic, specifically an RPC that exhibits a slower execution time. That is, this specific RPC shows increased execution time in all requests belonging to the rightmost mode (e.g., due to an expensive task), but not in others. With current distributed tracing tools, identifying patterns like this can be particularly challenging, as they lack targeted methods to simplify the analysis of RPC attributes, such as execution time, and their relationship with end-to-end response time.

VAMP
vamp aims to enhance performance analysis of microservices systems by simplifying the investigation of attributes pertaining to specific RPCs and their relationship with end-to-end response time. In this section, we first introduce the core insights that underpin vamp, its primary visual components, and the interaction modalities. Then, we describe how these visual components fit within the vamp dashboard, and detail the vamp architecture and implementation.

Visual Components
The core insight behind vamp is to make explicit the relationship between RPC attribute values and end-to-end response time. vamp leverages two main interactive components to highlight this relationship: a tree and a histogram. The tree provides an aggregated view of the requests' workflows in terms of RPC invocations, while the histogram displays a traditional distribution plot of the end-to-end response time. Users can interact with the tree to examine how specific attribute values, related to a particular RPC execution path, influence the end-to-end response time; we refer to this as forward analysis. Conversely, starting from the histogram, users can investigate how specific end-to-end response time behaviors are associated with certain RPC attribute values; this is referred to as backward analysis. In the following, we first describe in detail the characteristics of these two main visual components. Then, in the subsequent subsection, we detail the interaction modalities of vamp.

3.1.1 Tree. This visualization component takes inspiration from the Jaeger comparison tool [15], which allows users to compare two end-to-end requests and highlight their structural differences. We have redesigned this approach by extending its capabilities beyond the comparison of two requests, thereby allowing aggregated analysis of multiple end-to-end requests. In a nutshell, the vamp tree provides an aggregated view of the RPC workflows performed by a set of end-to-end requests, as shown in Fig. 3a. Each node of the tree represents an RPC invocation within a specific execution path, where the leftmost node represents the root RPC, and edges indicate direct RPC invocations. For instance, in Fig. 3a the node labeled RPC_C represents the execution path RPC_A → RPC_B → RPC_C. As can be observed in the figure, the same RPC can appear in multiple nodes (e.g., RPC_B), as it can be invoked within multiple different execution paths. An RPC execution path appears in the tree if and only if it is present in at least one of the requests being analyzed. It is worth noting that when a particular RPC invokes the same RPC multiple times, this leads to a single node in the tree. In other words, if RPC_A invokes RPC_B multiple times, there will be only one child node referring to RPC_B.
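To make the aggregation concrete, the following minimal sketch shows how multiple traces could be merged into such a tree. The input format is an assumption (each span carries an RPC "name" and a list of "children" spans); repeated invocations of the same child RPC within a request collapse into a single node, exactly as described above.

```python
from collections import defaultdict

def build_aggregated_tree(traces):
    """Merge RPC execution paths from many traces into one tree.

    `traces` is a list of root spans; each span is a dict with an RPC
    'name' and a list of 'children' spans (hypothetical input format).
    Returns a mapping: path tuple -> number of traces containing it.
    """
    path_traces = defaultdict(set)

    def visit(span, prefix, trace_id):
        path = prefix + (span["name"],)
        # The set collapses repeated invocations of the same child RPC
        # within one trace into a single tree node.
        path_traces[path].add(trace_id)
        for child in span["children"]:
            visit(child, path, trace_id)

    for trace_id, root in enumerate(traces):
        visit(root, (), trace_id)
    return {path: len(ids) for path, ids in path_traces.items()}
```

A path appears in the resulting tree if and only if it occurs in at least one of the analyzed traces, matching the rule stated in the text.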
vamp utilizes color encoding to highlight RPC execution paths that are worthy of investigation based on their attribute values. It currently supports the analysis of two kinds of attributes: execution time and frequency. The first denotes the (average) execution time of the RPC within a specific execution path in each request, while the second indicates the path frequency, i.e., how many times the path occurs within each request. We use color encoding to emphasize RPC execution paths with higher variance in their attributes.
The key intuition here is that RPC execution paths showing higher variance in their attributes are likely to manifest different behaviors that can potentially affect the end-to-end response time. For instance, a higher frequency of a particular RPC invocation within a request could result in a longer end-to-end response time. Similarly, a slower RPC execution time may correspond to a prolonged end-to-end response time.
We employ a continuous color scale to depict the variability in the associated attribute values. This scale is based on the Coefficient of Variation (CV) [14], i.e., a standardized measure of dispersion defined as the ratio of the standard deviation to the mean. As execution times in distributed systems are well known to be subject to long tails [11], when dealing with this attribute we apply outlier filtering by removing execution time values greater than the 99th percentile. A CV of 0 results in a white node, indicating no variability. Conversely, a CV greater than or equal to 1 results in a red node, suggesting high variability in the attribute values. The shade gradually transitions from white to red as the CV value increases.
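As an illustration, this encoding can be sketched as follows. The exact color interpolation used by vamp is not specified in the text, so the linear white-to-red RGB mapping below is an assumption; the percentile filtering and the CV clipping follow the description above.

```python
import statistics

def cv_color(values, percentile=0.99):
    """Map attribute-value dispersion to a white-to-red node shade.

    Values above the 99th percentile are filtered as outliers before
    computing the Coefficient of Variation (std / mean); the CV is then
    clipped to [0, 1] and used as the red intensity.
    """
    vals = sorted(values)
    cutoff = vals[int(percentile * (len(vals) - 1))]
    kept = [v for v in vals if v <= cutoff]
    mean = statistics.mean(kept)
    cv = statistics.pstdev(kept) / mean if mean else 0.0
    intensity = min(cv, 1.0)  # CV = 0 -> white, CV >= 1 -> full red
    g = b = round(255 * (1 - intensity))
    return f"rgb(255, {g}, {b})"
```

A constant attribute yields a white node, while strongly dispersed values push the node toward saturated red.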

3.1.2 Histogram.
The vamp histogram component (shown in Fig. 3b) depicts a traditional distribution plot of the end-to-end response time. These kinds of visualizations are frequently used in practice for performance analysis, and are provided by several tools, e.g., Kibana [13]. According to recent research [10], understanding the distribution of end-to-end response times stands as a core activity in modern performance analysis practice. The histogram component provided by vamp aims to facilitate this process by supporting the identification of specific performance behaviors that are worthy of investigation. The user can visually identify "modes" in the response time distribution, which indicate meaningful recurring performance behaviors, to start a targeted investigation of these requests, as we detail in the subsequent subsection.

Interaction modalities
vamp supports bidirectional analysis, allowing users to initiate their analysis from either the tree (forward analysis) or the histogram (backward analysis). In the following, we provide detailed descriptions of both these interaction modalities.

Forward analysis. Fig. 4a depicts an illustrative example of forward analysis. By examining the tree, the user can identify "suspicious" RPC execution paths that exhibit high variability in the corresponding attribute values. For instance, when analyzing the execution time attribute, the user can identify RPCs that show highly varying execution times, and, by clicking on the corresponding node, they can inspect the recurring execution time behaviors associated with the path, displayed in the form of a bar chart, as shown in Fig. 4a. Each bar refers to a specific execution time range (see y-axis labels), and it shows the percentage of requests with RPC execution time falling in that range. In order to identify meaningful ranges, we employ a widely-used clustering algorithm, namely K-means [26]. In particular, we run the algorithm on-the-fly after the user clicks, with k ranging from 2 to 5, and we select the result showing the highest silhouette score [33]. Each bar represents one of the resulting clusters. In Fig. 4a, we can observe that when the selected RPC has an execution time ranging between 400 and 600 milliseconds, it can lead to end-to-end response times between 700 and 800 milliseconds. Understanding these kinds of relationships would have been far more challenging using currently available tools. It is worth noticing that the same interaction modality also applies when analyzing different RPC attributes, such as occurrences.
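The range-identification step can be sketched as follows. A production implementation would likely rely on a clustering library such as scikit-learn, so the tiny 1-D K-means and silhouette routines below are self-contained stand-ins; they select k in 2..5 by silhouette score and return the value range of each cluster, as in vamp's bar chart.

```python
import random
import statistics

def kmeans_1d(values, k, iters=50, seed=0):
    """Plain 1-D k-means: a minimal stand-in for a library K-means."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [statistics.mean(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return [g for g in groups if g]

def silhouette(groups):
    """Mean silhouette coefficient over all points (1-D, absolute distance)."""
    if len(groups) < 2:
        return 0.0
    scores = []
    for gi, g in enumerate(groups):
        for v in g:
            if len(g) == 1:
                scores.append(0.0)
                continue
            a = sum(abs(v - w) for w in g) / (len(g) - 1)  # intra-cluster
            b = min(sum(abs(v - w) for w in og) / len(og)  # nearest other
                    for gj, og in enumerate(groups) if gj != gi)
            scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

def best_ranges(values, kmax=5):
    """Pick k in 2..kmax by silhouette and return each cluster's (min, max)."""
    best_score, best_groups = None, None
    for k in range(2, min(kmax, len(set(values))) + 1):
        groups = kmeans_1d(values, k)
        score = silhouette(groups)
        if best_score is None or score > best_score:
            best_score, best_groups = score, groups
    return sorted((min(g), max(g)) for g in best_groups)
```

For a sample with two clearly separated execution-time behaviors, the silhouette criterion selects k = 2 and the two returned ranges correspond to the two bars the user would see.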

Backward analysis.
In the backward analysis, the user can start their investigation directly from the histogram component. The user selects a specific range of end-to-end response times using a slider selector, as shown in Fig. 4b. This selection triggers an update in the tree component's color scheme, shifting its semantics from variability to divergence. In other words, the updated color scheme now denotes the degree of divergence in the attribute values of the selected set of requests (i.e., those with end-to-end response times in the selected range) when compared to those of the other requests. A red node indicates that the corresponding RPC execution path shows considerably different attribute values in the selected requests compared to the other requests, suggesting a possible relationship between the selected end-to-end response time range and the RPC execution path. Conversely, a white node indicates similar attribute values, and therefore a weak relation. We quantify the degree of divergence using the Kullback-Leibler divergence [7], where values close to 0 indicate nearly identical distributions (white), while values close to or higher than 1 indicate highly different distributions (red).
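The divergence-based coloring could be computed as in the sketch below. The paper does not specify how the two distributions are discretized, so the shared equal-width binning and the smoothing constant are assumptions.

```python
import math
from collections import Counter

def kl_divergence_color(selected, others, bins=10, eps=1e-9):
    """Kullback-Leibler divergence between binned attribute-value
    distributions, clipped to [0, 1] as the red intensity of a node.

    `selected` / `others` hold the attribute values observed in the
    selected requests vs. all remaining requests (hypothetical inputs).
    """
    lo, hi = min(selected + others), max(selected + others)
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def hist(values):
        counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
        # eps-smoothing avoids log(0) for empty bins
        return [(counts[b] + eps) / (len(values) + bins * eps)
                for b in range(bins)]

    p, q = hist(selected), hist(others)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return min(max(kl, 0.0), 1.0)  # ~0 -> white, >= 1 -> red
```

Nearly identical samples map to a value close to 0 (white node), while clearly shifted execution-time distributions saturate at 1 (bright red node).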
The user can then delve deeper into each RPC execution time behavior by clicking on the corresponding node. This action brings up two new histograms (in the bottom-right corner) representing the distributions of the execution time in the selected RPC execution path, respectively in the selected requests (in red) and in the other requests (in grey). In doing so, the user can effectively analyze how particular ranges of the end-to-end response time distribution correlate with specific RPC attribute values. The space in the bottom-right corner of the dashboard is intentionally left blank and is used to display supplementary visualization components during the interaction, e.g., the bar chart (for forward analysis) and the two histograms (for backward analysis).
It is worth noting that vamp is specifically designed to assist in analyzing requests of the same class, i.e., those originating from the same root RPC. As part of this process, the user is required to first select the root RPC and the RPC attribute (i.e., execution time or path frequency) to be investigated, before proceeding with the actual analysis.
The user can select the root RPC using either a dropdown menu (A) or a search text box (B). Similarly, the RPC attribute (execution time or frequency) to be analyzed can be selected using a dropdown menu (D). Additionally, the dashboard includes a date-time range selector (C), where the user can specify the start and end date-times. This feature allows for analyses at different time granularities (e.g., monthly, weekly, and daily) or over specific time ranges known to include system anomalies.
To enhance the user experience, vamp supports pinch gestures for zooming in and out of the tree. In addition, it allows the user to hide the RPCs invoked within a particular execution path by double-clicking on the related node.

Architecture and Implementation
Fig. 6 outlines the key architectural components of vamp. The Trace Collector (i.e., Jaeger [39]) continuously collects traces from the microservices system and stores them in a Trace Storage (i.e., Elasticsearch [12]). Given the large volume of data collected each day, we have devised a Preprocessing step to enhance the efficiency of interaction with vamp. This preprocessing step operates in batches and is intended to be executed periodically (e.g., hourly or daily). For each end-to-end request (i.e., trace), vamp recursively reconstructs all the involved RPC execution paths, along with their attribute values (namely, execution times and frequencies), and stores them in an Optimized Trace Storage based on MongoDB. Each path associated with a request is stored as a separate document in a dedicated MongoDB collection, and includes: the name of the path, the trace ID, the number of occurrences of the path in the trace, the observed execution time, the timestamp, and the name of the root RPC. Similarly, vamp stores the end-to-end response time values, along with related information, in a separate MongoDB collection. This information includes the root RPC, the trace ID, the response time value, and the timestamp. This data reorganization allows for greater flexibility in easily and efficiently querying the data needed for the vamp dashboard to function properly. As can be seen from Fig. 6, the Dashboard app directly queries the Optimized Trace Storage to efficiently generate visualizations.
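A simplified sketch of this per-trace preprocessing step is shown below. The input span format and the exact field names are assumptions based on the schema described above; the real preprocessing scripts parse Jaeger spans from Elasticsearch and write the documents to MongoDB.

```python
def trace_to_documents(trace_id, root_span, timestamp):
    """Flatten one trace into per-path documents mirroring the schema
    described above: path name, trace ID, occurrences, execution time,
    timestamp, and root RPC.

    `root_span` uses a simplified, hypothetical format:
    {'name': ..., 'duration_ms': ..., 'children': [...]}.
    """
    stats = {}  # path tuple -> [occurrences, total execution time]

    def visit(span, prefix):
        path = prefix + (span["name"],)
        occ_dur = stats.setdefault(path, [0, 0.0])
        occ_dur[0] += 1
        occ_dur[1] += span["duration_ms"]
        for child in span["children"]:
            visit(child, path)

    visit(root_span, ())
    root = root_span["name"]
    return [{"path": " -> ".join(path),
             "trace_id": trace_id,
             "occurrences": occ,
             "execution_time_ms": total / occ,  # average, as in the text
             "timestamp": timestamp,
             "root_rpc": root}
            for path, (occ, total) in stats.items()]
```

Each returned dict corresponds to one document in the dedicated MongoDB collection, so the dashboard can query paths by root RPC and time range without re-parsing raw traces.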
vamp currently supports distributed traces stored in the Jaeger [39] format using Elasticsearch [12] as Trace Storage, but it can easily be extended to other technologies. The dashboard and visual components have been developed using D3.js, which handles the visualization rendering, and Flask, which serves as the backend service. The preprocessing scripts are implemented in Python.

EVALUATION
The conducted evaluation is centered around one main research question: To what extent does vamp support performance analysis? We want to understand whether vamp can be successfully utilized to gain insights into the relationship between request attributes and end-to-end response time.
In the following, we first describe the methodology used to answer this question. Then, we report and discuss the results of the experimental evaluation. Finally, we describe the threats to validity of our study.

Methodology
To achieve our study goal, we generate 33 datasets of distributed traces, where each dataset reflects a distinct scenario that induces a specific variation in the relationships between request attributes and end-to-end response time.Subsequently, we manually analyze each dataset using vamp to evaluate the effectiveness of our tool in highlighting these relationships.
4.1.1 Dataset generation. The 33 datasets are generated from TrainTicket [47], which, to the best of our knowledge at the time of writing, is the largest and most complex open-source microservice-based system. TrainTicket provides a typical train ticket booking web service; it involves 41 microservices implemented in four programming languages, and it utilizes Jaeger [39] and Elasticsearch [12] for collecting and storing distributed traces. We have chosen TrainTicket as a representative case system due to its complexity and because it has recently been used in software engineering research [23,41,46,47].
Each dataset of our study contains distributed traces related to one specific root RPC of the system, which are stored on Elasticsearch using the standard Jaeger format.
To simulate different scenarios that induce different variations in the relationship between RPC attributes and end-to-end response time, we rely on two different approaches: (i) we inject synthetic performance issues into specific RPCs to increase the overall end-to-end response time, and (ii) we use complex mixtures of varying workloads that may alter the relationships between RPC execution times/occurrences and end-to-end response time. These two distinct approaches lead to the generation of two categories of datasets.
The first category of datasets is generated using a methodology similar to the one presented in [8,41]. Initially, the system's source code is modified to inject random performance issues. Following this, load-testing sessions are run to simulate user interactions with the system and generate distributed traces. Each injected performance issue affects approximately 10% of requests, introducing a delay into one specific RPC.
To generate a dataset, we first select two random RPCs that will be impacted by the performance issues. Subsequently, we choose a random delay to increase the end-to-end response time by x%, where x ∈ {10, 20, 30}. In addition, in half of the datasets, we inject a random delay of x% (with x ∈ {10, 20, 30}) into an asynchronous RPC, which does not produce any effect on the end-to-end response time. This is a common practice used to test the robustness of pattern detection approaches in the context of microservices systems [8,20,41]. After modifying the system accordingly, we conduct load-testing sessions to generate the distributed trace datasets. Each load-testing session involves 20 synthetic users, simulated by Locust [18]. Each user makes a request to the system and randomly waits between 1 and 3 seconds before making the next request. Each session lasts for 20 minutes. Using this methodology, we generate 20 datasets featuring various combinations of performance issues that affect different RPCs with different delays. For a more detailed explanation of this process, we refer readers to the work of Traini and Cortellessa [41]. Due to space constraints, we do not elaborate further here.
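For illustration, the random scenario selection described above could be sketched as follows. The RPC names are hypothetical and the actual generation scripts of [41] may differ; the sketch only mirrors the sampling rules stated in the text (two impacted synchronous RPCs, slowdowns drawn from {10, 20, 30}%, and an optional delayed asynchronous RPC).

```python
import random

def sample_scenario(rpcs, seed=None, with_async=False):
    """Draw one injection scenario: two distinct impacted RPCs, each
    with a target end-to-end slowdown x% (x in {10, 20, 30}), and
    optionally one asynchronous RPC delayed with no end-to-end effect.
    """
    rng = random.Random(seed)
    impacted = rng.sample(rpcs, 2)  # two distinct synchronous RPCs
    scenario = {rpc: rng.choice([10, 20, 30]) for rpc in impacted}
    if with_async:
        # Delay on an async RPC: used to test robustness, since it
        # should not correlate with end-to-end response time.
        remaining = [r for r in rpcs if r not in scenario]
        scenario[rng.choice(remaining)] = rng.choice([10, 20, 30])
    return scenario
```

Sampling 20 such scenarios (half with `with_async=True`) would yield the combinations of affected RPCs and delay magnitudes used for the first dataset category.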
The second category of datasets does not involve any performance issue injection; instead, it is generated using a more elaborate workload generator. Similarly to recent studies [24,25], we use load mixtures that involve multiple types of simulated users (i.e., load drivers), where each user type performs different classes of requests on the system. For example, some types of users may only visit the homepage and subsequently search trains for some random locations, while others first log into the system and then book random tickets. Besides this, we also ensure that the number of simulated users per type keeps changing over time. In this way, workloads more closely resemble real-world ones, as they generate mixtures of different classes of requests that change over time [3]. To this aim, we slightly modified PPTAM [4], a workload generator that involves 5 different user types, to continuously change the number of users of each type at run-time. Overall, the number of simultaneous users ranges from a minimum of 20 to a maximum of 31, and each load-testing session lasts for 1 hour. The workload fluctuations over time are randomly generated upfront. This process leads to 13 distinct datasets, each one related to a different API.
The datasets were generated on a bare-metal machine running Linux Ubuntu 18.04.2 LTS on a dual Intel Xeon CPU E5-2650 v3 at 2.30 GHz, with a total of 40 cores and 80 GB of RAM. All non-mandatory background processes except SSH were disabled, and we ensured that no other users interacted with the dedicated machine during our experiments.
To enhance clarity throughout the rest of the article, we will use specific notations for the different categories of datasets. Datasets characterized by performance issues (i.e., the first category) will be referred to as Pi, where 1 ≤ i ≤ 20. Conversely, datasets free from performance issues (i.e., the second category) will be denoted as Wj, with 1 ≤ j ≤ 13.

Manual analysis.
To assess the effectiveness of our approach, two authors conducted manual inspections of the 33 distributed trace datasets using vamp. Our evaluation focused on determining the extent to which vamp facilitated the comprehension of the relationship between RPC execution time/frequency and end-to-end response time.
It is worth noting that neither author was aware of the specific performance issues or the workload variations present in each dataset. This is because the dataset generation process, including performance issue injections and load testing modifications, was entirely random and automated. Nonetheless, both authors were familiar with the TrainTicket system before the study.

Results
vamp has proven to be effective in highlighting the relationship between RPC execution time and end-to-end response time throughout all the datasets featuring injected performance issues. The analysis was straightforward for the majority of the datasets (18 out of 20), demanding minimal interaction with vamp. In these datasets, both forward and backward analysis demonstrated comparable effectiveness, with no noticeable difference in the effort needed to understand these relationships. Due to space constraints, we are unable to present the exhaustive results of our analyses across all datasets. However, we have included a selection of representative examples that underscore both the utility and potential challenges associated with employing vamp. Additionally, for the sake of completeness, we have made available screenshots capturing interactions with vamp across all the datasets in a supplementary replication package [44]. Fig. 7 showcases an example of forward analysis using one of the first-category datasets. As depicted in the left screenshot, vamp significantly streamlines the identification of the two RPCs impacted by performance issues, i.e., the ones highlighted in bright red. Following this, the user can select these nodes to investigate correlations between specific RPC execution times and end-to-end response times. For example, the screenshot on the right of Fig.
7 reveals that the selected RPC execution path (highlighted in green) exhibits two distinct execution time behaviors: in 9.87% of the requests, the RPC getRouteByTripId has an execution time ranging from 27.46 to 33.67 milliseconds, while in the remaining 90.13% of requests, the execution time ranges from 2.62 to 11.25 milliseconds. This screenshot displays the view of vamp during the investigation of the first behavior, that is, after clicking on the corresponding bar (highlighted in red). As evident from the figure, vamp reveals that all the requests with an execution time ranging from 27.46 to 33.67 milliseconds in the selected RPC execution path fall within a specific region of the end-to-end response time distribution, as shown by the red highlight in the histogram. Understanding these kinds of relationships would have been particularly challenging using traditional performance analysis tools. Fig. 8 offers another example of how vamp allows users to rapidly identify the RPC responsible for a particular end-to-end response time deviation. Specifically, this figure demonstrates an instance of a vamp backward analysis using the dataset P9. The left screenshot shows that by selecting a specific range of end-to-end response times, the user can immediately pinpoint the RPC execution paths that display significantly divergent behavior in the execution time (highlighted in bright red). The screenshot on the right displays the investigation of one of these nodes (i.e., the one highlighted in green), illustrating how vamp assists users in comprehending the correlation between specific RPC execution times and the selected range of end-to-end response times. For instance, it is noticeable that when the RPC getRouteByTripId has an execution time exceeding 27 milliseconds, it results in an end-to-end response time that falls within the range of 137 to 168 milliseconds.
Another interesting aspect of vamp is its ability to identify execution time fluctuations in RPCs that have no influence on the end-to-end response time. For instance, Fig. 9 illustrates two distinct execution time behaviors in the selected RPC calculateSoldTicket: one ranging between 33.42 and 55.95 milliseconds, and another between 1.03 and 14.96 milliseconds. Through the use of vamp, we were able to easily notice the lack of correlation between the execution time of this RPC and the end-to-end response time. As illustrated in Figures 9a and 9b, the selected execution time behaviors (i.e., the bars highlighted in red) are evenly distributed across the end-to-end response time distribution, implying a lack of notable correlation with specific regions of the end-to-end response time. This indicates that even if the RPC execution time varies drastically from one request to another, it does not have any significant impact on the end-to-end response time.
In two datasets, specifically P4 and P19, the analysis was more complex, requiring a higher number of interactions with vamp. The peculiarity of these datasets was that the two performance issues led to increased end-to-end response times that overlap within the same range. This made the connection between RPC execution time and end-to-end response time more challenging to understand. We do not include the specific details of these cases in the paper because of limited space, but we refer the reader to our supplementary materials [44] for the related screenshots.
Regarding the 13 datasets in the second category, we found that a substantial majority of them, precisely 11 datasets, feature a unique mode in the end-to-end response time distribution. Given the objectives of our analysis, these cases were not considered. Consequently, we used two datasets for our evaluation. The first dataset, 1, consists of requests originating from the root RPC getByCheapest, while the second dataset, 2, comprises requests initiated from queryInfo. vamp enabled us to characterize the correlation between the frequency of each RPC execution path and specific modes of the end-to-end response time. Fig. 10 provides an example of this characterization, illustrating that each mode of the end-to-end distribution corresponds to a specific number of invocations of a selected RPC execution path (highlighted in green). For example, as depicted in Fig. 10c, the right-most mode is characterized by 14 invocations of the path queryInfo → queryForStationId. Similarly, the center mode is characterized by 6 invocations of this path, while the left-most mode is marked by 2 invocations. Uncovering such patterns using traditional observability tools would be notably challenging.
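The frequency analysis described above can be sketched as follows. Again, this is a minimal illustration, not vamp's implementation: the data layout (each request carrying the list of RPC execution paths it invoked) and the helper name `frequency_analysis` are assumptions made for the example.

```python
from collections import Counter, defaultdict

def frequency_analysis(requests, path):
    """Group end-to-end response times by how many times a given RPC
    execution path was invoked within each request.

    `requests` is a list of dicts with keys:
      - "e2e":   end-to-end response time (ms)
      - "calls": list of RPC execution paths invoked by the request
    """
    by_count = defaultdict(list)
    for req in requests:
        n = Counter(req["calls"])[path]  # 0 if the path never occurs
        by_count[n].append(req["e2e"])
    return by_count

# Toy data mimicking the three modes described in the text: more
# invocations of the path shift the end-to-end mode to the right.
path = "queryInfo->queryForStationId"
requests = (
    [{"e2e": 60.0, "calls": [path] * 2} for _ in range(5)]
    + [{"e2e": 120.0, "calls": [path] * 6} for _ in range(5)]
    + [{"e2e": 220.0, "calls": [path] * 14} for _ in range(5)]
)
modes = frequency_analysis(requests, path)
print(sorted(modes))  # [2, 6, 14]
```

Each key of the resulting grouping corresponds to one structural variant of the request, and its end-to-end times form one mode of the distribution.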
Summing up, we answer our RQ as follows: vamp proved to be effective in supporting performance analysis of microservices. In 18 out of the 20 datasets involving performance issues, we were able to rapidly identify the affected RPCs, their corresponding execution time behaviors, and their relationship with end-to-end response time. However, in a few specific cases (2 out of 20 datasets), the analysis proved to be more challenging, necessitating a greater number of interactions with vamp. Moreover, our evaluation demonstrates how vamp can facilitate an understanding of how structural differences in requests (i.e., varying frequencies of RPC execution paths) influence end-to-end response time.

Threats to Validity
4.3.1 Construct validity. We conducted the evaluation in-house rather than with external participants. Our familiarity with the tool and the experimental setup could potentially introduce bias into the evaluation outcomes. To mitigate this threat, we randomly generated 20 diverse scenarios, each involving different combinations of performance issues. Additionally, the two authors who conducted the evaluations were kept uninformed about the RPCs affected by the performance issues. Using artificial delays as part of the evaluation may not perfectly mirror real-world performance issues. However, this methodology aligns with prevailing practices in software engineering research, as evidenced by several studies [21,25,27,41]. Furthermore, in contrast to many previous studies [21,25,27], which typically employ a limited set of predetermined regressions with fixed magnitudes, our approach offers a more comprehensive evaluation on 20 diverse scenarios involving different combinations of RPCs and delay magnitudes.

4.3.2 Internal validity. The workloads used in our experimental setup may not be representative of real-world workloads. To (partially) mitigate this limitation, we performed an additional analysis using mixtures of continuously changing workloads, generated through PPTAM [4]. Our evaluation may be subject to confirmation bias, wherein the authors may unconsciously confirm their pre-existing beliefs about the effectiveness of vamp. Nonetheless, the results obtained using vamp unambiguously demonstrate its effectiveness across the majority of the evaluated datasets. In the interest of transparency, and to enable readers to independently assess this evidence, we have made all screenshots documenting the use of vamp across the datasets in our study publicly available [44].

4.3.3 External validity. We cannot ensure that vamp achieves the same effectiveness on datasets outside our experimental setup (e.g., real-world scenarios). Nevertheless, through an evaluation on 33 datasets, we have demonstrated that our approach effectively aids the performance analysis of microservices systems. vamp's efficiency was evaluated on datasets of varying sizes, ranging from 11,181 to 22,348 requests. It is worth noting that real-world microservices systems may involve a much larger volume of requests. As part of our future work, we plan to enhance the scalability of vamp by incorporating sampling techniques and optimizing preprocessing procedures.

RELATED WORK
Previous research on visualization for distributed systems has primarily focused on analyzing individual requests or comparing two requests. The swimlane visualization, a widely used technique for representing individual end-to-end executions, was originally proposed by Sigelman et al. [37]. Today, most distributed tracing tools offer this visualization. TraVista [2] enhances the standard swimlane visualization by augmenting it with information that helps users contextualize the performance of the analyzed request in relation to others. Beschastnikh et al. [6] introduced ShiViz, a novel visualization tool that includes an interactive time-space diagram for visualizing individual end-to-end executions of a distributed system. Sambasivan et al. [36] studied and compared three visualization approaches (i.e., side-by-side view, difference view, and animation) for comparing two request-flow traces. Jaeger [39] provides a feature to visually compare the structural characteristics of two requests [15].
Several visualizations have also been introduced to analyze the performance behavior of multiple end-to-end requests in aggregate. One example is TransVis by Beck et al. [5], a visualization technique for specifying and analyzing transient performance behaviors in microservice systems. Other examples of visual techniques for aggregate performance analysis can be found in commercial APM tools [1], such as Dynatrace, AppDynamics, and Instana. For instance, Dynatrace's Service Flow feature [16] allows users to display aggregate workflows of end-to-end requests along with their associated characteristics.
To the best of our knowledge, despite the many existing visualization techniques for microservices performance analysis, there is still a lack of dedicated visualizations to analyze the correlation between requests' attributes and their end-to-end performance behavior, which is the goal of our study.
Other related work includes the recent survey by Davidson and Mace [9], which underscores the critical role of visualization within systems research, and the qualitative interview study conducted by Davidson et al. [10], which highlighted the limitations of current distributed tracing tools.

CONCLUSION
In this paper, we presented vamp, a novel visual analytics tool for microservices performance analysis. vamp overcomes the limitations of current distributed tracing tools by providing a wide set of interactive visualizations that enables effective performance analysis of multiple end-to-end requests. Through an evaluation of 33 datasets generated from an established open-source microservices system, we demonstrated how vamp can be effectively used to understand the relationship between RPC attributes and end-to-end response time. For future work, we plan to enhance the efficiency of our tool to facilitate its transition to practice. As part of this process, we intend to validate our future improvements using real-world distributed traces from large-scale microservices systems, similar to those shared by Alibaba [27]. To aid reproducibility, we provide the data and source code needed to replicate our findings [44].

Figure 2: End-to-end response time distribution.

Fig. 5 outlines the vamp dashboard. As can be observed in the figure, the two main visual components, namely the tree and the histogram, are positioned in the center-left and the upper-right corner, respectively. The space in the bottom-right is intentionally left blank; it is used to display supplementary visualization components during the interaction, e.g., the bar chart (for forward analysis) and the two histograms (for backward analysis). It is worth noting that vamp is specifically designed to assist in analyzing requests from the same class, i.e., those originating from the same root RPC. As part of this process, the user is required to first select the root RPC and the RPC attribute (i.e., execution time or path frequency) to be investigated before proceeding with the actual analysis. The user can select the root RPC using either a dropdown menu A or a search text-box B. Similarly, the RPC attribute (execution time or frequency) to be analyzed can be selected using a dropdown menu D. Additionally, the dashboard includes a date-time range selector C, where the user can specify the start and end date-times. This feature allows for analyses at different time granularities (e.g., monthly, weekly, and daily) or over specific time ranges known to include system anomalies. To enhance the user experience, vamp supports pinch gestures to zoom in and out of the tree. In addition, it allows the user to hide the RPCs invoked within a particular execution path by double-clicking on the related node.

Figure 7: Forward analysis on execution time for dataset 2.

Figure 8: Backward analysis on execution time for dataset 9.

Figure 9: Forward analysis on execution time for dataset 1.

Figure 10: Forward analysis on frequency for dataset 2.
Davidson et al.'s findings raised several open research challenges spanning multiple research areas, including visualization research.