CAMEO: A Causal Transfer Learning Approach for Performance Optimization of Configurable Computer Systems

Modern computer systems are highly configurable, with hundreds of configuration options that interact, resulting in an enormous configuration space. As a result, optimizing performance goals (e.g., latency) in such systems is challenging due to frequent uncertainties in their environments (e.g., workload fluctuations). Recently, transfer learning has been applied to address this problem by reusing knowledge from configuration measurements collected in source environments, where it is cheaper to intervene than in the target environment, where any intervention is costly or impossible. Recent empirical research has shown that statistical models can perform poorly when the deployment environment changes, because the behavior of certain variables in the models can change dramatically from source to target. To address this issue, we propose CAMEO, a method that identifies causal predictors that remain invariant under environmental changes, allowing the optimization process to operate in a reduced search space and leading to faster optimization of system performance. We demonstrate significant performance improvements over state-of-the-art optimization methods on MLPerf deep learning systems, a video analytics pipeline, and a database system.


Introduction
Modern computer systems are continuously deployed in heterogeneous environments (e.g., cloud, FPGA, SoC) and are highly configurable across the software/hardware stack [28,48]. In such highly configurable systems, optimizing performance indicators, e.g., latency and energy, is crucial for faster data processing, better user satisfaction, and lower application maintenance cost [15,61]. One possible way to achieve these goals is to tune the system with configuration options across the stack, such as cpu frequency, swappiness, and memory growth, to achieve optimal performance [6,10,65].

[Figure 1: The optimal configuration for the MLPerf Object Detection pipeline deployed on TX2 is not optimal on Xavier.]
Finding an optimal configuration in a highly configurable system, however, is challenging [2,8,21,29,60,62]: (i) each component in the system stack (software, hardware, OS, etc.) has many configuration options that interact with each other, giving rise to a combinatorial configuration space; (ii) estimating the effect of configurations on performance is expensive, as one needs to collect the run-time behavior of the system for each configuration; and (iii) unknown constraints exist among configuration options, giving rise to many invalid configurations. Moreover, to meet growing user requirements and reduce service management costs, underlying systems often undergo environmental changes, that is, hardware updates, changes in deployment topology, etc. [14]. Therefore, optimizing the performance of these evolving systems becomes even more challenging, since there is no guarantee that the optimal configuration found in one environment will remain optimal in a different environment [29,30,32], as shown in Figure 1.
To address these challenges, in real-world deployment scenarios, developers often use a staging (development) environment, a miniature of the production environment, for testing and debugging. Developers collect many experiments and performance evaluations in staging environments (hereafter called source environments) to understand the performance behavior of the system (what configurations potentially produce performance anomalies, what configurations produce stable performance, or where good configurations lie). Developers then use that knowledge in target production settings for downstream performance optimization or debugging. However, in most cases, the staging environment results are quite different from the production results, giving a misleading or even wrong indication of which configurations produce optimal performance. These differences occur mainly due to hardware gaps or workload differences between the development environment and the production environment. For example, the workload of an ML system may surge, and as a result, the batch size behind the model server needs to increase to sustain the latency requirement; however, due to the different memory hierarchies and CPU cores of the source and target environments, the optimal setting for inter-op parallelism of the model server would be vastly different in each environment [52].

Existing works and gap. Performance optimization in configurable systems. Several approaches have been proposed for performance optimization of configurable systems, e.g., Bayesian optimization (BO) [4,25,29,32,44,64,66], BO with regression [15], prediction models [7], search space modification [24], online few-shot learning [6], and uniform random sampling and random search algorithms [47]. However, using these approaches in a production environment requires many queries, which are often too expensive to collect or may be infeasible to perform. The optimal configuration found by these
methods in a source environment is also suboptimal for the target, as the optimal configuration determined in the source environment usually no longer remains optimal in the other (see Figure 1 for an example).

Transfer Learning for Performance Analysis. In real-world deployment scenarios, developers typically have access to performance evaluations of different configurations from a staging environment. Exploiting this additional information using transfer learning can result in efficient optimization, as demonstrated by recent work [26,32,33,39,40,43]. For example, the search for optimized performance in the target setting can use the summary statistics of models built on the performance of the source [67]. However, each environmental change can potentially cause a distribution shift. The ML models used in these transfer learning methods are vulnerable to spurious correlations, which do not hold under distribution shift and result in inferior performance [27,45,68] (see Section 2.1 for an example).
Usage of Causal Analysis in Configurable Systems. To address the problem of spurious correlations, recent work has leveraged causal inference [16,27,53] to build a causal performance model that captures the dependencies among configuration options, system events, and performance objectives. However, the causal graphs in the source and target can still have some differences (see Figure 3 for an example). Recent work [27] shows that the source causal model can be reused for performance debugging in the target environment; however, further measurements are needed for learning and optimizing the performance model. In summary, all these existing works are suboptimal for performance optimization when the environment changes, because the knowledge extracted from the source (i.e., the optimal configuration) has changed and cannot be directly applied to the target, the model (i.e., an ML-based transfer learning model) may capture spurious correlations, or the model (i.e., a causal model) is mostly stable but needs further adaptation in the target environment (see Table 1).

Our approach. An ideal optimization approach should leverage the knowledge derived from the source, which is a close replica of the target environment with a cheaper experimentation cost. Our key insight is that, by using causal reasoning, we should be able to identify the non-spurious invariances across environments that truly impact the performance behavior of the system. These invariances can then be transferred to the target environment for performance optimization tasks, reducing the need for observational data in the production environment. We thus reduce the cost of optimization tasks without compromising accuracy.
To this end, we propose Cameo (Causal Multi-Environment Optimization), a causal transfer-based optimization algorithm aimed at overcoming the limitations of prior approaches. Our approach builds on two previous works: JUMBO (a multitask BO method) [19] and CBO (a causal BO method) [3]. A typical BO approach consists of two main elements: the surrogate model and the acquisition function. The surrogate model predicts the performance objective for a given configuration, and the acquisition function assigns a score to each configuration and chooses the one with the highest score to query in the next iteration. In Cameo, we first build two causal performance models to learn the dependencies among the configuration options, system events, and performance objectives for each environment, using the previous performance measurements of the source environment and a considerably smaller number of measurements of the target environment. After that, we simultaneously train two Causal Gaussian Processes (CGPs), which leverage the causal performance models when estimating means and variances, as two surrogate models: a warm CGP in the source and a cold CGP in the target. The acquisition function combines the individual acquisition functions of both CGPs to leverage knowledge from both the source and the target. Combining the individual acquisition functions this way allows Cameo to rely on the core features from the source environment that remain stable across environments while updating its belief about the environment-specific features in the target, making the optimization more effective.

Evaluation. We evaluated Cameo in terms of its effectiveness, sensitivity, and scalability, and compared it with four state-of-the-art performance optimization techniques (Smac [25], ResTune-w/o-ML and ResTune [67], Cello [15], and Unicorn [27]) using five real-world highly configurable systems, including three MLPerf pipelines (object detection, natural language processing, and speech
recognition), a video analytics pipeline, and a database system, deployed on edge and cloud under different environmental changes. Our results indicate that Cameo improves latency by 3.7× and energy consumption by 5.6× on average compared to the best baseline optimization approach, ResTune.

Contributions. Our contributions are as follows: • We propose Cameo, a novel causal transfer-based approach that allows faster optimization of software systems when the environment changes. Cameo is one of the first approaches to use causal transfer learning for performance optimization of configurable systems. • We conduct a comprehensive evaluation of Cameo by comparing it with state-of-the-art optimization methods on five real-world highly configurable systems under a range of environmental changes, study the effectiveness of design explorations under different varieties and severities of environmental changes, and show the scalability of our approach to colossal configuration spaces.
The artifacts and supplementary materials can be found at https://github.com/softsys4ai/CAMEO.

Motivation and Insights
In this section, we motivate our approach by illustrating why causal reasoning can contribute to more effective optimization of system performance. In particular, we focus on how the properties of causal performance models can be leveraged across environments. For this purpose, we used the MLPerf Object Detection [50] pipeline as part of the MLPerf Inference Benchmark (https://mlcommons.org/en/inference-edge-30), following the benchmark rules (https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc), with the following setup: Model: Resnet50-v1.5; Test scenario: Offline; Metric: inference latency; Workload: 5000 ImageNet samples; Workload generator: MLPerf Load Generator; Source hardware: Jetson TX2; Target hardware: Jetson Xavier and TX1. For better control, we limit the configuration space to 28 options across the stack: 4 hardware options (e.g., cpu cores), 22 OS options (e.g., dirty ratio), and 2 compiler options (e.g., allow memory growth). We sampled 2,000 random configurations and measured the inference latency in each environment. We also collected performance counters and system event statistics using the Linux perf profiler (https://perf.wiki.kernel.org/).

Why is performance optimization using causal reasoning more effective?
To deploy a configurable computer system such as MLPerf Object Detection in a new environment with low latency and energy consumption, the dominant approach is to train a performance model using a limited number of samples, use the model to predict the performance of unmeasured configurations, and select the configuration with the optimal performance. To show how spurious features can mislead performance optimization, we investigate the impact of confounders and how they make it difficult for an ML model to determine the accurate relationship between configuration options and performance objectives. We perform a sandbox experiment where we carefully tune swappiness and dirty ratio, both in the source and in the target, while leaving all other options at their default values. Here, the observational data collected from the experiment indicate that as IPC (one of the system events) increases, latency increases, which is a spurious proportional relationship. Relying on spurious features (IPC in this example) can lead to poor performance predictions (one might try to reduce IPC expecting lower latency but end up with higher latency) when the environment changes, because such features are susceptible to correlation shifts, i.e., the direction of a correlation may change across environments. As shown in Figure 2(a)-(b), a correlation shift occurs in this sandbox experiment: IPC is positively correlated with latency in the source but negatively correlated in the target.
(For reference: swappiness is the rate at which the kernel moves pages into and out of physical memory; the higher the value, the more aggressively the kernel moves pages out of physical memory to the swap memory. dirty ratio is the percentage of physical memory that dirty pages may consume before all processes must write dirty buffers back to the disk. IPC, instructions per cycle, is the average number of instructions executed per clock cycle.)

To investigate the reason behind the correlation shift, we group the data based on their swappiness values (50% and 80%, respectively) and observe that the correlation between swappiness and latency remains the same (larger swappiness implies higher latency in both environments), whereas the correlation between swappiness and IPC reverses (from proportional to inversely proportional), as shown in Figure 2(a)-(b). Figure 2(c) shows the causal structure, where swappiness is a common cause of both IPC and latency. swappiness should be considered for latency since its effect remains invariant across environments. On the contrary, the relationship between IPC and latency is environment dependent, and their correlation can change when another confounding variable, dirty ratio, differs between source and target. In our example, since the source has 4× lower physical memory than the target, the memory allocated for dirty pages fills sooner and must be written back to the disk. As a result, the source has a higher IPC for a lower value of swappiness, as the dirty pages are flushed before the swappiness limit is reached. However, the application makes no forward progress during flushing, resulting in increased latency. In the target (due to larger memory), the dirty pages might never fill up, and only swappiness would cause IPC to be positively correlated with latency. The example in Figure 2 shows that the causal model can better capture the data generation process, as it relies only on invariant causal mechanisms (swappiness for latency) and can remove spurious correlations (IPC for latency) that are specific to a particular environment. Therefore, causal models can predict the consequences of interventions (what-if scenarios) that set variables to particular values, enabling effective search during optimization and better exploration in limited-budget scenarios.
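The correlation shift above can be reproduced with a small synthetic simulation (a sketch with made-up numbers, not the paper's measurement data): latency is driven by swappiness through an invariant mechanism, while the swappiness-to-IPC mechanism flips sign between environments, which flips the IPC–latency correlation.

```python
import random

def simulate(env, n=2000, seed=0):
    """Toy generator: swappiness -> latency is invariant across environments,
    but the swappiness -> IPC mechanism flips sign (confounded by memory)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        swap = rng.uniform(0, 100)                        # swappiness setting
        latency = 10 + 0.05 * swap + rng.gauss(0, 0.5)    # invariant mechanism
        slope = 0.02 if env == "source" else -0.02        # environment-specific
        ipc = 1.0 + slope * swap + rng.gauss(0, 0.1)
        rows.append((swap, ipc, latency))
    return rows

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

for env in ("source", "target"):
    swap, ipc, lat = zip(*simulate(env))
    # corr(swappiness, latency) keeps its sign; corr(IPC, latency) flips.
    print(env, round(pearson(swap, lat), 2), round(pearson(ipc, lat), 2))
```

A model that picked IPC as a predictor would be misled in the target, while one that uses the invariant swappiness mechanism would not.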
To show the benefits of correctly identifying the invariant features, we train different ML-based regressors, e.g., the Gaussian Process Regressor (GPR) and the Random Forest Regressor (RFR), using data collected for the sandbox system deployed on TX2, and determine their prediction error on TX1 and Xavier (shown in Table 2). Here, we observe that the ML-based regressors have considerably higher errors in the target environment despite low source errors. The prediction error increases further as the distributions become more dissimilar (indicated by a higher KL-divergence value). In contrast, the causal approach, the Causal Gaussian Process Regressor (CGPR), has a considerably lower error that remains stable as the degree of distribution shift increases.

Takeaway 1: Causal models generalize better in performance prediction tasks across environments by distinguishing invariant from spurious features.

Learning from Causal Structural Properties in Various Environments
As we have established that a causal model can be reliably used for performance predictions in new environments, we next study the properties of the causal graph that can be exploited for faster optimization. We build a causal graph using a causal structure discovery algorithm [56] in the source and the target, respectively, and compare them. As shown in Figure 3, both causal graphs are sparse (the white squares indicate that no dependency relationship exists) and share a significant overlap (the blue squares indicate the edges present in both). Therefore, a causal model developed in one environment can be leveraged in another as prior knowledge. However, reusing the causal graph entirely might induce some wrong biases, as the causal graphs in the two environments are not identical (the green and red squares indicate the edges present uniquely in the source and the target, respectively). We must discover the new causal connections (indicated by the red squares) from observations. Since the number of edges that must be discovered is small, this can be done with a small number of observational samples from the target environment.
To eliminate biases, we need to remove the unique edges of the source. Removal can be accomplished by performing interventions that estimate the effects of deliberate actions. For example, we measure how the distribution of an outcome (e.g., latency Y) would change if we intervened during the data collection process by forcing the variable cpu frequency O_i to a certain value o_i while keeping the other variables as they are. We can estimate the outcome of the intervention by modifying the causal performance model to reflect our intervention and applying Pearl's do-calculus [49], denoted by P(Y | do(O_i = o_i)). However, since many configurations would need to be measured, it is not feasible to perform interventions to test the existence of every edge. Instead, we can significantly reduce the number of configurations by avoiding interventions on nodes with limited causal effects on the performance objective. For this purpose, we rank the causal effects of all existing nodes on latency and observe that only one source-specific edge (policy) is among the top 10 most influential nodes. Therefore, we can select the K nodes with the highest causal effects and combine their Markov blankets (a node's parents, children, and its children's other parents), which eliminates all nodes with lower causal effects. Figure 5(a) shows that pruning the edges helps reach the optimal value 19% faster. Therefore, we require an approach that intervenes only on the top K nodes, based on the source knowledge, in the target environment.
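To illustrate the kind of interventional estimate involved, the following sketch (a toy structural model with hypothetical variable names, not Cameo's actual estimator) computes E[Y | do(X = x)] from observational data via backdoor adjustment and contrasts it with naive conditioning, which the confounder biases:

```python
import random

rng = random.Random(1)

# Toy structural causal model: Z (confounder) -> X, Z -> Y, and X -> Y
# with a true causal effect of X on Y equal to 2.0.
def sample(n=50000):
    data = []
    for _ in range(n):
        z = rng.random() < 0.5                                   # confounder
        x = 1 if rng.random() < (0.8 if z else 0.2) else 0       # option setting
        y = 2.0 * x + 3.0 * (1 if z else 0) + rng.gauss(0, 0.1)  # outcome
        data.append((z, x, y))
    return data

data = sample()

def mean(vals):
    return sum(vals) / len(vals)

# Naive conditioning: E[Y | X = x] -- biased because Z drives both X and Y.
naive = {x: mean([y for z, xx, y in data if xx == x]) for x in (0, 1)}

# Backdoor adjustment: E[Y | do(X = x)] = sum_z E[Y | x, z] P(z).
def do_effect(x):
    total = 0.0
    for zval in (False, True):
        pz = mean([1.0 if z == zval else 0.0 for z, _, _ in data])
        ey = mean([y for z, xx, y in data if z == zval and xx == x])
        total += pz * ey
    return total

print("naive effect:", round(naive[1] - naive[0], 2))           # inflated by Z
print("causal effect:", round(do_effect(1) - do_effect(0), 2))  # near 2.0
```

The naive contrast attributes part of the confounder's influence to X, whereas the adjusted estimate recovers the true causal effect; this is the gap that pruning source-specific edges and intervening on high-effect nodes is meant to avoid.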
Takeaway 2: Employing the rich knowledge in a causal performance model, we can intervene on specific configurations to learn the most about the underlying causal structure and gather the most relevant data under a limited budget.

Cameo Design
In this section, we present Cameo, a framework for performance optimization of highly configurable systems.

Problem Formulation
Let us consider a highly configurable system of interest with configuration space O, system events and performance counter space C, and a performance objective Y. Denote O_i as the i-th configuration option of the system, which can be set to a range of different values (e.g., categorical, Boolean, and numerical). The configuration space is the Cartesian product of all hardware, software, and application-specific options, O = O_1 × O_2 × ⋯ × O_d, where d is the number of options. Configuration options and system events are jointly represented as a vector X = (O, C). We assume that in each environment e ∈ E (a combination of hardware, workload, software, and deployment topology), the variables (X_e, Y_e) have a joint distribution P_e. In the source environment e_s, there are n independent and identically distributed (i.i.d.) observations. The task is to find a near-optimal configuration o*, within a fixed measurement budget B, in the target environment e_t, that results in Pareto-optimal performance:

o* = arg min_{o ∈ O} Y_{e_t}(o),

where O represents the configuration space and Y is the set of performance metrics measured in the target environment e_t.

Cameo Overview
Cameo is a causal transfer learning optimization algorithm that enables developers and users of highly configurable computer systems to optimize performance objectives such as latency, energy, and throughput when the deployment environment changes. Figure 6 illustrates the overall design of our approach. Cameo works in two phases: (i) a knowledge extraction phase and (ii) a knowledge update phase. In the knowledge extraction phase, Cameo first determines the user requirements using a query engine. Then, it learns a causal performance model G_s using cheaper offline performance measurements D_s from the source environment e_s, which is later reused to obtain meaningful information that is shared with the target environment e_t for faster optimization. As performance evaluations in the target are expensive, warm-starting the optimization process by reusing the causal performance model G_s enables us to navigate the configuration space more effectively with fewer interventions in the target. However, relying solely on the source's information is insufficient to effectively optimize performance in the target due to the differences across environments (as shown in Section 2.2). Therefore, in the knowledge update phase, Cameo employs an active learning mechanism that combines the source causal performance model G_s with a new causal performance model G_t learned from a small number of samples, D_t, from the target environment.
Once the two causal performance models are constructed, we simultaneously train two causal Gaussian processes (CGPs) as the surrogate models, CGP_warm and CGP_cold, to model the performance objective Y from G_s and G_t, respectively. The two CGPs operate on different input spaces. CGP_warm works on a reduced configuration space derived from G_s. In contrast, to ensure that any information omitted in the source is not left undiscovered in the target, CGP_cold works on the entire configuration space. We integrate the posterior estimates from both CGP_warm and CGP_cold to develop an acquisition function α that regulates the information from the two CGPs through an interpolation coefficient γ: the larger γ, the more we rely on the information in CGP_warm. Next, we evaluate our acquisition function α for different configurations and select the one with the maximum α value for observation or intervention. The choice between observation and intervention for performance evaluation is guided by an exploration coefficient ε. Finally, we use the newly evaluated configurations to update the causal performance and surrogate models. We continue the active learning loop until the stopping criterion is met (i.e., the maximum budget B is exhausted or convergence is achieved). The pseudocode for our approach is provided in Algorithm 1.

Knowledge Extraction Phase
We next describe the offline knowledge extraction phase.

User query translation. A developer can use Cameo to find configurations that optimize a system's performance objectives in a target environment within a limited experimentation budget B. The developer starts the optimization process by querying Cameo with requests like "How to improve latency within 1 hour or 50 samples?" or "I want to find the configuration with minimum energy for which latency is less than 20 seconds within 45 minutes." The query engine first translates user requests to determine the allowable budget B, the constraints, and the performance goal Y to optimize. In the first query, the budget is 1 hour or 50 samples, the performance objective is latency, and no constraints exist. In the second query, the budget is 45 minutes, the performance objective is energy, and the constraint is a latency of less than 20 seconds. The query translator extracts this information by directly accepting user inputs with some fixed guided keyword directives.
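A minimal sketch of such a keyword-directive translator follows; the directive grammar and function name are our own illustration, not Cameo's actual query engine:

```python
import re

def translate_query(query: str):
    """Parse a guided-keyword user query into (objective, budget, constraints)."""
    q = query.lower()
    # Performance goal: look for "improve/minimize/optimize <metric>" first,
    # then fall back to the first objective keyword present.
    m = re.search(r"(?:minimum|minimi[sz]e|improve|optimi[sz]e)\s+"
                  r"(latency|energy|throughput)", q)
    objective = m.group(1) if m else next(
        (w for w in ("latency", "energy", "throughput") if w in q), None)
    # Budget: a time limit ("within 1 hour" / "within 45 minutes")
    # or, failing that, a sample count ("50 samples").
    budget = None
    t = re.search(r"within\s+(\d+)\s*(minutes?|hours?)", q)
    if t:
        minutes = int(t.group(1)) * (60 if t.group(2).startswith("hour") else 1)
        budget = ("time_minutes", minutes)
    s = re.search(r"(\d+)\s*samples?", q)
    if s and budget is None:
        budget = ("samples", int(s.group(1)))
    # Constraints of the form "<metric> is less than <value> seconds".
    constraints = [
        (c.group(1), "<", float(c.group(2)))
        for c in re.finditer(r"(latency|energy)\s+is\s+less\s+than\s+"
                             r"(\d+(?:\.\d+)?)", q)
    ]
    return objective, budget, constraints

print(translate_query(
    "I want to find the configuration with minimum energy "
    "for which latency is less than 20 seconds within 45 minutes"))
```

A fixed-directive parser like this trades flexibility for predictability, which matches the paper's description of guided keyword directives rather than free-form natural language understanding.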
Learning causal performance model. We begin by building two causal performance models, G_s and G_t, using the offline performance evaluation dataset D_s from the source with n configurations and the performance dataset D_t from the target with a small number of randomly sampled initial configurations, respectively. We use an existing structure discovery algorithm, fast causal inference (FCI), to learn G_s and G_t, which describe the causal relations among configuration options O_i, system events and performance counters C_i, and performance objectives Y. We select FCI as the causal structure discovery algorithm because (i) it accommodates variables of various data types, such as nominal, ordinal, and categorical data, common across the system stack, and (ii) it accommodates the existence of unobserved confounders [18,46,56]. This is crucial because we do not assume absolute knowledge of the configuration space, so there may be configurations on which we cannot intervene or system events we have not observed. FCI operates in three stages. First, we construct a fully connected undirected graph where each variable is connected to every other variable. Second, we use statistical independence tests (Fisher's z test for continuous variables and mutual information for discrete variables) to remove edges between independent variables. Finally, we orient undirected edges using prescribed edge orientation rules [11,12,18,46,56] to produce a partial ancestral graph (PAG). In addition to directed and undirected edges, a PAG also contains partially directed edges that need to be resolved to generate an acyclic directed mixed graph (ADMG), i.e., we must fully orient partially directed edges with the correct edge orientation. This work uses an information-theoretic approach to automatically orient partially directed edges using the LatentSearch algorithm [38] from entropic causal discovery.
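The first two FCI stages can be sketched as a PC-style skeleton search: start from a complete undirected graph and delete edges between variables that a Fisher z (partial-correlation) test judges independent. This simplified sketch conditions on at most one variable and omits the PAG orientation stage:

```python
import itertools
import math
import random

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

def partial_corr(x, y, z, data):
    """Correlation of x and y, optionally controlling for a single z."""
    if z is None:
        return corr(data[x], data[y])
    rxy, rxz, ryz = corr(data[x], data[y]), corr(data[x], data[z]), corr(data[y], data[z])
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def independent(x, y, z, data, crit=3.5):
    """Fisher z test: |stat| below crit => treat x and y as independent."""
    r = max(min(partial_corr(x, y, z, data), 0.9999), -0.9999)
    n, k = len(data[x]), (0 if z is None else 1)
    stat = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - k - 3)
    return abs(stat) < crit

def skeleton(data):
    nodes = list(data)
    edges = {frozenset(p) for p in itertools.combinations(nodes, 2)}
    for x, y in itertools.combinations(nodes, 2):   # order-0 tests
        if independent(x, y, None, data):
            edges.discard(frozenset((x, y)))
    for x, y in itertools.combinations(nodes, 2):   # order-1 tests
        if frozenset((x, y)) in edges:
            for z in nodes:
                if z not in (x, y) and independent(x, y, z, data):
                    edges.discard(frozenset((x, y)))
                    break
    return edges

# Toy chain swappiness -> ipc -> latency (synthetic, for illustration).
rng = random.Random(0)
swap = [rng.gauss(0, 1) for _ in range(3000)]
ipc = [0.8 * s + rng.gauss(0, 0.5) for s in swap]
lat = [0.8 * i + rng.gauss(0, 0.5) for i in ipc]
data = {"swappiness": swap, "ipc": ipc, "latency": lat}
print(sorted(tuple(sorted(e)) for e in skeleton(data)))
```

The swappiness–latency edge is removed because the pair becomes independent once ipc is controlled for; full FCI additionally iterates over larger conditioning sets and then orients the remaining edges.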
Refining causal performance model. Now that we have constructed the causal performance models based on the invariant features, we may be tempted to directly reuse the source model G_s in the target to warm-start the optimization process. However, since some edges are specific to the source (as discussed in Section 2.2), directly reusing G_s will bias the optimization in the target. To avoid wasting the budget allocated for the online optimization procedure, we minimize these biases as much as possible in this offline phase. To do so, we transfer the Markov blanket (Mb) of the top K nodes, ranked according to their causal effects on the performance objective, to eliminate unwanted information.
Higher causal effects indicate a stronger influence of a configuration option on performance. When scaling option values within a constant context, options with higher causal effects become the top features. This is an important step, as we need to rely on the core features that remain invariant when a performance distribution shift happens in order to reason better in the new environment. Theoretically, the Mb of a node is the best solution to the feature selection problem for that node [34]. The variables in the Mb can be confidently employed as causally informative features in the target because they provide a thorough picture of the local causal structure around the variable. Initially, we determine K using the method proposed in [22]. Then we extract the Mb of the K nodes to determine the final G_s that will be reused in the subsequent phase, using the IAMBS algorithm presented in [41].
The IAMBS algorithm constructs an Mb for multiple variables (the top K nodes). It operates by determining whether the additivity property holds for the Mb of the K variables and, if the additivity property is violated, how to proceed, by selectively performing conditional independence tests in a growing and a shrinking phase [41].
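Given a learned graph represented as a parent map, extracting and combining Markov blankets is straightforward; the sketch below takes a plain union over the top-K nodes rather than IAMBS's additivity-aware procedure, and the variable names are hypothetical:

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as a child -> {parents} map:
    its parents, its children, and its children's other parents (spouses)."""
    children = {c for c, ps in parents.items() if node in ps}
    spouses = set()
    for c in children:
        spouses |= parents[c] - {node}
    return parents.get(node, set()) | children | spouses

def combined_blanket(parents, top_k_nodes):
    """Union of the Markov blankets of the top-K nodes, plus the nodes
    themselves -- every other node can be pruned from the search."""
    keep = set(top_k_nodes)
    for n in top_k_nodes:
        keep |= markov_blanket(parents, n)
    return keep

# Hypothetical causal performance model: options/events -> objectives.
parents = {
    "ipc": {"swappiness", "dirty_ratio"},
    "latency": {"swappiness", "cpu_freq"},
    "energy": {"cpu_freq", "gpu_freq"},
}
print(sorted(markov_blanket(parents, "swappiness")))
# children: ipc, latency; spouses: dirty_ratio (via ipc), cpu_freq (via latency)
```

Everything outside the combined blanket is conditionally independent of the selected nodes given the blanket, which is what justifies dropping low-effect nodes from the intervention budget.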

Knowledge Update Phase
In this phase, we use the knowledge gained from the earlier phase to guide the optimization search strategy using the three components described below.
Build causal Gaussian processes. At this stage, we train two surrogate models, CGP_warm and CGP_cold, for the performance objective Y from G_s and G_t, respectively. For this purpose, we use the mathematical formulation proposed in the CBO approach [3] to build a CGP. Unlike GPs, CGPs represent the mean using interventional estimates computed via do-calculus, allowing the surrogate model to capture the behavior of the performance objective better than GPs (as shown in Figure 17), particularly in areas where observational data are not available. Therefore, separately for each CGP obtained from G_s and G_t, we fit a prior with mean m(x) = E[Y | do(X = x)] and kernel

k(x, x') = k_RBF(x, x') + σ(x)σ(x'),

where σ(x) = sqrt(V(Y | do(X = x))), with V representing the variance estimated from the configuration measurements (D_s or D_t) for the particular environment, and k_RBF is the radial basis function kernel k_RBF(x, x') = exp(−||x − x'||² / (2ℓ²)) with lengthscale hyperparameter ℓ. As a result, the shape of the posterior variance enables a proper calculation of the uncertainties about the causal effects (enabling identification of influential configuration options and interactions). We extract the exploration set (ES) for each environment, guided by G_s and G_t, and compute the mean and uncertainty estimates for the configurations in the exploration set.
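A minimal NumPy sketch of this construction: an exact GP posterior whose prior mean and prior standard deviation are supplied by do-calculus estimates, following the CBO-style kernel above. The do-calculus estimates themselves are placeholder functions here, not outputs of a real causal model:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """RBF kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

class CausalGP:
    """GP with prior mean m(x) = E[Y|do(X=x)] and kernel
    k(x, x') = k_RBF(x, x') + sigma(x) * sigma(x')  (CBO-style)."""
    def __init__(self, do_mean, do_std, noise=1e-4):
        self.m, self.s, self.noise = do_mean, do_std, noise
        self.X = None
    def kernel(self, a, b):
        return rbf(a, b) + np.outer(self.s(a), self.s(b))
    def fit(self, X, y):
        self.X = X
        K = self.kernel(X, X) + self.noise * np.eye(len(X))
        self.alpha = np.linalg.solve(K, y - self.m(X))   # for the mean update
        self.K_inv = np.linalg.inv(K)                    # for the variance update
    def predict(self, Xs):
        if self.X is None:   # no target data yet: pure do-calculus prior
            return self.m(Xs), self.kernel(Xs, Xs).diagonal()
        Ks = self.kernel(Xs, self.X)
        mu = self.m(Xs) + Ks @ self.alpha
        var = self.kernel(Xs, Xs).diagonal() - np.einsum(
            "ij,jk,ik->i", Ks, self.K_inv, Ks)
        return mu, var

# Placeholder do-calculus estimates (assumed, for illustration only).
do_mean = lambda x: 0.1 * x                 # stands in for E[Y|do(X=x)]
do_std = lambda x: 0.05 * np.ones_like(x)   # stands in for sqrt(V(Y|do(X=x)))

gp = CausalGP(do_mean, do_std)
X = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 0.3, 0.1])
gp.fit(X, y)
mu, var = gp.predict(np.array([1.0]))
```

Far from any measured configuration, the posterior falls back to the interventional prior rather than to zero, which is what lets CGP_warm remain informative in unobserved regions of the target space.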
Compute acquisition function for sampling. Denote by α_warm(x) and α_cold(x) the single-objective acquisition functions of the two CGPs. For Cameo, we choose expected improvement (EI) as the acquisition function [63], since EI has been shown to perform well in configuration search. EI selects the configuration with the highest expected improvement over the current best interventional setting, separately for e_s and e_t, over all configurations in the respective exploration set:

α_EI(x) = E[max(y* − y(x), 0)],

where y(x) is the surrogate's posterior at x with mean μ(x) = E[Y | do(X = x)], and y* is the optimal value observed thus far. In our implementation, we rank the configurations based on the α_warm(x) scores and then select the ones with the highest α_cold(x) score. Our acquisition function is defined as

α(x) = γ(x) · α_warm(x) + (1 − γ(x)) · α_cold(x),

where γ(x) is an interpolation coefficient that controls the proportion of knowledge used from source and target and depends on an acquisition threshold τ and the expected improvement of a configuration. The above equation shows that when γ(x) is 1, α uses the contribution from α_warm, and it uses α_cold when γ(x) is 0. The interpolation coefficient γ(x) is set to 1 for configurations whose α_warm score is within the threshold τ of α_warm*, the optimal acquisition value obtained from the α_warm scores, and to 0 otherwise. The choice of τ is critical since it balances the knowledge used from the source and the target. We set τ to 0.1, which shows good empirical performance (as shown in Figure 15). Intuitively, the acquisition function should use α_warm for configurations that are near the optimal points. Here, τ is an acquisition threshold hyperparameter used to define near-optimal points w.r.t. α_warm. Therefore, configurations that are closer to the optimum of α_warm (configurations that satisfy the threshold condition) will provide the higher expected improvement for α_warm.
In contrast, configurations that are further away from the optimal points of EI_warm (configurations that do not satisfy the threshold condition) will have a higher expected improvement value under EI_cold. This indicates that such configurations contain options with environment-specific behavior that is not captured or learned correctly by the source causal model, and that the source causal model needs to be updated.
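A small sketch of this gated acquisition follows. The relative-threshold gating rule is our reading of the text (not necessarily the paper's exact formula), and the standard closed-form EI for minimization is used for illustration.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for minimization, given posterior mean/std at a point."""
    if sigma <= 0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))            # standard normal cdf
    return (best - mu) * Phi + sigma * phi

def cameo_acquisition(ei_warm, ei_cold, eps_w=0.1):
    """Gate between warm and cold EI scores per configuration.

    A configuration whose warm EI is within eps_w (relatively) of the best
    warm EI is scored by the warm CGP (zeta = 1); otherwise by the cold CGP.
    """
    ei_warm_star = max(ei_warm)
    scores = []
    for w, c in zip(ei_warm, ei_cold):
        zeta = 1.0 if (ei_warm_star - w) <= eps_w * ei_warm_star else 0.0
        scores.append(zeta * w + (1 - zeta) * c)
    return scores
```

Configurations near the warm optimum keep their warm score, while the rest fall back to the cold model, which is how environment-specific behavior gets a chance to override stale source knowledge.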
Observation-intervention trade-offs. We find a configuration x_{r+1}, for either observation or intervention, for which the acquisition value is maximum. We employ the ε-greedy strategy used by CBO to choose between observation and intervention. Observational data may be used to correctly predict the causal effects of configuration options on the performance objective. On the other hand, estimating consistent causal effects for values outside the observable range requires intervention. The developer must identify the optimal combination of these operations to capitalize on observational data while intervening in regions with higher uncertainty. Following CBO, we define the exploitation coefficient ε as the ratio of the volume of the convex hull of the observational data to the volume of the interventional domain, scaled by the fraction of the observation budget already used (Equation (8)); we then sample a random number u ∼ U(0, 1) and compare it against ε to decide between making a new observation and intervening. We update the convex hull incrementally for computational efficiency.
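The CBO-style rule this section describes could be sketched as follows. We assume SciPy is available for the convex-hull volume, and the direction of the coin flip (intervene with probability ε) is our assumption about the rule, not a detail the text pins down.

```python
import numpy as np
from scipy.spatial import ConvexHull

def exploitation_coefficient(obs_points, domain_volume, n_obs, n_max):
    """Compare observational coverage against the interventional domain.

    The larger the convex hull of the observed configurations relative to
    the full domain (and the more of the observation budget already spent),
    the less new information observing can add.
    """
    hull_volume = ConvexHull(obs_points).volume  # area in 2D, volume in 3D+
    return (hull_volume / domain_volume) * (n_obs / n_max)

def choose_action(obs_points, domain_volume, n_obs, n_max, rng):
    """epsilon-greedy choice between observing and intervening (assumed rule)."""
    eps = exploitation_coefficient(obs_points, domain_volume, n_obs, n_max)
    return "intervene" if rng.random() < eps else "observe"
```

For instance, observations covering a unit square inside a domain of volume 2, with half the observation budget spent, give ε = 0.5 · 0.5 = 0.25.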
Evaluate selected configuration and belief update. We measure the selected configuration x_{r+1} and check whether it satisfies the constraints. If not, we replace the performance objective value with an infinitely high value to force the optimizer to avoid searching in regions of the space where the constraints are not satisfied. We update the causal performance model and surrogate models using the new measurement. We repeat the optimization loop until the maximum budget T is exhausted or convergence is reached, and return the configuration with the minimum Y as optimal.
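The infeasibility penalty described above amounts to wrapping the measurement function; a minimal sketch with hypothetical names:

```python
def penalized_objective(measure, constraints):
    """Wrap a measurement so infeasible configurations look arbitrarily bad.

    measure(config) -> (objective, metrics); each constraint is a predicate
    over the metrics.  Infeasible points receive an effectively infinite
    objective, steering a minimizing optimizer away from that region.
    """
    def wrapped(config):
        y, metrics = measure(config)
        if not all(c(metrics) for c in constraints):
            return float("inf")
        return y
    return wrapped
```

For example, optimizing latency under an energy budget: a configuration whose measured energy exceeds the budget is reported to the surrogate as infinitely slow.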

Evaluation
Subject systems and configurations. We selected five configurable computer systems: a video analytics pipeline, a Cassandra database system, and three deep learning systems (for image, speech, and NLP, respectively). Following configuration guides and other related work [20,27,55], we used a wide range of configuration options and system events that impact scheduling, memory management, and execution behavior. As opposed to prior works (e.g., [58,59]) that only support binary options due to scalability issues, we additionally included discrete and continuous options. We use the recommended values and ranges from system documents for both of these categories of options.
We run each software system with a set of popular workloads that are extensively used in benchmarks and prototypes (more details are provided in Sections 5-7). We use various deployment platforms with distinct resources (e.g., computation power, memory) and microarchitectures to demonstrate our approach's versatility. We use NVIDIA Jetson TX2, TX1, AGX Xavier, and Xavier NX devices for edge deployment. To deploy a particular system on the cloud, we use Chameleon cloud resources, where each node is a dual-socket system running Ubuntu 20.04 (GNU/Linux 6.4) with 2 Intel(R) Xeon(R) processors, 64 GB of RAM, hyperthreading, and TurboBoost. Each socket has 12 cores/24 hyper-threads, with multiple NVIDIA Tesla P100 16 GB and K80 24 GB GPUs for deep learning inference.

Data collection. We measure each system's latency/throughput and energy for each configuration. Following common practice [14,15], we randomly select 2000 configurations for each system for performance measurements to determine the ground truth. We also empirically justify our selection of ground truth in Figure 18 in Appendix A. We repeat each measurement 5 times and record the median to reduce the effect of measurement noise and other variabilities [26].

Experimental parameters. We use a budget of 200 iterations for each optimization method, similar to standard system optimization approaches [67]. We repeat each method's optimization process with 3 different random seeds for reliability. We follow the standard tuning and report parameter values for Smac, Unicorn, ResTune-w/o-ML, ResTune, and cello. More details about experimental choices (Tables 7-13), implementation (Figures 19-23), and hyperparameters (Tables 14-15) are in Appendix A.
Baselines. We compare Cameo against the following:
• SMAC [25]: a sequential model-based configuration optimization algorithm.
• Unicorn [27]: a method that can be used for optimization by transferring the source causal model to the target and later updating it using an active learning strategy.
• Cello [15]: an optimization framework that augments Bayesian optimization with predictive early termination.
• ResTune [67]: an optimization approach that uses multiple models (an ensemble) to represent prior knowledge.
• ResTune-w/o-ML [67]: ResTune without meta-learning, i.e., it learns only from scratch in the target.

Evaluation metrics. When running the methods for the same time limit, we compare the best performance objectives (e.g., latency, throughput, energy) achieved by each method. We also compare their relative error (RE) to summarize the results, using RE = |Y_pred − Y_opt| / Y_opt, where Y_pred is the best value achieved by each method, and Y_opt is the optimal measured value from our observational dataset of 2000 samples. A method with a lower RE value is considered more effective.

Research questions. We evaluate Cameo by answering three research questions (RQs). RQ1: How effective is Cameo in comparison to state-of-the-art approaches under the following environmental changes: (i) hardware change, (ii) workload change, (iii) software change, and (iv) deployment topology change? RQ2: How does the effectiveness of Cameo change when the severity of environmental changes varies? RQ3: How sensitive is Cameo when (i) the number of samples in the source environment varies, (ii) the value of ε_w varies, and (iii) the size of the configuration space increases?
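The RE metric above is straightforward to compute; a small sketch with illustrative method names:

```python
def relative_error(y_pred, y_opt):
    """RE = |y_pred - y_opt| / |y_opt|; lower means closer to the optimum."""
    return abs(y_pred - y_opt) / abs(y_opt)

def rank_methods(best_by_method, y_opt):
    """Return (method, RE) pairs sorted so the most effective method is first."""
    scored = {m: relative_error(y, y_opt) for m, y in best_by_method.items()}
    return sorted(scored.items(), key=lambda kv: kv[1])
```

For example, if the measured optimum latency is 100 ms and a method's best found latency is 105 ms, its RE is 0.05.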

RQ1: Effectiveness in Design Explorations
We consider four types of environmental changes typically occurring when a system is deployed into production to evaluate the effectiveness of Cameo in finding an optimal configuration compared to the state of the art. Table 3 shows the summarized results for each approach, averaged over the different environmental changes considered in this paper. It indicates that Cameo outperforms the other optimization approaches for both latency and energy; e.g., Cameo achieves 3.7× and 5.6× lower RE for latency and energy, respectively, compared to ResTune, the next-best method after Cameo. We describe the experimental setting and the results for the four environmental changes below.

Hardware change. We consider the MLPerf object detection pipeline that uses ResNet-18 for inference on 5k images selected from the 100k test images of the ImageNet dataset [51]. We use TX2 as the source hardware and Xavier as the target hardware. We examine this hardware change because these devices exhibit varying degrees of microarchitectural difference. As shown in Figure 8, Cameo finds the configuration with the lowest values of latency (left) and energy (right). For example, Cameo finds a configuration with 1.6× lower latency than ResTune. We observe a similar trend for energy.

Software change. We consider variants of a natural language processing (NLP) model, BERT [13] and TinyBERT [35], deployed on Xavier in our experiments. We set up a software change by changing the model architecture across environments, using TinyBERT with 3 million parameters as the source and BERT-Base with 109M parameters as the target. As the workload, we perform sentiment analysis on 1000 of the 25,000 reviews from the IMDB test dataset [42]. The results presented in Figure 9 demonstrate that the optimal configurations found by Cameo have 1.1× lower latency and 1.7× lower energy compared to ResTune.

Workload change. We consider the Cassandra database deployed on a Chameleon cloud instance (see Section 4)
while varying workloads to create different source and target environments using the TPC-C benchmark [1]. We use the YCSB workload generator to generate 3 workloads: (i) read-only: 100% reads; (ii) balanced: 50% reads and 50% updates; and (iii) update-heavy: 95% updates and 5% reads. To optimize throughput, we use the read-only workload as the source and each of the remaining two workloads as the target separately.
Results for workload changes are presented in Figure 10.
When the workload changes from read-only to balanced, ResTune outperforms Cameo by finding a configuration with 1.02× higher throughput. Upon further investigation, we found that the distributions of the source and target were relatively similar, and the shared covariance learning in ResTune helped it find a better configuration. Additionally, the knowledge extraction module in ResTune is specifically designed to capture workload behavior correctly, making it more suitable for this workload change scenario. However, as the distribution difference increases, Cameo outperforms ResTune; e.g., for the update-heavy workload, Cameo achieves 1.06× higher throughput than ResTune.

Deployment topology change. To test the effectiveness of Cameo across a deployment topology change, we consider a video analytics pipeline, DeepStream, that uses 4 camera
streams as the workload. Our DeepStream pipeline has four components: (i) an x264 decoder, (ii) a multiplexer, (iii) a TrafficCamNet model with ResNet-18 as the detector, and (iv) an NvDCF tracker, which uses a correlation-filter-based discriminative learning algorithm for online tracking. As the source environment, we adopt a centralized deployment topology in which all four components run on Xavier NX hardware. For the target, we use a distributed deployment topology with two Xavier NX devices, deploying the decoder and multiplexer on one and the detector and tracker on the other. We use Apache Kafka, which uses a binary protocol over TCP, to send the output of the multiplexer to the detector. Our experimental results for the changes in the deployment environment (Figure 11) show that Cameo significantly outperforms the others in finding the configuration with optimal throughput and energy. For example, the optimal configuration discovered by Cameo achieves an improvement of 1.3× and 1.5× for throughput and energy, respectively, over the next-best method.

Summary of observations. Methods based on guided knowledge reuse (Cameo and ResTune) are consistently the top performers over methods that do not reuse knowledge. The steep performance curves during the earlier iterations indicate that warm-starting the optimization process helps quickly reach the region containing good configurations. As a result, in all environmental changes, methods that reuse knowledge from the source outperform Smac, ResTune-w/o-ML, and cello, which do not rely on previous information and cannot find comparable configurations within the budget. Unlike ResTune and Cameo, Unicorn directly uses source information, thereby introducing bias, which must be unlearned. This unlearning is not necessary for Cameo due to its knowledge transfer strategy.
Why is Cameo better? To further explain Cameo's advantages over other methods, we conduct a case study using the same experimental setup mentioned above, where the Object Detection pipeline is deployed on TX2 as the source and Xavier as the target. We summarize the key findings in the following. (i) The combined correctness of the causal performance models allows Cameo to effectively identify the optimal option values. Table 4 shows the configurations discovered by the different approaches, with configuration options ordered by their average causal effect (ACE) on the performance objective, i.e., latency. It is evident that Cameo correctly identifies the maximum number of option values compared to the other approaches (it only misidentified vm.dirty_bytes). This is possible because of the causal models G_s and G_t. Figure 12 shows the iterative changes in structural differences (Hamming distance) between the discovered causal model and the ground truth when using only G_s, only G_t, or the two combined. We find that the Hamming distance is significantly lower when G_s and G_t are combined, indicating that the discovered causal performance model is nearly identical to the ground-truth causal performance model in the target, as shown in Figure 12.
(ii) Cameo utilizes the budget more efficiently by carefully evaluating core configuration options. To better understand the optimization process, we visualize the response surfaces of three sets of option pairs with different degrees of average causal effect (ACE) on latency (Figure 13). The leftmost subfigure of Figure 13 contains options with lower ACE values, while the rightmost contains only options with high ACE values. The middle subfigure contains options with ACE values near the median (the ACE values of the configuration options are provided in Table 4). The rightmost subfigure of Figure 13 shows that the response surfaces of the options with higher ACE values are more complex than those with lower ACE values. Table 4 demonstrates that Cameo can accurately find the optimal values of options with higher ACE values, such as cpu_frequency and dirty_ratio, demonstrating a better understanding of such complex behavior. Figure 13 also shows how Cameo investigated more configurations by varying options with higher ACE values more often than those with lower values. By focusing on the more sophisticated surfaces rather than wasting resources on less effective options, Cameo makes the best use of its budget to understand the performance behavior and navigate the search space.
(iii) Cameo reaches better configurations by achieving better exploration-exploitation trade-offs. From Figure 13 (left and middle), we observe that for options with lower ACE values, Cameo quickly reaches the region of configurations with lower latency within fewer explorations and then focuses on exploitation to quickly determine the optimal configuration. In the rightmost subfigure of Figure 13, configurations evaluated by Cameo cover the largest number of different regions (indicating better exploration). Here, we also observe that Cameo evaluated a higher number of configurations near the optimal configuration (blue) regions of the response surface (indicating better exploitation). Therefore, Cameo has a higher coverage of the configurations evaluated during the optimization procedure compared to other approaches for the core options with higher ACE values. The identification of such core features is central to achieving better exploration-exploitation trade-offs.
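The ACE-based focusing described in (ii)-(iii) can be sketched in a few lines. The ACE formulation here (mean absolute shift of the interventional expectation from the baseline) is one common definition and our illustration, not necessarily the paper's exact formula.

```python
import numpy as np

def average_causal_effect(do_estimates, baseline):
    """ACE of an option: mean absolute shift in the interventional expectation.

    do_estimates maps each option value v to an estimate of E[Y | do(X = v)];
    baseline is E[Y] without intervention.
    """
    return float(np.mean([abs(m - baseline) for m in do_estimates.values()]))

def rank_options(ace_by_option, k):
    """Pick the k options with the largest ACE to focus the search on."""
    return sorted(ace_by_option, key=ace_by_option.get, reverse=True)[:k]
```

Ranking options this way lets the optimizer spend its budget varying the options whose interventions move the objective the most, which is the behavior visible in Figure 13.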

RQ2: Severity of Environmental Changes
The effectiveness of Cameo changes with the amount of distribution shift during environmental changes. Predicting how much the distribution will change when an environmental change occurs is impossible. Therefore, it is critical to understand how sensitive Cameo is to different degrees of severity of change. Following previous work [30], we consider environmental changes of varying severity to answer this question. The scale and the number of changes that occur indicate the severity. For example, an environmental change is more severe if both hardware and workload change than if only hardware changes.
We consider the centralized deployment of DeepStream used in RQ1 as the source and use the following as the targets: (i) Low severity: we change only one category, hardware (AGX Xavier to Xavier NX); (ii) Medium severity: we change two categories, hardware and deployment topology; in this setup, the target runs DeepStream in a distributed fashion on two Xavier NX devices with a decoder with four camera streams as the workload; and (iii) High severity: we change four categories, workload, deployment topology, hardware, and model; our target has DeepStream distributed across two TX2s, with a workload of eight camera streams, and we also change the detector from ResNet-18 to ResNet-50.

Results. As shown in Figure 14, Cameo consistently outperforms the baselines by achieving the maximum throughput for all severities of environmental change. For example, Cameo finds a configuration with 1.3×, 1.5×, and 1.9× higher throughput than ResTune under low, medium, and high severity of changes, respectively. The KL divergence values between the distributions of the source and the low-, medium-, and high-severity environments are 418, 951, and 1329, respectively. Therefore, we conclude that Cameo's advantage over the baselines grows as environmental changes become more severe.
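One plausible way to quantify severity as done above is to estimate the KL divergence between the source and target performance distributions from samples; the histogram-based estimator below is our illustration (the paper does not specify its estimator).

```python
import numpy as np

def kl_divergence(p_samples, q_samples, bins=20, eps=1e-9):
    """Estimate KL(P || Q) from samples via histograms over shared support.

    A rough measure of how severe an environmental change is: compare the
    performance distributions measured in the source (P) and target (Q).
    """
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    p = p / p.sum() + eps   # eps avoids log(0) for empty bins
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))
```

Identical distributions yield a divergence near zero, while a large shift (e.g., a translated distribution) yields a large value, matching the intuition that higher KL values indicate more severe changes.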

RQ3: Sensitivity and Scalability
First, we investigate Cameo's performance under different numbers of source measurements and how this affects the knowledge transferred from the source to the target and overall performance. Second, we determine how the value of ε_w influences Cameo's effectiveness. Finally, we investigate Cameo's scalability in larger configuration spaces.

Sensitivity to the number of source measurements. We consider the MLPerf object detection pipeline deployed on TX2 as the source and the same pipeline on Xavier as the target, varying the number of measurements in TX2 from 30 to 10,000 and comparing the optimal values discovered by different approaches. As shown in Figure 15 (left), increasing the number of source measurements positively influences Cameo's performance compared to ResTune. Including a greater number of source samples increases the danger of bias from the source environment, particularly when the distributions of the two environments are extremely disparate. From this figure, we infer that Cameo prevents those biases from being introduced into the target even as more samples are used to extract knowledge from the source. We also observe that Cameo reaches a plateau (after 2000 samples) faster than ResTune, indicating that Cameo can find better configurations with fewer source samples. Because of Cameo's ability to detect the core features, it can be reliably used across environments without much modification.

Scalability to the number of configuration options. We consider a speech recognition pipeline that uses DeepSpeech [23] for inference. As the workload, we use 2 hours of data extracted from the 300-hour test set of the Common Voice dataset for 5 languages (English, Arabic, Chinese, German, and Spanish). We run inference on Chameleon cloud instances with one P100 GPU for the source and one K80 GPU for the target. To evaluate the scalability of our approach to a colossal configuration space [47], we increase the number of variables from 4 to 100 and measure the causal discovery time and the time per iteration using 300 samples in the target. Figure 16 indicates that the discovery time and time per iteration increase sub-linearly. Therefore, Cameo is scalable to a large number of configuration options and events. The scalability of Cameo can be attributed to the sparsity of the causal graph, which leads to a small exploration set for the acquisition function.

Additional Related Work
Performance optimization in configurable systems. BO-based optimization methods discover the best configuration suited for a particular application and platform [44], and have been used to streamline compiler autotuning [7]. SCOPE [37] improves system performance and reduces safety constraint breaches by collecting system activity and switching from the resource space to the execution space for exploration. cello [15] uses prediction-based early termination of sample collection via censored regression. Siegmund et al. [54] proposed a performance-influence model for configurable systems to understand the influence of configuration options on system performance using machine learning and sampling heuristics. However, these approaches are platform-specific and unsuitable when a distribution shift occurs due to environmental changes. In comparison, Cameo tackles the shift by transferring causal knowledge.

Transfer learning for performance modeling. To accelerate optimization using transfer learning, it is essential to identify what knowledge should be reused. Jamshidi et al. [31] showed that when environmental changes are small, knowledge can be transferred to predict performance, whereas when environmental changes are severe, only knowledge for efficient sampling can be transferred. Krishna et al. [39] determined the most relevant source of historical data to optimize performance modeling. Valov et al. [57] proposed a novel method to approximate and transfer the Pareto frontiers of optimal configurations across different hardware environments. Ballesteros et al.
[5] proposed a dynamic evolutionary transfer learning algorithm to generate effective quasi-optimal configurations at runtime. All these techniques incorporate transfer learning based on correlational statistics (ML-based). However, Section 2.1 shows that ML-based models tend to capture spurious correlations. In comparison, Cameo uses causal models, which identify invariant features despite environmental fluctuations.

Usage of causal analysis in configurable systems. Causal analysis has been used for various debugging and optimization tasks in configurable systems. Fariha et al. [17] proposed AID, which intervenes through fault injection to pinpoint the root cause of intermittent failures. Johnson et al. [36] proposed Causal Testing to analyze and fix software bugs by identifying a set of executions that contain important causal information. Dubslaff et al. [16] proposed a method to compute feature causes efficiently and used them to facilitate root cause identification and the estimation of feature effects and interactions. The causal analysis in these works operates solely within one environment, whereas we focus on efficiently transferring causal knowledge from one environment to another.

Limitations
Causal graph error. Causal discovery is an NP-hard problem [9]. Thus, the learned causal graphs might not be the ground-truth causal graphs and do not always reflect the true causal relationships. However, such causal graphs can still be leveraged to achieve better performance than ML-based approaches in system optimization and debugging tasks, as they avoid capturing spurious correlations [16,27].

Noisy measurements. The system performance measurements are noisy, which can affect the results. To mitigate this, we take each configuration's median across 5 runs.

Higher model computation time. Due to the use of two CGPs, Cameo takes more time than the baselines. For example, on average, Cameo takes 27.1s per iteration versus 19.4s per iteration for ResTune (see Table 5). However, this time is usually small compared to the time required for each evaluation (44s on average in our experiments).

Conclusion
The goal of performance optimization of software systems is to minimize the number of queries required to accurately optimize a target black-box function in production, given access to offline performance evaluations from the source environment and a significantly smaller number of performance evaluations from the target environment. When the environment changes, existing ML-based optimization methods tend to be sub-optimal since they are vulnerable to spurious correlations between configuration variables and the optimization goals (e.g., latency and energy). In this work, we propose Cameo, an algorithm that overcomes this limitation of existing ML-based optimization methods by querying data based on a combination of acquisition signals derived from training two causal Gaussian processes (CGPs): a cold-CGP operating in the input domain trained on the target data, and a warm-CGP that operates in the feature space of a causal graphical model pre-trained on the source data. The decomposition dynamically controls the reliability of information derived from the online and offline data, and the use of CGPs helps avoid spurious correlations. Empirically, we demonstrate significant performance improvements of Cameo over existing methods on real-world systems.

A GP is fully specified by its mean function m(x) and its covariance function k(x, x'). The kernel or covariance function k captures regularity in the form of the correlation of the marginal distributions f(x) and f(x'). After the surrogate model outputs the predictive mean and uncertainty for the unseen configurations, Cameo needs an acquisition function to select the best configuration to sample. A good acquisition function should balance the trade-off between exploration and exploitation.

Table 16 reports the summarized results compared to cello, as this is the only baseline that incorporates constraints. We find that, in addition to latency optimization under energy constraints for workload changes, Cameo consistently outperforms cello under hardware, software, and deployment changes; for example, under latency constraints, Cameo finds configurations that are 1.3× and 1.5× better for software and topology changes, respectively.

Figure 2 :
Figure 2: (a)-(b) The relationship between IPC and latency reverses from source (TX2) to target (Xavier), while the relationship between swappiness (values denoted by colors) and latency stays invariant. (c) The true causal relationship among the relevant variables.

Figure 4 :
Figure 4: Combining the top K nodes' Markov blankets eliminates the wrong biases (shown as black squares).

Figure 5 :
Figure 5: Pruning edges with a Markov blanket identifies the optimal configuration faster. In our example, if we select K=6 with Markov blankets, then the wrong biases, migrations->syscalls enter and migrations->llc stores (the nodes marked in black in Figure 5(b)), are eliminated. Figure 5(a) shows that pruning the edges helps to reach the optimal value 19% faster. Therefore, we require an approach that relies on intervening only on the top K nodes, based on the source knowledge, in the target environment.

Figure 7 :
Figure 7: Refining the causal performance model from the source to eliminate unwanted information.

Figure 8 :
Figure 8: Compared to other state-of-the-art methods, Cameo identifies the configurations with reduced latency (left) and energy (right) when hardware changes take place.

Figure 9 :Figure 10 :
Figure 9: Cameo finds the configurations with the lowest latency (left) and energy (right) when the software changes.
Figure 10: For the read-only to balanced workload change, ResTune slightly outperforms Cameo in finding configurations with higher throughput (left). For the read-only to update-heavy workload change, Cameo dominates the other approaches in finding the optimal configuration with higher throughput (right).

Figure 11 :
Figure 11: Cameo achieves maximum effectiveness in finding the configurations with the highest throughput (left) and best energy (right) when the deployment topology changes.

Figure 12 :Figure 13 :
Figure 12: The causal performance models become more accurate with increasing iterations. The correctness of G_s and G_t when combined helps Cameo detect the optimal configuration more effectively than the others. A lower Hamming distance indicates a smaller difference from the ground-truth causal performance model in the target.

Figure 14 :Figure 15 :
Figure 14: Cameo achieves higher throughput when different severity of environmental changes take place.

Figure 16 :
Figure 16: As the number of configuration options and system events increases, discovery time (left) and total time per iteration (right) increase sub-linearly.

Figure 18 :
Figure 18: After 2000 configurations, the value of the optimal performance objective reaches a plateau as the number of configurations continues to rise.

Tables 6
to 15 and Figures 20 to 23.

Figure 21 :
Figure 21: Experimental setup when the type of workload differs with Cassandra, where the source uses a Read Only workload while the target uses a Balanced and an Update Heavy workload, separately.

Figure 22 :
Figure 22: Experimental setup for our experiments where the deployment topology is changed from centralized to distributed in the target, using two Xavier NX devices.

Figure 20 :
Figure 20: Experimental setup when a software change takes place from TinyBERT to BERT-Base in the target for an NLP system.

Table 1 :
Comparison of Cameo with state-of-the-art system performance optimization approaches.
Comparison of Cameo with state-of-the-art system performance optimization approaches.

Vol(C(D_O)) represents the volume of the convex hull of the observational data, and Vol(x ∈ O) gives the volume of the interventional domain. N_max represents the maximum number of observations the developer is willing to collect in a particular environment, and n is the current size of D_O. The interventional space is under-covered when the volume of the observational data Vol(C(D_O)) is small relative to the interventional domain; in that case, we must perform interventions to explore regions of the interventional space not covered by observational data. On the other hand, if Vol(C(D_O)) is large but n is still small relative to N_max, we make further observations, because consistent estimates of the causal effects can only be achieved with more observations.

Algorithm 1: Cameo
Require: Offline source dataset D_s, initial target dataset D_t, configuration space O, total budget T, threshold ε_w, performance objective Y, constraint Ψ.
Knowledge Extraction Phase
1: Construct a causal performance model from G_s, G_t using D_s, D_t, respectively.
2: Extract the top K nodes from G_s in terms of causal effect on the performance objective.
3: Extract the Markov blanket of the top K nodes to construct a new, updated G_s.
Knowledge Update Phase
4: Initialize CGP_warm and CGP_cold.
5: i = 0
6: while i ≤ T do
7: Compute the exploitation coefficient ε using Equation (8) and sample a random number u ∼ U(0, 1) to choose between observation and intervention.
8: If observing: make a new observation (x_{i+1}, y_{i+1}, c_{i+1}).
9: Otherwise: set the interpolation coefficient ζ, set the acquisition function, pick a new configuration, and intervene on the system to obtain an interventional measurement (x_{i+1}, y_{i+1}, c_{i+1}).

Table 3 :
Summarized results averaged over all environmental changes.

Table 5 :
Comparison of computation time in seconds per iteration for the baselines compared to Cameo. Lower is better.

Table 13 :
Performance system events and tracepoints.

Table 14 :
Hyperparameters for DNNs used in Cameo.

Table 15 :
Hyperparameters for FCI used in Cameo.