BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

We introduce the Bayesian Compiler Optimization framework (BaCO), a general purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO provides the flexibility needed to handle the requirements of modern autotuning tasks. In particular, it handles permutation, ordered, and continuous parameter types along with both known and unknown parameter constraints. To reason about these parameter types and efficiently deliver high-quality code, BaCO uses Bayesian optimization algorithms specialized towards the autotuning domain. We demonstrate BaCO's effectiveness on three modern compiler systems: TACO, RISE & ELEVATE, and HPVM2FPGA for CPUs, GPUs, and FPGAs respectively. For these domains, BaCO outperforms current state-of-the-art autotuners by delivering on average 1.36×-1.56× faster code with a tiny search budget, and BaCO is able to reach expert-level performance 2.9×-3.9× faster.


1,10 {erik.hellsten, luigi.nardi}@cs.lth.se, 2 arturluis94@gmail.com, 3 j.le@wwu.de, 4,5,7,9 {rubensl, owhsu, kjolstad, kunle}@stanford.edu, 6 aejjeh@illinois.edu, 8 michel.steuwer@ed.ac.uk

Introduction
Modern compilers are rapidly evolving to keep pace with the growing range of increasingly specialized hardware targets, as well as the ever-changing domains of interest. A recent trend is to separate policy (what to compute) from mechanism (transformations and code generation describing how to compute) by using scheduling languages. Prominent examples of this paradigm include Halide [44], TVM [7], TACO [49], and RISE & ELEVATE [19,55]. This design pushes the optimization task of finding good schedules outside of the compiler core, where it can be done manually or automatically by an autoscheduler. Scheduling languages may express more complex optimization spaces and thus require more advanced autotuning features to tackle the autoscheduling task effectively and efficiently. Modern hardware backends, such as GPUs in RISE & ELEVATE [55] and FPGAs in HPVM2FPGA [33,13], further increase the complexity of the relevant optimization spaces.
The separation of concerns between policy and mechanism in compilers exposes a great opportunity. If we can design a portable autoscheduler that is effective across many compilers, like the design shown in Fig. 1, then we can reduce the complexity of the overall ecosystem. New compilers get an autoscheduler with minimal effort, and improvements in the autoscheduler automatically benefit all compilers and their subsequent domains and backends.
However, a portable autoscheduler must be designed with a rich input language to allow users to accurately describe the search space exposed by their particular compiler. This autoscheduling search space is determined by the product of the hardware target, the compiler's scheduling language features, and the configuration tuning parameters. In modern compilers, this search space is often complex, including both continuous parameters (e.g., real-valued tuning parameters) and discrete parameters, which are broken down into integers (e.g., tiling factors), permutations (e.g., loop reordering), ordinals/ordered categories (e.g., unroll factors), and categoricals/unordered categories (e.g., parallelization schemes). These parameter types (real, integer, permutation, ordinal, categorical) are often abbreviated to RIPOC [41,2,22].
Even the large class of search spaces that can be generated as the Cartesian product of these parameters is often inadequate. Scheduling parameters frequently depend on the settings of other parameters, leading to constraints on the scheduling space. One such example is a loop bound that must be an exact multiple of a given tiling factor. We refer to these as known constraints, which are provided to the autoscheduler ahead of time. Other constraints are initially unknown and must be learned during autoscheduling. An example is learning which sets of parameters generate programs that adhere to hardware constraints, such as avoiding out-of-memory errors on a GPU. Such constraints are often referred to as hidden constraints. For a general autotuning framework to be efficient and portable across a multitude of compilers, it needs to support as many of these features as possible.
Once able to express a complex search space, the autoscheduler must be effective and efficient in finding a good schedule within this space. For optimizing compilers, the performance of the generated code is of primary concern. Therefore, the autoscheduler will invariably use some sort of search combined with a cost model to evaluate points of the search space. The cost model could be analytic or data-driven, but the most accurate cost model is to generate and run the code on its target platform. For a general autoscheduler used across diverse compilers, a cost model based on running the code makes it easy to use the autoscheduler with a new compiler. We refer to empirical autoschedulers, whose cost model is to run the actual code generated by the compiler, as autotuners.
To achieve composability, and to work effectively and easily across a diverse set of compilers, it is vital for an autoscheduler to treat each compiler as a black-box system. The autoscheduler's job is then to optimize the black-box system using the smallest possible budget of trials, i.e., evaluations of the black-box system.
Many successful autotuning frameworks have been proposed, some of which are listed in Table 1. These frameworks have helped deliver high-performance software in the past. However, they do not support all features required to effectively search over the complex search spaces described by diverse scheduling languages across modern domain-specific compilers targeting various hardware backends. For example, Table 2 shows the features required by three modern compilers (TACO, RISE & ELEVATE, and HPVM2FPGA), and Table 1 shows that none of 14 recently proposed autotuners support all required features in the manner we define below.

Hence, we propose BaCO, a novel general autotuning framework optimized towards the autotuning of modern compilers, which efficiently handles all features mentioned above. BaCO does not require any user-provided cost model but instead learns from observations obtained by running the generated code throughout the optimization procedure. The support for sophisticated search spaces and online learning means that BaCO finds good schedules in fewer iterations than existing autotuners, while being easy to use. Notably, we do not adapt BaCO to individual compilers or applications but show that it yields expert-level performance out of the box for a wide range of applications. Our contributions are:
• The first Bayesian autotuning framework that supports all RIPOC features, including permutation types, through the definition of separate distance metrics, thus improving search performance (Sec. 3 and Sec. 4).
• The integration of a feasibility model for hidden constraints, simplifying the portability to new compiler backends (Sec. 4).
• The first system to use the chain-of-trees technique in a Bayesian optimization setting for portable autotuning on sparse search spaces (Sec. 4).
• Applying autotuning to three distinct compiler frameworks for different domains and hardware targets, along with a survey of the autotuning challenges of those three recent domain-specific compilers (Sec. 2).
The end result is that BaCO generates expert-level code significantly faster than the state of the art. We demonstrate the effectiveness and robustness of BaCO across three real-world compilers and code generation systems targeting CPUs, GPUs, and FPGAs (Sec. 5). BaCO reliably produces good schedules across different compilers without any customization for each compiler, such as hyperparameter tuning or custom constraint filtering. On TACO, it reaches expert performance 3.15×-5.0× faster than the state of the art; on RISE & ELEVATE, it achieves 1.35×-1.58× better performance with a tiny autotuning budget; and on HPVM2FPGA, it reaches peak performance 2.43×-2.77× faster.

Complexity of Modern Autotuning
To develop the next generation of general-purpose autotuning frameworks, we need to better understand the real-life challenges faced by various modern compiler frameworks. Therefore, we investigate the autotuning features needed by the TACO, RISE & ELEVATE, and HPVM2FPGA compiler frameworks. This allows us to identify an autotuning framework that is able to generalize across a wide spectrum of compilers and backend targets. As we shall see, this ideal general-purpose autotuning framework needs to support the features described in Table 2, which include a wide range of parameter types and both hidden and known constraints.
The Tensor Algebra Compiler (TACO). TACO [26] is the state-of-the-art compiler for sparse tensor algebra. It generates high-performance code for tensor operations expressed in a high-level Einstein notation, such as the sampled dense-dense matrix multiplication (SDDMM) computation represented as A(i, j) = B(i, j) * C(i, k) * D(k, j). A particular strength of TACO is its capability to generate code for a large variety of sparse tensor formats [8].
TACO's scheduling language defines an iteration space transformation framework that dictates how to traverse a tensor stored in any particular format [49]. This provides a way to introduce optimization transformations, such as tiling, parallelization, vectorization, loop reordering, and more. An autotuning framework selecting the optimizations exposed by the scheduling language needs to provide not just the traditional real, integer, ordinal, and categorical parameters provided by most frameworks in Table 1, but also permutation parameters for selecting loop reordering. Typically, optimization is performed inside the compiler and controlled by a heuristic, but in TACO, as well as in similar compilers with scheduling languages, it is exposed as a tunable parameter. These optimization parameters also need to follow known constraints that TACO provides. An example is the constraint that TACO places on loop reordering variables to ensure concordant traversal.
RISE & ELEVATE. RISE [55] and ELEVATE [19] are a powerful combination of compiler and scheduling languages. Computations are described in the RISE [55] language using well-known data-parallel patterns like map and reduce in the spirit of LIFT [20,56]. Optimizations are applied and described in the ELEVATE [19] scheduling language as compositions of semantics-preserving rewrite rules. The optimized RISE program is compiled to high-performance CPU or GPU code.
Transformations, such as loop tiling, may introduce numerical tuning parameters, such as a tile size, which are often constrained by other numerical values, such as loop bounds. When automatically optimizing RISE programs, an explorative rewrite process is performed that speculatively applies program transformations. To evaluate the performance of a transformed program, the system relies on an autotuning framework to pick all numerical parameters while respecting all known parameter constraints that the system can collect automatically and provide to the autotuning framework. Compiling for GPUs also introduces hidden constraints for the autotuning framework, such as choosing only parameter values that result in a program fitting within the tight register and memory limits. When these constraints are not satisfied, the compiler generates code that will fail to execute. Therefore, the autotuning framework must be able to learn these hidden constraints automatically.
HPVM2FPGA. HPVM2FPGA [13] is a compiler that enables hardware-agnostic programming of field-programmable gate arrays (FPGAs). The compiler uses sophisticated optimizations, coupled with design space exploration (DSE), to automatically tune and generate well-performing FPGA designs from programs that have not been written by hardware and FPGA experts. HPVM2FPGA is part of the Heterogeneous Parallel Virtual Machine (HPVM) compiler infrastructure [33,12], which provides a retargetable virtual ISA and compiler IR for programming heterogeneous systems. During HPVM2FPGA's DSE, compiler transformations such as loop unrolling, greedy loop fusion, argument privatization, and kernel fusion are explored. HPVM2FPGA generates its parameter space automatically through a static analysis of the IR, and the design space varies depending on the size of the application being compiled. The majority of the parameters are boolean parameters, with hidden constraints among them, making it challenging to explore the space efficiently.

The BaCO Framework
We introduce Bayesian Compiler Optimization (BaCO)¹, a Bayesian optimization (BO) framework that learns high-performing autoscheduling strategies. BaCO thrives in a small-data world where configuration evaluations are costly, either due to high runtimes of the kernel or expensive simulations of code generation passes. BaCO is backend-agnostic, and it can be equally applied to CPU, GPU, and FPGA compilers. Building on the BO paradigm, BaCO is centred around a configuration recommendation-evaluation loop: it recommends promising new configurations that are subsequently scheduled and evaluated by the corresponding compiler toolchain. The evaluation results are used to fit two predictive models: one modelling the predicted value and one modelling the predicted probability of feasibility of new configurations. To initialize the two models, the procedure starts with an initial phase, where the first few configurations are sampled uniformly at random from the search space.
However, for BO to reach its full potential, it needs to be customized for autotuning tasks. This section explains the various modules of BaCO's architecture, shown in Fig. 2, whereas further specialization towards autotuning search spaces is emphasized in Sec. 4.
¹ Baco is Italian for bug.

Bayesian optimization
Bayesian optimization is a steadily growing methodology for solving black-box optimization problems. Those are problems where the objective function $f(\mathbf{x})$ can only be accessed point-wise through expensive evaluations. At the core of BO is the use of a surrogate model, which estimates the objective function. This helps with making well-informed decisions about which configurations to evaluate next. The goal is to find a good configuration in as few evaluations as possible. The most common choices of surrogate model are Gaussian Process (GP) or Random Forest (RF) predictors, which both provide useful uncertainty estimates. For any given point $\mathbf{x}$, the model provides a mean and a variance, which are used to balance exploration and exploitation. This trade-off is quantified by an acquisition function. Common examples are Expected Improvement and Lower Confidence Bound. The surrogate model is dynamically updated to learn from observations, creating a feedback loop where the BO framework proposes new points that are then evaluated. The information from the evaluation is subsequently used to train the model. BO was historically developed for continuous compact domains; its extension towards more exotic search spaces, as used in this work, is an active area of study in the BO community.

Surrogate models over compiler domains
Choice of probabilistic model. One core element of an efficient BO algorithm is an accurate surrogate model [15]. While complex parameter domains have little impact on less intricate methods such as random sampling, the success of BO depends greatly on clever handling of such parameters. While Random Forests have traditionally been considered the natural choice of surrogate model over discrete domains [23,41,4], recent studies show that a careful implementation of Gaussian Processes yields superior accuracy [18,9]. However, to achieve the true potential of GPs in autotuning and DSE applications, significant customization of the GP is needed. This customization is explained in detail in the following sections. Sec. 5.3 shows the impact of some of these major design choices and a comparison between GPs and RFs.

GP kernel similarity function
A key feature in autotuning and DSE is the mixed-variable search space. Thus, the kernel needs to combine distance measures over different parameter types. We propose the weighted Euclidean norm $\|\mathbf{d}\|_2^2 = \sum_{i=1}^{D} (d_i / l_i)^2$ over the vector of individual distance measures $\mathbf{d}$ as a unified distance measure. $D$ denotes the dimension of the search space, i.e., the number of parameters being optimized, and $l_i$ are the horizontal lengthscales, learned using maximum likelihood estimation (MLE) [40], weighting the different parameters. We use the 5/2-Matérn kernel [47], given by
$$k(\mathbf{x}, \mathbf{x}') = \sigma \left(1 + \sqrt{5}\,\|\mathbf{d}\|_2 + \tfrac{5}{3}\,\|\mathbf{d}\|_2^2\right) \exp\left(-\sqrt{5}\,\|\mathbf{d}\|_2\right), \qquad d_i = d(x_i, x'_i), \tag{1}$$
where $d(x_i, x'_i)$ denotes the distance between $x_i$ and $x'_i$ (described in Sec. 4.1), as this has been shown to be efficient in many real-life applications [27,58]. To increase stability, we assume that the value observed in each evaluation, $y(\mathbf{x})$, is perturbed by some normally distributed noise [15], such that $y(\mathbf{x}) = f(\mathbf{x}) + \varepsilon$ and $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon)$.
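As a minimal sketch of how the kernel value could be computed from the per-parameter distances, assuming the standard 5/2-Matérn form with an outputscale multiplier (the function names and the example distance vectors below are illustrative, not taken from BaCO's implementation):

```python
import math

def weighted_norm(dists, lengthscales):
    """Weighted Euclidean norm: sqrt(sum_i (d_i / l_i)^2)."""
    return math.sqrt(sum((d / l) ** 2 for d, l in zip(dists, lengthscales)))

def matern52(dists, lengthscales, outputscale=1.0):
    """5/2-Matern kernel value for a vector of per-parameter distances."""
    r = weighted_norm(dists, lengthscales)
    s = math.sqrt(5.0) * r
    return outputscale * (1.0 + s + 5.0 * r * r / 3.0) * math.exp(-s)

# Hypothetical distance vectors for three parameters.
k_same = matern52([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
k_near = matern52([0.5, 0.0, 1.0], [1.0, 1.0, 2.0])
k_far = matern52([3.0, 1.0, 4.0], [1.0, 1.0, 2.0])
```

Identical configurations get the full outputscale as similarity, and the similarity decays as the weighted distance grows. A larger lengthscale for a parameter shrinks its contribution to the norm, which is how the model expresses that the parameter matters less.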
GP hyperparameter optimization. A crucial element in effective optimization using GPs is to find good hyperparameters for the model. Especially important are the lengthscales $l_i$ presented in Eq. (2), which balance the relative importance of the different parameters. The remaining hyperparameters are the outputscale $\sigma$ in Eq. (1) and the magnitude of the Gaussian noise $\sigma_\varepsilon$. BaCO optimizes the hyperparameters using a multistart gradient descent approach, which first uniformly samples a number of possible hyperparameter settings, then chooses the fraction of those with the highest likelihood, and optimizes them individually using L-BFGS [37].
Discrete parameter spaces pose a number of practical challenges when fitting the GP model. One such challenge is that the hyperparameter optimization method described above frequently assigns close-to-zero lengthscale values to some parameters. In practice, this means that configurations which take different values for those parameters have close to zero similarity, making the GP behave as a sparse model. This is undesirable as it reduces the model's expressive power. To address this artifact of GP modeling, as well as to stabilize the hyperparameter selection, BaCO uses gamma priors [40] for the lengthscales. These priors are chosen to be flexible while cutting out extreme hyperparameter settings. In practice, stabilizing the lengthscales means that different parameters are given more equal importance, preventing certain parameters from becoming too dominant or too insignificant due to model over-fitting. Gamma prior distributions are chosen as they have positive support, can be made reasonably concentrated, and have long tails towards both zero and infinity. Other good alternatives with similar properties would be the log-normal or inverse-gamma distributions. By normalizing the input data, BaCO can use a single set of priors that works well for the majority of parameters. Note that this artifact arises because many configurations share identical parameter values in discrete spaces, which rarely occurs when working with continuous parameters.
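To illustrate how a gamma prior discourages extreme lengthscales, the following sketch adds the log-density of a gamma prior to a log-likelihood value. The shape and scale values are hypothetical choices for illustration; the text does not state which prior parameters BaCO uses.

```python
import math

def gamma_log_pdf(x, shape, scale):
    """Log-density of a Gamma(shape, scale) distribution at x > 0."""
    if x <= 0.0:
        return -math.inf
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))

def log_posterior(log_likelihood, lengthscales, shape=2.0, scale=1.0):
    """Log-likelihood plus the gamma log-prior of every lengthscale."""
    return log_likelihood + sum(gamma_log_pdf(l, shape, scale) for l in lengthscales)

# Same (hypothetical) data fit, three candidate lengthscale settings.
tiny = log_posterior(-10.0, [1e-4, 1.0])
moderate = log_posterior(-10.0, [1.0, 1.0])
huge = log_posterior(-10.0, [100.0, 1.0])
```

With equal likelihood, the moderate setting scores highest, so the optimizer needs substantially better data fit before it accepts a near-zero or very large lengthscale.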

Acquisition function
The acquisition function quantifies the anticipated utility of evaluating a new point. We use the Expected Improvement (EI) acquisition function [25], which balances exploration and exploitation. Autotuning and DSE are characterized by both discrete search spaces and noisy function evaluations, in which case we observe that standard EI has a tendency to overly prioritize re-sampling points with good values. To avoid this unintended behavior, we propose a modified EI acquisition function which predicts the expected improvement of observing a noise-free evaluation of the black-box function. Computing the EI without considering the noise in the GP makes sampling repeated points less likely.
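The effect can be sketched with the standard closed-form EI for minimization. This is one possible reading of the modification, under the assumption that "noise-free" means leaving the observation-noise variance out of the predictive variance; the parameter names are illustrative.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, latent_var, best, noise_var=0.0):
    """Closed-form EI for minimization; noise_var=0 gives the noise-free variant."""
    std = math.sqrt(latent_var + noise_var)
    gap = best - mean
    if std == 0.0:
        return max(gap, 0.0)
    z = gap / std
    return gap * normal_cdf(z) + std * normal_pdf(z)

# An already-evaluated point: predicted mean equals the incumbent and the
# latent function is almost certain there (hypothetical numbers).
ei_with_noise = expected_improvement(1.0, 1e-6, best=1.0, noise_var=0.04)
ei_noise_free = expected_improvement(1.0, 1e-6, best=1.0)
```

Including the noise variance leaves a sizeable EI at the already-sampled point, whereas the noise-free variant drives it close to zero, so the search moves on to unexplored configurations.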
BaCO optimizes the acquisition function by multi-start local search. Initially, a large number of configurations are sampled uniformly at random, of which the best configurations are chosen as starting points for the local search. The neighbours of a configuration are defined as all configurations that can be reached by modifying a single parameter.
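A minimal sketch of this procedure is given below, assuming a greedy best-neighbour move rule and a toy stand-in for the acquisition function; the sample counts, the move rule, and the example domains are assumptions made for illustration.

```python
import math
import random

def neighbours(config, domains):
    """All configurations that differ from `config` in exactly one parameter."""
    for i, values in enumerate(domains):
        for v in values:
            if v != config[i]:
                yield config[:i] + (v,) + config[i + 1:]

def local_search(start, domains, acquisition):
    """Greedy ascent: move to the best neighbour until none improves."""
    current, score = start, acquisition(start)
    while True:
        candidate = max(neighbours(current, domains), key=acquisition, default=None)
        if candidate is None or acquisition(candidate) <= score:
            return current, score
        current, score = candidate, acquisition(candidate)

def multi_start_local_search(domains, acquisition, n_samples=50, n_starts=5, seed=0):
    rng = random.Random(seed)
    # Sample uniformly at random, keep the best samples as starting points.
    samples = [tuple(rng.choice(values) for values in domains) for _ in range(n_samples)]
    starts = sorted(samples, key=acquisition, reverse=True)[:n_starts]
    return max((local_search(s, domains, acquisition) for s in starts),
               key=lambda result: result[1])

# Hypothetical two-parameter space and a toy acquisition peaking at (3, 8).
domains = [(1, 2, 3, 4, 5), (2, 4, 8, 16, 32)]
toy_acquisition = lambda c: -((c[0] - 3) ** 2 + (math.log2(c[1]) - 3) ** 2)
best_config, best_score = multi_start_local_search(domains, toy_acquisition)
```

Because every move strictly increases the acquisition value and the space is finite, each local search terminates at a configuration that no single-parameter change can improve.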

Adapting to Exotic Search Spaces
When implementing an efficient autotuning framework, effectively handling all of the search space features is key. As BaCO is built around a GP predictive model, careful design of the distance metrics used for different variable types is of additional importance. In this section we study the intricacies of the different parameter types as well as how to handle known and hidden constraints.

Parameter types
Continuous, integer, and ordinal parameters. These types of parameters have the property that their values are comparable, i.e., they can be ordered using the greater-than-or-equal relation. This translates naturally into a distance metric, and in particular we use the absolute difference, $d(x_i, x'_i) = |x_i - x'_i|$. However, certain such parameters are innately exponential in nature, such as tile size parameters. In that case, we use the Euclidean distance over a log-transformed space instead, $d(x_i, x'_i) = |\log(x_i) - \log(x'_i)|$. The log transformation often more accurately describes the relationship between values. Consider tile sizes as an example. We expect the tile sizes 2 and 4 to be roughly as similar to each other as the tile sizes 512 and 1024. However, the tile sizes 512 and 514 would be much more similar than the pairs above.
Categorical parameters. Categorical parameters differ from ordinals in that they have no inherent order. Here, we use the Hamming distance, defined as $d_h(x_i, x'_i) = \mathbb{1}_{x_i \neq x'_i}$, where $\mathbb{1}$ is the indicator function, which returns 1 if $x_i \neq x'_i$ and 0 otherwise. In other words, the Hamming distance only considers whether the parameter values are identical or not. The scale here is not relevant as the distance is weighted by the lengthscale $l_i$ in Eq. (2).
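The per-parameter distances for numeric, exponential, and categorical parameters can be sketched as follows. The base-2 logarithm is an assumption made here for readability; any base only rescales the distance, which the lengthscale absorbs.

```python
import math

def numeric_distance(a, b):
    """Absolute difference for continuous, integer, and ordinal parameters."""
    return abs(a - b)

def log_distance(a, b):
    """Absolute difference in log space for exponential parameters (base 2 assumed)."""
    return abs(math.log2(a) - math.log2(b))

def hamming_distance(a, b):
    """Categorical distance: 0 if the values are identical, 1 otherwise."""
    return 0.0 if a == b else 1.0

# The tile-size example from the text.
small_pair = log_distance(2, 4)
large_pair = log_distance(512, 1024)
close_pair = log_distance(512, 514)
```

In log space the pairs (2, 4) and (512, 1024) are equally far apart, while 512 and 514 are almost identical. Under the plain absolute difference, 512 and 1024 would be 256 times farther apart than 2 and 4.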
Permutation parameters. Permutation parameters are used to describe the reordering of a sequence of elements. In compiler applications this most commonly appears as the reordering of a set of loops [22]. Consider for example a kernel with four nested loops $(l_1, l_2, l_3, l_4)$ which can be performed in any order. This ordering can be represented by a single permutation variable $\pi$, which is a vector whose element $i$, given by $\pi_i = j$, describes the index $j$ of loop $l_i$ in the new order. For example, applied to the original nest for (l1 ... ) for (l2 ... ) for (l3 ... ) for (l4 ... ) ..., the permutation $\pi = [2, 4, 3, 1]$ places $l_1$ second, $l_2$ fourth, $l_3$ third, and $l_4$ first, i.e., it yields the loop order $(l_4, l_1, l_3, l_2)$.
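Under the convention stated above, where element i of the permutation holds the new position of loop l_i, decoding a permutation into a loop order can be sketched as follows (the function name is illustrative):

```python
def loop_order(perm):
    """Turn a permutation (perm[i-1] = new 1-based position of loop l_i) into a loop order."""
    order = [None] * len(perm)
    for loop, position in enumerate(perm, start=1):
        order[position - 1] = f"l{loop}"
    return order

order = loop_order([2, 4, 3, 1])  # ['l4', 'l1', 'l3', 'l2']
identity = loop_order([1, 2, 3, 4])  # ['l1', 'l2', 'l3', 'l4']
```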
Prior black-box optimization literature, autotuning, and DSE frameworks lack the capability to effectively handle this variable type, with the notable exception of OpenTuner [2]. In BO frameworks employing GPs, it is important to accurately estimate how different permutations relate to each other. In other words, for the nested loop reordering example above, the framework needs to determine whether two different loop orderings are likely to yield similar performance. One naive way of handling permutation variables is to treat them as categorical variables, i.e., to consider one nested loop ordering to be equally similar to every other loop ordering (with the exception of itself). This, however, ignores the underlying structure that can be used to define a more refined similarity measure. We instead present three different semimetrics for permutations: the Kendall distance, Spearman's rank correlation, and the Hamming distance. While the semimetrics are not strictly distance metrics, Lomelí et al. [39] show that they can be used to form a valid GP kernel. These three semimetrics are illustrated in Fig. 3 on a set of four elements, where the two boxes represent two permutations $\pi = [1, 2, 3, 4]$ and $\pi' = [2, 4, 3, 1]$.
The first semimetric is the Kendall distance, $d_k(\pi, \pi') = \sum_{i<j} \mathbb{1}_{(\pi_i - \pi_j)(\pi'_i - \pi'_j) < 0}$. The Kendall distance represents the number of discordant pairs, i.e., the pairs of elements that have swapped order between the two permutations. Each discordant pair is represented by a green, interconnected double arrow. The second semimetric is Spearman's rank correlation, $d_s(\pi, \pi') = \sum_{i=1}^{n} (\pi_i - \pi'_i)^2$, which is the sum of squared movements of the elements between two permutations. It is illustrated with blue arrows in Fig. 3, where the dots represent the starting points and the arrows the final positions. For example, the number two starts in the second position in $\pi$ (left) and moves to the fourth position in $\pi'$ (right), meaning that it has travelled a distance of two. Spearman's rank correlation then sums the squared displacement of all elements. Note that the square substantially emphasizes large rank changes. Lastly, the Hamming distance, $d_h(\pi, \pi') = \sum_{i=1}^{n} \mathbb{1}_{\pi_i \neq \pi'_i}$, is the number of elements in $\pi'$ that are no longer at their original position in $\pi$, represented with orange triangles in the figure.
For a given permutation set, the choice of semimetric depends on how those permutation parameters are expected to impact the performance metric. Intuitively, the Kendall distance focuses more on parameter order, whereas Spearman's rank correlation emphasizes large movements of individual elements. The Hamming distance only considers the number of elements changed and ignores where they moved to. As an example, consider the two loop orders for (l2 ... ) for (l3 ... ) for (l1 ... ) for (l4 ... ) ... and for (l4 ... ) for (l3 ... ) for (l1 ... ) for (l2 ... ) ....
They have a high Spearman's rank correlation due to the large movement of the first and last elements and relatively smaller Kendall and Hamming distances, which is intuitive given the compiler transformation that this represents. This is backed by our ablation analysis in Sec. 5.3, where we observe that using Spearman's rank correlation outperforms the other alternatives. Consequently, we use Spearman's rank correlation as the default setting for permutation variables in BaCO.
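The three permutation semimetrics can be sketched directly from their descriptions: discordant pairs, sum of squared displacements, and number of changed positions. The function names are illustrative, and the values are computed for the two permutations used in the Fig. 3 illustration.

```python
from itertools import combinations

def kendall(p, q):
    """Number of element pairs whose relative order differs between p and q."""
    return sum(1 for i, j in combinations(range(len(p)), 2)
               if (p[i] - p[j]) * (q[i] - q[j]) < 0)

def spearman(p, q):
    """Sum of squared position changes of all elements."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def hamming(p, q):
    """Number of elements whose position changed."""
    return sum(1 for a, b in zip(p, q) if a != b)

pi, pi_prime = [1, 2, 3, 4], [2, 4, 3, 1]
d_kendall = kendall(pi, pi_prime)    # 4
d_spearman = spearman(pi, pi_prime)  # 14
d_hamming = hamming(pi, pi_prime)    # 3
```

The squared term makes the single move of the fourth element by three positions contribute 9 of the 14 units of the Spearman value, which illustrates how strongly this semimetric weights large displacements.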

Parameter constraints
For an autotuning framework to be truly competitive in the complex world of modern autoscheduling, it is essential to effectively handle constraints in the parameter search space. Constraints can be divided into known constraints, which are known prior to optimization, and hidden constraints, which are only discovered during optimization. BaCO is designed to support both these constraint types.
Known constraints. In autotuning applications, users often possess expert knowledge regarding parameter configurations that lead to inefficient or even infeasible schedules. Incorporating this knowledge into the autotuning framework leads to significant performance improvements. The improvement becomes even greater when the feasible set makes up a small fraction of all possible configurations, i.e., when the search space is sparse. BaCO handles known constraints during the acquisition function optimization and proposes only feasible configurations. As such, the surrogate model trains exclusively on feasible points.
BaCO uses a Chain of Trees (CoT) data structure to deal with sparse search spaces, which was first presented by Rasch et al. [46]. The CoT computes all of the feasible configurations a priori and stores them as a collection (or "chain") of trees. Each tree corresponds to a group of co-dependent parameters, and parameters in different trees are independent of one another. Within a tree, each level corresponds to a single parameter and each node in that level corresponds to a possible value for that parameter. Each path from the root to a leaf then represents a partial configuration, and the tree is built so that only feasible configurations are included. Consider, for example, a search space with five input parameters and three constraints, in which parameters $p_1$ and $p_2$ as well as parameters $p_3$, $p_4$, and $p_5$ are co-dependent. We represent them with the CoT shown in Fig. 4.
As the parameters in different trees are independent, any combination of partial feasible configurations from the different trees yields a feasible configuration. For example, the leftmost path in the left tree combined with the rightmost path in the right tree yields the feasible configuration $(p_1, p_2, p_3, p_4, p_5) = (2, 2, 4, 4, 8)$.
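A Chain of Trees can be sketched with nested dictionaries. The domains and constraints below are hypothetical stand-ins chosen so that (2, 2, 4, 4, 8) is feasible; they are not the search space from the paper's example. The brute-force enumeration is also a simplification, since a practical implementation would prune infeasible branches while building the tree.

```python
from itertools import product

def build_tree(domains, is_feasible):
    """Nested dicts: level k holds values of the k-th parameter; only feasible paths are kept."""
    tree = {}
    for values in product(*domains):
        if is_feasible(*values):
            node = tree
            for v in values:
                node = node.setdefault(v, {})
    return tree

def contains(tree, values):
    """Membership test: follow the path given by `values` through the tree."""
    node = tree
    for v in values:
        if v not in node:
            return False
        node = node[v]
    return True

def paths(node, depth):
    """Enumerate all root-to-leaf paths, i.e., feasible partial configurations."""
    if depth == 0:
        yield ()
        return
    for v, child in node.items():
        for rest in paths(child, depth - 1):
            yield (v,) + rest

# Hypothetical groups: (p1, p2) with p2 <= p1; (p3, p4, p5) with p3 | p4 and p5 > p4.
left = build_tree([(2, 4, 8), (2, 4, 8)], lambda p1, p2: p2 <= p1)
right = build_tree([(2, 4), (2, 4, 8), (4, 8)],
                   lambda p3, p4, p5: p4 % p3 == 0 and p5 > p4)

def is_member(config):
    return contains(left, config[:2]) and contains(right, config[2:])

# Trees are independent, so full configurations are the product of their paths.
all_configs = [a + b for a in paths(left, 2) for b in paths(right, 3)]
```

In this toy space only 24 of the 108 unconstrained combinations are feasible, and checking feasibility reduces to a handful of dictionary lookups instead of re-evaluating the constraints.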
The CoT serves three purposes in BaCO. First, random sampling can be performed directly from the CoT. Second, instead of evaluating the constraints explicitly, it is significantly faster to check whether a configuration belongs to the CoT. Third, it allows working with highly sparse search spaces, which are common in the autotuning domain. In such sparse spaces, operating directly on the original domain becomes infeasible.
Sampling in constrained spaces as done in Rasch et al. [45] is inherently biased, since their optimization method prioritizes different configurations depending on the structure of the CoT. Their approach is equivalent to randomly sampling a configuration from the CoT by starting at each root node and then iteratively choosing a child with uniform probability, which is biased towards configurations in less dense parts of the tree. This bias furthermore depends on the order in which the parameters appear in the tree, which is an undesirable feature. We propose an alternative approach where we instead sample uniformly from the leaves of the trees, which is bias-free, meaning that the random sampling is uniform over all configurations. We study the impact of the bias in Sec. 5.
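The difference between the two sampling schemes can be made concrete on a small hypothetical tree. The sketch computes the exact leaf probabilities of the child-uniform walk and implements leaf-uniform sampling by weighting each child by its number of leaves, which is one possible way to realize uniform sampling over the leaves.

```python
import random
from fractions import Fraction

# Hypothetical two-level tree: value 2 has one feasible leaf, value 4 has three.
TREE = {2: {2: {}}, 4: {2: {}, 4: {}, 8: {}}}

def count_leaves(node):
    return 1 if not node else sum(count_leaves(child) for child in node.values())

def walk_probabilities(node, prefix=(), prob=Fraction(1)):
    """Exact probability of each leaf when every child is chosen uniformly."""
    if not node:
        return {prefix: prob}
    result = {}
    for v, child in node.items():
        result.update(walk_probabilities(child, prefix + (v,), prob / len(node)))
    return result

def sample_uniform_leaf(node, rng):
    """Pick children proportionally to their leaf counts, so every leaf is equally likely."""
    path = []
    while node:
        values = list(node)
        weights = [count_leaves(node[v]) for v in values]
        v = rng.choices(values, weights=weights)[0]
        path.append(v)
        node = node[v]
    return tuple(path)

walk = walk_probabilities(TREE)
rng = random.Random(0)
draws = [sample_uniform_leaf(TREE, rng) for _ in range(4000)]
share_sparse = draws.count((2, 2)) / len(draws)
```

Under the child-uniform walk the single configuration in the sparse subtree is drawn with probability 1/2, while each of the three configurations in the dense subtree gets 1/6. With leaf-count weighting, all four configurations are drawn with probability 1/4.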
Another source of sparsity comes from how parameters are defined prior to optimization. In autotuning and DSE applications, due to the nature of how computers store information, parameters are often constrained to take exponential values. Treating such parameters as integers leads to sparsity in the search spaces. BaCO instead applies the logarithmic transformation to such parameters. This transformation makes the search space significantly denser and yields more natural distances for the GP. These qualities improve performance, as we shall see in Sec. 5.
Hidden constraints. Requiring the complete feasible domain definition from the user would severely limit the autotuner's usability. Some constraints are either too complicated to describe analytically or unknown a priori. Instead, BaCO supports the concept of hidden constraints, learned automatically during optimization. BaCO uses a Random Forest model to predict the probability of feasibility for each configuration. It then extends the EI acquisition function presented in Sec. 3.3 by multiplying the EI with the probability of feasibility [41]. This should be compared with the naive approach of assigning high objective values to infeasible configurations, which suffers from the difficulty of setting an accurate penalty. Such penalty terms are furthermore often detrimental to the statistical model.
However, the practical interaction between the acquisition function based on the GP model and the RF feasibility predictor is complex. There is a constant trade-off between the feasibility model wanting to select feasible points and the value predictor that seeks to explore the unexplored infeasible regions. If the surrogate model becomes excessively confident within the feasible region, this balance tends to be skewed, in which case the selection fails to reliably find feasible points. As a practical solution, we introduce a minimum feasibility limit $\varepsilon_f$ and only consider configurations with probability of feasibility greater than $\varepsilon_f$ for evaluation. By randomly sampling a new $\varepsilon_f$ each iteration, with $p(\varepsilon_f = 0) > 0$, we asymptotically guarantee that this does not cut away any solutions.
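The selection rule can be sketched as follows. The distribution used to draw the feasibility limit (zero with probability `p_zero`, otherwise uniform up to `upper`) and the fallback when no candidate passes the limit are assumptions made for illustration, as the text only requires that a limit of zero has positive probability.

```python
import random

def sample_feasibility_limit(rng, p_zero=0.2, upper=0.8):
    """Draw a new limit each iteration; it is exactly 0 with probability p_zero."""
    return 0.0 if rng.random() < p_zero else rng.uniform(0.0, upper)

def select(candidates, ei, p_feasible, eps_f):
    """Maximize EI times probability of feasibility among candidates above the limit."""
    allowed = [c for c in candidates if p_feasible[c] > eps_f]
    if not allowed:
        # Assumed fallback: take the candidate most likely to be feasible.
        return max(candidates, key=lambda c: p_feasible[c])
    return max(allowed, key=lambda c: ei[c] * p_feasible[c])

# Hypothetical candidates: A promises a large improvement but is probably infeasible.
ei = {"A": 10.0, "B": 0.3}
p_feasible = {"A": 0.05, "B": 0.9}

pick_without_limit = select(["A", "B"], ei, p_feasible, eps_f=0.0)
pick_with_limit = select(["A", "B"], ei, p_feasible, eps_f=0.1)
eps_draws = [sample_feasibility_limit(random.Random(s)) for s in range(200)]
```

With a limit of zero the overconfident value prediction wins and the risky candidate A is selected. With a positive limit A is filtered out and B is selected. Since a limit of zero is drawn occasionally, A still remains reachable in later iterations.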

Evaluation
We validate the efficiency, effectiveness, and generalizability of BaCO. We first introduce the reference autotuning methods that we use as baselines to evaluate the performance of BaCO, followed by the benchmarks from the three real-world frameworks presented in Sec. 2. We then show the performance results. For lack of space, we show the extensive empirical results on all the frameworks and benchmarks in Appendix A. All experiments are run for 30 repetitions. We also show a wall-clock time analysis of all the autotuners used in Appendix B.
We answer the following research questions (RQ):
RQ1) Does BaCO achieve high performance with a limited autotuning budget? The evaluation in Fig. 5 shows that, with a tiny budget of 20-40 evaluations, depending on the complexity of the benchmark, BaCO achieves 1.35×-1.55× better performance than the state-of-the-art baselines. Furthermore, BaCO consistently achieves expert-level performance with a small budget of 40-80 evaluations, where the baselines struggle to achieve expert-level performance even with a much larger budget. This demonstrates the ability of BaCO to deliver high performance for a small budget, even for complex search spaces.
BaCO delivers expert-level performance on average 2.87×-3.87× faster than the state-of-the-art autotuning frameworks. Fig. 6 highlights the quicker performance evolution of BaCO for representative benchmarks, and a more detailed breakdown is presented in Table 9 in the appendix. These results show that BaCO delivers performance much more quickly than the baselines.
RQ2) Does BaCO generalize across compiler frameworks and benchmarks? Our evaluation across three diverse real-world compiler frameworks consistently shows that BaCO significantly outperforms the baselines. In fact, BaCO is the only framework that outperforms the expert configuration on all benchmarks across compiler frameworks, as shown in Fig. 6, with a more detailed breakdown in Table 5 in the appendix. This observation suggests that the techniques discussed in the paper generalize well across compiler frameworks and benchmarks.
RQ3) What is the performance benefit of customizing Bayesian optimization for compiler autotuning? Our system demonstrates advantages over prior BO-based autotuning frameworks that are not specifically customized for compiler domains (Ytopt in Fig. 8). As explained in Sec. 3 and Sec. 4, these improvements validate our design choices and suggest that there are significant performance benefits to be gained by customizing the BO framework for this particular domain.
RQ4) What are the findings of autotuning our distinct real-world compilers using BaCO? This question attempts to provide insight into why BaCO outperforms baselines for our real-world compiler benchmarks. We identify three main areas: exploration of new configurations, testing schedules that did not exist in prior work, and better handling of both known and unknown constraints for complex real-world applications. Evaluations for this question come from Fig. 5, Fig. 7, and Fig. 11 in the Appendix.

Baseline Methods
To contextualize the performance of BaCO, we evaluate it alongside two state-of-the-art autotuning frameworks and two random sampling approaches.
ATF with OpenTuner The Auto-Tuning Framework (ATF) [46] extends the popular OpenTuner [2] to handle known constraints. We chose OpenTuner as a baseline since it is one of the leading frameworks for autotuning.
Ytopt Ytopt [63] is an autotuning framework using BO and is part of the PROTEAS-TUNE project [1]. It supports both Random Forests and Gaussian processes. We run it here with Random Forests, as the GP implementation does not support constraints. When infeasible solutions are found due to hidden constraints, they are added to the data set with a high objective value. We compare against Ytopt since it is one of the few frameworks that support either constraints or GPs. When addressing RQ3, we run Ytopt with GPs.
Uniform and CoT random sampling These are uniform random sampling methods. CoT random sampling samples uniformly at random directly from the CoT. This baseline allows us to study the impact of the bias introduced by the known constraints, as explained in Sec. 4.2.
Default and expert configurations For reference, we show the performance of two baseline configurations: the default configuration, and an expert configuration, carefully handcrafted by domain experts in the respective programming languages. It is unlikely that a developer would exceed the expert performance baseline, which makes it a suitable data point for our empirical analysis. The HPVM2FPGA benchmarks are automatically generated by the autoscheduler and do not provide any expert configurations, in which case we only report the default. The expert configurations are taken from prior publications: TACO [49] and RISE & ELEVATE [19,29,54,57]. The original authors used manual or semi-automated methods to determine well-performing configurations based on their experience, hardware characteristics, or data properties. The authors had the incentive to produce the best configurations but, presumably due to time constraints, they may have occasionally missed better-performing configurations.

Benchmarks
BaCO is evaluated over 15 kernels from linear algebra, machine learning, image processing, statistics, and signal processing. We integrate BaCO in the three real-world frameworks presented in Sec. 2. The benchmarks have been chosen based on prior work by the authors of the three frameworks. Furthermore, most of these benchmarks have expert-optimized code, which allows for a fair comparison. The search space size ranges from tens of thousands to billions of configurations, as described in Table 3, which is beyond the scope of exhaustive search.
To define the evaluation budget for each benchmark, we first establish a full budget, as shown in the last column of Table 3. The full budget is defined using a rule of thumb of around 5 to 6 minutes. This maximum compilation time is commonly adopted in large companies, such as Google. We then define the tiny and small budgets as 1/3 and 2/3 of the full budget.
TACO benchmarks We benchmark 5 tensor algebra expressions commonly used in machine learning and tensor factorization [16]. Namely, they comprise sparse matrix-vector multiply (SpMV) $a_i = \sum_k B_{ik} c_k$, sparse matrix multiply (SpMM) $A_{ij} = \sum_k B_{ik} C_{kj}$, sampled dense-dense matrix multiply (SDDMM) $A_{ij} = \sum_k B_{ij} C_{ik} D_{jk}$, tensor times vector (TTV) $A_{ij} = \sum_k B_{ijk} c_k$, and fourth-order matricized tensor times Khatri-Rao product (MTTKRP) $A_{ij} = \sum_{k,l,m} B_{iklm} C_{kj} D_{lj} E_{mj}$. Each tensor expression is given a scheduling template that exposes tiling parameters (split and unrolling factors) and permutation parameters (loop reorderings). BaCO searches for the set of parameters, and therefore the schedule, that yields the best performance. The characteristics of these parameters and the search space are described in Table 3. We use tensors from a wide variety of real-world applications ranging from power networks and circuits to fluid dynamics and social networks. We run matrix expressions on a subset of SuiteSparse matrices [10,31] and synthetic uniform random tensors, and we run higher-order expressions on the Facebook Activities tensor [59], a subset of the FROSTT tensor collection [51], and synthetic tensors as well (see Table 4). The selected tensors vary widely across tensor properties, including number of nonzeros, dense dimension size, and irregular nonzero pattern.
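To make the semantics of these expressions concrete, the following sketch evaluates SpMV and SDDMM on a sparse matrix stored as a coordinate dictionary. It is only a reference for what is computed, under the assumption that B is the sparse operand. It is not the code TACO generates, and it exposes none of the tiling or loop-reordering parameters that BaCO tunes. The function names and the dictionary format are choices made for this illustration.

```python
def spmv(b_coo, c, n_rows):
    """a_i = sum_k B_ik * c_k, iterating only over the nonzeros of B."""
    a = [0.0] * n_rows
    for (i, k), b_ik in b_coo.items():
        a[i] += b_ik * c[k]
    return a


def sddmm(b_coo, c, d):
    """A_ij = sum_k B_ij * C_ik * D_jk, evaluated only where B_ij != 0.

    c and d are dense matrices stored as lists of rows.
    """
    a = {}
    for (i, j), b_ij in b_coo.items():
        dot = sum(c_ik * d_jk for c_ik, d_jk in zip(c[i], d[j]))
        a[(i, j)] = b_ij * dot
    return a


# Sparse 2x3 matrix with two nonzeros, keyed by (row, column).
b = {(0, 0): 2.0, (1, 2): 3.0}
print(spmv(b, [1.0, 2.0, 3.0], n_rows=2))

# Sampling matrix with a single nonzero at (0, 1).
s = {(0, 1): 2.0}
print(sddmm(s, [[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
```

Because the output of SDDMM is only defined at the nonzeros of B, the dense product of C and D is never formed in full, which is the property that makes the loop order and tiling choices matter for performance.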
The TACO benchmarks were run on nodes with two Intel Xeon Gold 6130 processors locked at 2.1 GHz, using all 32 cores and 96 GB of RAM.

RISE & ELEVATE benchmarks
We use seven benchmarks covering multiple domains, optimizations, and hardware devices. This results in benchmarks requiring various autotuning features, as described in Table 3.
The CPU Matrix Multiplication (MM_CPU) benchmark from [19] is run on a CPU and applies tiling, vectorization, and loop-permutation optimizations. The remaining benchmarks are run on a GPU and apply GPU-specific optimizations, including the OpenCL-specific work-group configuration, memory hierarchies, and coalescing. The MM_GPU and K-means_GPU dense linear algebra benchmarks are inspired by implementations used in [56]. The linear algebra algorithms Asum_GPU and Scal_GPU are from [54], and the stencil benchmark is from [57]. The remaining image processing algorithm, Harris_GPU, is a corner detector described in [29].
The RISE & ELEVATE evaluation was executed on 8 cores of an Intel Xeon E5-2650 v3 @ 2.30 GHz processor accompanied by 32 GB of RAM. For the GPU benchmarks, we used an NVIDIA K80 GPU.
HPVM2FPGA benchmarks We use the benchmarks presented in [13]: (1) Breadth First Search (BFS) and (2) the computational fluid dynamics algorithm Euler with pre-computed fluxes (PreEuler), both taken from the Rodinia benchmark suite [6], and (3) the 3D Spatial Audio Encoder (Audio) from the ILLIXR testbed [24]. The benchmarks represent diverse workloads from different domains with varying parameter space sizes, ranging from 4 parameters for BFS to 15 for Audio.
We ran these benchmarks through HPVM2FPGA's optimizer, reporting the estimated execution time targeting an Intel Arria 10 GX FPGA in our evaluation results.

RQ1) Does BaCO achieve high performance with a limited autotuning budget?
Fig. 5 shows the average performance of BaCO and the baselines for three different levels of autotuning budget for all our benchmarks. The tiny budget is only 20-40 evaluations.³ BaCO clearly outperforms the baselines, delivering on average better performance than all baselines for 19 out of the 24 benchmarks. For TACO, the tiny budget is even sufficient to deliver expert-level performance. With the small budget, BaCO delivers expert performance for all three compiler frameworks. The baselines struggle to deliver good performance even with the full budget, particularly for the challenging spaces in the RISE benchmarks. Tables 6, 7, and 8 in the appendix show the detailed performance results for each individual benchmark and autotuning framework.
BaCO also achieves performance faster, i.e., it reaches the final performance of the other baselines using fewer configuration evaluations. Fig. 6 shows the performance evolution for three selected benchmarks, where BaCO delivers performance with 2.9×-5× fewer evaluations than ATF and Ytopt. On average, BaCO finds the best ATF and Ytopt configurations 2.87× and 3.82× faster, respectively. Our experiments confirm that these results generalize well across our benchmarks; due to space constraints, these additional results are presented in Table 6 in the Appendix.

RQ2) Does BaCO generalize across compiler frameworks?
To see how the performance generalizes, we show in Fig. 7 (and Fig. 11 in the Appendix) how the mean of the best-found solution by each framework improves over time for each individual benchmark. In the figures, the average performance is plotted for each method and each benchmark. The goal is to achieve a lower value, which identifies a better-performing configuration, and to find low values as far to the left as possible, which means using a low tuning budget. The performance of the default configuration and expert configuration is presented for reference when available. We further mark with a star the point where each method reaches expert-level performance, such that a shorter distance between the y-axis and the star is better. As finding improvement over the default configuration initially is easy, we split the plots into two regions with different scales. This helps focus on the interesting part closer to the expert-level performance. BaCO reliably yields high-level performance and overall provides the best schedule in 22 out of the 24 benchmarks. It is further frequently the only method to reach expert-level performance within the given budget.
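The curves and stars described above can be computed along the following lines. This is a sketch under stated assumptions: each repetition is a list of measured runtimes in evaluation order, infeasible evaluations are assumed to have been filtered out or encoded as infinity beforehand, and the function names are hypothetical rather than taken from BaCO.

```python
def best_so_far(runtimes):
    """Running minimum of the runtimes observed in one repetition."""
    best = float("inf")
    curve = []
    for runtime in runtimes:
        best = min(best, runtime)
        curve.append(best)
    return curve


def mean_best_curve(runs):
    """Average the best-so-far curves over all repetitions."""
    curves = [best_so_far(run) for run in runs]
    return [sum(values) / len(curves) for values in zip(*curves)]


def evaluations_to_reach(curve, target):
    """First evaluation count (1-based) at which the curve reaches target.

    Returns None if the target is never reached within the budget.
    """
    for count, value in enumerate(curve, start=1):
        if value <= target:
            return count
    return None


runs = [[5.0, 3.0, 4.0, 1.0], [6.0, 6.0, 2.0, 2.0]]
curve = mean_best_curve(runs)
print(curve)
print(evaluations_to_reach(curve, target=2.5))
```

The ratio between the values returned by `evaluations_to_reach` for two methods, with the target set to the final performance of the slower method, corresponds to the kind of "times faster" figure reported in this section.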

RQ3) What is the performance benefit of customizing Bayesian optimization for compiler autotuning?
We study the BaCO design by running SpMM on three matrices with default settings and with a number of major features turned off. We use the SpMM benchmark as it is reasonably well-behaved and has only a few constraints. The average speedup over expert is shown in Fig. 8. We denote the restricted version by BaCO--. In more detail, BaCO-- is BaCO without variable transformations, model priors, and local search for the acquisition function. It further uses the naive distance for permutation variables, which ignores their underlying structure, and it does not use BaCO's more advanced fitting of the GP hyperparameters. We see that with those changes, BaCO takes about a 20% performance loss.
Next, we compare it with the GP implementation of Ytopt. Ytopt uses none of the above-mentioned features and additionally has a less advanced GP and BO toolkit. Ytopt only supports constraints for RF, so this Ytopt configuration does not support constraints. However, for this benchmark we have manually pruned the search space for Ytopt, so that the only remaining constraint is a single less-than relationship between two variables. BO with GPs requires a lot of care to be efficient, which we see from the difference between Ytopt (GP) and BaCO--.
Lastly, we show the difference between a well-implemented GP and an RF as surrogate model. Especially for smaller budgets, the GP model shows much stronger performance. This is relevant as there is currently a paradigm shift towards using GPs in discrete settings.
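The aggregate reported in Fig. 8, a geometric mean of the performance relative to the expert configuration, can be computed as sketched below. The runtimes are made-up placeholders rather than measured values, and the helper names are hypothetical.

```python
import math


def speedup_over_expert(expert_runtime, tuned_runtime):
    """Values above 1 mean the tuned configuration beats the expert."""
    return expert_runtime / tuned_runtime


def geometric_mean(values):
    """Geometric mean via the mean of logarithms (values must be > 0)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Placeholder (expert, tuned) runtimes for three matrices.
pairs = [(10.0, 8.0), (4.0, 5.0), (6.0, 3.0)]
speedups = [speedup_over_expert(expert, tuned) for expert, tuned in pairs]
print(geometric_mean(speedups))
```

The geometric mean is the natural choice for ratios, since a 2× speedup on one input and a 2× slowdown on another average to exactly 1.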
Ablation analysis To further understand the impact of the different design choices in BaCO, we perform an ablation analysis of the permutation kernel, variable transformations, and model priors in Fig. 9. First, BaCO in default settings is presented, which uses Spearman's rank correlation for permutation variables. Then we study the impact of changing the permutation metric to Kendall distance, Hamming distance, as well as the naive approach of treating permutations as categorical variables (Sec. 4.1). The Spearman metric yields the best performance, especially in early iterations.
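For reference, the three permutation metrics compared in this ablation can be written as below, following the descriptions in the caption of Fig. 3: Kendall counts discordant pairs, Spearman sums the squared element movement, and Hamming counts changed positions. This is an illustrative sketch with hypothetical function names. How BaCO turns these distances into a GP kernel is described in Sec. 4.1 and is not reproduced here.

```python
from itertools import combinations


def positions(perm):
    """Map each element to its index in the permutation."""
    return {item: index for index, item in enumerate(perm)}


def kendall_distance(p, q):
    """Number of element pairs ordered differently in p and q."""
    pos_p, pos_q = positions(p), positions(q)
    return sum(
        1
        for a, b in combinations(p, 2)
        if (pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0
    )


def spearman_distance(p, q):
    """Sum of squared differences between each element's positions."""
    pos_p, pos_q = positions(p), positions(q)
    return sum((pos_p[x] - pos_q[x]) ** 2 for x in p)


def hamming_distance(p, q):
    """Number of positions holding different elements."""
    return sum(1 for a, b in zip(p, q) if a != b)


identity = (0, 1, 2, 3)
swapped = (1, 0, 2, 3)   # one adjacent swap
reverse = (3, 2, 1, 0)   # full reversal
for other in (swapped, reverse):
    print(
        kendall_distance(identity, other),
        spearman_distance(identity, other),
        hamming_distance(identity, other),
    )
```

For a single adjacent swap the three distances are 1, 2, and 2, while for a full reversal of four loops they are 6, 20, and 4. Hamming saturates quickly, whereas Kendall and Spearman distinguish small reorderings from large ones.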
Secondly, we study the impact of removing the logarithmic transformations of variables and output (Sec. 4.2), and the model priors (Sec. 3.2). Removing the log transforms significantly deteriorates the performance at all evaluation counts. The lengthscale priors, however, have a larger impact early on, where they work to stabilize the procedure, and become less important once more data has been observed with which to fit the model.
Overall, we see that the changes have a much larger impact early on. Except for the transformations, removing any individual feature does not prevent good performance after more iterations. It is noteworthy that no individual design choice has a major impact, but together they make a large difference.
Hidden constraints Next, we study the impact of predicting feasibility with respect to hidden constraints on two benchmarks from the RISE/ELEVATE suite. In Figure 10, we show the average improvement over expert after 20, 40, and 60 iterations with and without the feasibility predictor. Additionally, we show the impact of the minimum feasibility limit presented in Sec. 4.2. The results show that the hidden constraints predictor has a significant positive impact, particularly after more iterations, when it has had more samples to train on. They also indicate that the introduced minimum feasibility limit (Sec. 4.2) is important to stabilize the interaction between the feasibility predictor and the surrogate model.
Chain-of-Trees Even after manual sparsity-reducing transformations, some of the search spaces remain highly sparse. When this is the case, CoTs greatly increase the efficiency of sampling from, and searching, the parameter domain. On the MM_GPU search space, for example, over ten runs, using CoTs reduced the time spent on evaluating constraints in the local search by a factor of 6× and in the random sampling by a factor of 80×. Overall, this reduced the time spent in BaCO's internal workings by 70%. For even sparser search spaces, operating directly on the search space quickly becomes untenable.

RQ4) What are the findings of autotuning our distinct real-world compilers using BaCO?
Configuration insight BaCO underperforms baselines in only 1 of our 24 (4%) benchmarks across the 15 kernels. OpenTuner is able to beat BaCO (as shown in Fig. 7) when running SpMV on cage12 (see Table 4). The SpMV benchmark is interesting as it has a good default setting, but ill-designed schedules can increase the run time by several orders of magnitude. Inspecting the configurations shows that ATF picks configurations similar to its previous configurations. This behavior in ATF exploits sampling around prior good configurations in each evaluation, whereas BaCO's algorithm is more explorative in finding completely new configurations. Exploiting configurations works for simple kernels, like SpMV, but fails for real-world kernels with increased complexity and runtime variance. Exploitative sampling is likely to get stuck in a local minimum, which is more likely to be globally bad for complex problems (as is the case with TTV on random1 in Fig. 11 in the Appendix). We do not augment BaCO in any specific manner to explore configurations, but the global nature of the BO paradigm emphasizes exploration more than methods with local-exploration elements such as OpenTuner do.
Performance over expert BaCO is able to achieve better-than-expert performance in some cases (see Fig. 5). Even experts in the domain, with insight into the underlying hardware architecture, may miss out on the optimal configuration and the best performance simply due to the amount of user time needed to explore the vast search space of possible configurations. BaCO allows users to automate that search while potentially offering better performance than the expert could find. For example, BaCO is able to find over a 1.1× speedup on average for TACO (see Fig. 5) since the experts in [49] only considered the default loop ordering (permutation) for all expressions. In addition, many of the configurations the autotuner discovers are hard to find by hand due to the concordant traversal of a compressed tensor data structure. Therefore, it is difficult for an expert to search the space of loop orderings and know which of them are infeasible.
Constraints in real-world applications Over half (8/15) of our benchmarks, and notably all of the HPVM2FPGA benchmarks, use hidden constraints. Additionally, all but one benchmark use known constraints, significantly reducing the feasible search space, as shown by the Feasible column in Table 3. Predictive modeling of hidden constraints has a significant impact on performance (as discussed in RQ3), and this impact is apparent in our real-world compiler benchmarks.

Related Work
Bayesian optimization for autotuning Several Bayesian optimization frameworks have been presented for autotuning [41,63,61,38]. One of the earlier frameworks was introduced by Nelson et al. [42], who present SURF, which uses Random Forest models to optimize tensor contraction operations on GPUs. This work was later extended into Ytopt by Wu et al. [63]. The authors use the skopt Bayesian optimization framework to optimize LLVM Clang/Polly pragma configurations on the PolyBench benchmark suite. Ytopt further allows the use of additional surrogate models such as Gaussian Processes and Boosted Trees. Ytopt implements a method based on Bayesian optimization, which BaCO builds on. However, we show that the Bayesian optimization pipeline needs to be further customized to work well on autotuning domains, which is the scope of our work.
The work by Sid-Lakhdar et al. [50] focuses on the meta- and multi-task learning aspect. It was extended by Liu et al. [38] into the GPTune framework. The authors use linear coregionalization models (LCMs) to model multiple similar problems simultaneously to increase efficiency. This was further extended by Zhu et al. [65] to also handle multifidelity applications. While this is out of the scope of the current work, meta-learning can be used in combination with BaCO to achieve greater efficiency. Willems et al. [61] use Bayesian optimization to autotune GPU kernels using Gaussian Processes and known constraints on the search space. Another recent approach is Bliss by Roy et al. [48], which probabilistically chooses a combination of models and acquisition functions in each new optimization iteration based on previous performance observations. Bliss' approach is orthogonal to BaCO, and it is possible that further efficiency can be achieved by combining the methods. Recently, Dorier et al. [11] presented DeepHyper, a Bayesian optimization framework for HPC storage system autotuning that focuses on transfer learning through the use of variational autoencoders.
Bayesian optimization for design space exploration Nardi et al. [41,30] use Bayesian optimization with an RF surrogate model to optimize FPGAs. They consider both multiple objectives and hidden constraints. Ejjeh et al. [13] use the same DSE framework to tune hardware-agnostic programs targeting FPGA backends. [53] design a human-centric DSE approach, where expert priors accelerate the convergence of the autotuner. While this is not the focus of our work, a simple adaptation of the BaCO acquisition function can benefit from the same user priors when available.
There is substantial work in the literature about DSE techniques in HLS [5,14,62,60,64]. However, most existing work focuses on using DSE for tuning HLS, rather than using it to select compiler optimizations [5,14,64]. These works are not based on Bayesian optimization, and we view them as complementary to our work.
The phase-ordering problem Autoscheduling tackles the task of applying a number of transformations to optimize a kernel automatically. Typically, the scheduling language parametrizes the application of those transformations by a bounded set of options, which we refer to as parameters, that are easier to optimize over. This parametrization approach is the one used by TACO and ELEVATE. However, a different approach is to operate directly on the space of transformations. Optimization over this unbounded tree-like space is commonly known as the phase-ordering problem [34,32,22,21].

Conclusions and Future Work
We introduce the Bayesian Compiler Optimization framework (BaCO), a plug-and-play solution to autoscheduling tasks for modern scheduling languages targeting various hardware backends. BaCO is able to reach expert-level performance 2.7×-10× faster than the state-of-the-art autotuners. The separation of concerns between policy and mechanism allows compiler users to delegate the complex and time-consuming task of scheduling to BaCO so that they can focus on their applications instead.
While we show that BaCO can provide high-performing solutions in less than 100 seconds, this time is still too long for use in software development. The holy grail of autoscheduling is to be able to use an autotuner during development, and ideally enable the user to run autotuning every time they compile their code. That way, users can check both functional and non-functional properties on a regular basis during the various program lifecycle phases. Indeed, increasing the efficiency of the autotuner would enable a new autotuning-in-development-loop paradigm, which is not accessible with the current state of autotuning technology.

Figure 1: An autoscheduler that is portable across the scheduling languages of diverse compilers.

Figure 2: Overview of the BaCO framework.

Figure 3: Illustration of similarity metrics between two permutations. The number of discordant pairs (top of right-hand box, green) is the Kendall distance, the squared element movement (bottom of right-hand box, blue) is the Spearman's rank correlation, and the number of changing elements (triangles, orange) is the Hamming distance.

Figure 6: Evolution of average best runtime for one kernel from each framework. The figure is split vertically into two different scales. BaCO reaches the final performance of the state-of-the-art methods using as little as 30% of the function evaluations of the other methods.

Figure 7: Evolution of average best runtime among evaluated configurations for selected benchmarks from Table 3. Fig. 11 in the Appendix contains the benchmarks from Table 3 not shown here. The figure is split vertically into two different scales, and we mark the iteration where each method beats the expert configuration with a star.

Figure 8: Geometric mean of the performance relative to the expert configuration for the TACO SpMM kernel applied to the filter3D, email-Enron and amazon0312 matrices after 20, 40 and 60 evaluations.

Figure 9: Geometric mean of the performance relative to the expert configuration for the SpMM kernel applied to the filter3D, email-Enron and amazon0312 matrices after 20, 40 and 60 evaluations. (Note the cut axis.)

Figure 10: Geometric mean of the performance relative to the expert configuration for the MM_GPU and Scal_GPU kernels after 20, 40 and 60 evaluations.

Figure 11: Evolution of average best runtime among evaluated configurations for all benchmarks in Table 3 that are not shown in Fig. 7. The figure is split vertically into two different scales and the iteration where each method beats the expert configuration is marked with a star.

Table 3: We evaluate BaCO on 15 important kernels from domains like machine learning, statistics, and signal processing. The benchmarks expose search spaces with varying number of parameters (Dim). They cover all parameter types considered (Params): real (R), integer (I), ordinal (O), categorical (C), and permutation (P). Constr. describes the type of constraints used by the benchmark: known (K) and hidden (H) constraints.

³ Besides the BFS benchmark, for which it is only 6 evaluations.