MUPPET: Optimizing Performance in OpenMP via Mutation Testing

Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level, thus preventing portability across compilers. This paper describes Muppet, a new approach that identifies program modifications, called mutations, aimed at improving program performance. Muppet's mutations help developers reason about performance defects and missed opportunities to improve performance at the source code level. In contrast to compiler techniques that optimize code at the intermediate representation (IR) level, Muppet uses the idea of source-level mutation testing to relax correctness constraints and automatically discover optimization opportunities that are otherwise not feasible at the IR level. We demonstrate Muppet's concept in the OpenMP programming model. Muppet generates a list of OpenMP mutations that alter the program parallelism in various ways, and can run a variety of optimization algorithms, such as Bayesian optimization and delta debugging, to find a subset of mutations that, when applied to the original program, causes the most speedup while maintaining program correctness. When Muppet is evaluated against a diverse set of benchmark programs and proxy applications, it finds speedup-inducing sets of mutations in 70% of the evaluated programs.


Introduction
Performance optimization continues to be a challenge in modern HPC software. The adoption of multi-core heterogeneous systems and the use of multi-process and multi-threaded programming models to fully utilize modern architectures are some of the factors that limit the ability of developers to solve performance issues; these issues can result in poor user experience, lower system throughput, limited scalability, and wasted computational resources [5,7,53].
Problems with Existing Techniques. A large body of work has been proposed to identify performance issues, and a number of tools are used in current HPC production environments to analyze application performance [3,21,32,45]. However, the process of isolating performance problems and/or generating tests to identify them is still mostly manual.
Most performance optimization techniques focus on highlighting "hot spots" but ultimately rely on programmers to identify code modifications that fix a performance problem or improve overall performance. Other approaches are based on quantifying hardware or runtime system events [17,35,36], but do not explicitly inform the programmer how to modify the code to improve performance. Compiler optimizations usually improve performance at the intermediate representation (IR) level; however, reasoning about correctness at the IR level is much more difficult than at the source level. As a result, compiler optimizations can leave optimization opportunities on the table. Moreover, IR-level optimizations are not portable across compilers.
We could potentially solve performance problems given accurate performance models for each available platform and application. If performance models were available, we could simply check whether the application's behavior falls within the bounds of such models. However, such an ideal mechanism is hard to realize in practice, as performance models are notoriously difficult to build accurately given the complexity of the HPC software stack and the underlying hardware. There are solutions for building performance models of specific aspects of the hardware and applications [12,30,51], but these models are usually not composable and are consequently of little practical use in modeling an entire application and platform.
Our Contributions. We present an approach based on mutation testing [25] to identify source code changes, or mutations, that (1) improve performance, and (2) help developers reason about performance at the source-code level (in contrast to the IR or assembly level as in existing methods). Since such an approach is based on source modifications, it is portable across compilers.
Mutation testing was proposed to identify correctness faults [25]; it assumes that a syntactic change (a mutant), along with an exploration campaign over multiple mutants, can help discover program defects faster than traditional methods. While some previous work has applied mutation testing to performance defects [10], mutation testing for performance has not been applied to parallel code and/or HPC programs. We demonstrate our approach in the OpenMP programming model, which is widely used in HPC.
We implement our approach in a framework named Muppet (Mutation-Utilized Parallel Performance Enhancement Tester). First, Muppet generates a list of OpenMP mutations that could alter the program's performance in various ways. A mutation is defined as a change to an existing OpenMP directive in the program that could change the performance of the code block that the directive targets. Muppet considers only mutations that are unlikely to change the correctness of the code block. Next, Muppet uses different optimization algorithms, such as Bayesian optimization (BO) [34] and delta debugging [56], to find a subset of mutations that, when applied to the original program, causes the highest speedup. We implement Muppet in the clang/LLVM front-end and evaluate it on the NAS Parallel Benchmarks [31] and three proxy applications (LULESH [26], HPCG [14], CoMD [18]).
In summary, our contributions are:
• We present a source-level approach that uses mutation testing to optimize HPC code. Our approach considers four classes of source mutations and applies them to OpenMP directives. To the best of our knowledge, we are the first to explore using mutation testing to optimize OpenMP code (Section 3).
• We design and implement our idea in the Muppet framework via the clang/LLVM front-end. Our approach integrates Muppet with several optimization algorithms, such as BO and delta debugging. The output of Muppet is a set of source modifications, or mutations, that produces the maximum speedup among the explored mutations without affecting correctness (Section 4).
• We evaluate Muppet on several benchmarks and proxy applications. We demonstrate that Muppet is capable of identifying mutations that improve performance in 70% of the evaluated programs, with a best speedup in average running time of 15.64% (Section 5).

Overview
In this section, we describe the philosophy of our approach, provide background on mutation testing, and present a simple mutation in a matrix multiply kernel that improves performance.

Approach's Philosophy
Existing approaches to isolating performance issues are difficult to use in practice. A number of performance problems can be fixed by changes in the source code; however, existing methods do not directly point developers to source modifications that fix such issues. Compilers optimize code at the IR level, but such solutions are not portable across compilers and make it harder to reason about correctness than solutions based on source modifications. We believe that tools and techniques for performance optimization should have the following features:
• Fine granularity detection: tools should pinpoint, with fine granularity, the location (code line) of performance issues or potential performance improvements.
• Guided fixes: the approach should help programmers understand and reason about performance defects; without a good understanding, it is hard to solve the problem or avoid it in the future.
• Automatic recommendations: the approach should automatically suggest code modifications that improve performance or fix a performance problem.
We designed Muppet using the above criteria to identify changes in OpenMP directives that improve performance.

Mutation Testing for Performance

Challenges. The key idea of Muppet is to perform small changes in the code, called mutations, and use exploratory algorithms to search for cases where mutations improve performance or fix a performance problem. Mutation testing has previously been studied to detect faulty programs by injecting small syntactic changes that expose correctness defects [25]. The idea of mutation testing is to generate sufficient data to expose real software defects in the code. However, it is challenging to use traditional mutation testing to isolate performance defects, because the syntactic changes can create faults, i.e., break the semantics of the program and produce incorrect programs.

Our Solution.
Inspired by previous work on mutation testing, we propose a different approach: injecting only mutations that are semantically correct and do not yield an incorrect program, for the purpose of exposing performance defects or speedup opportunities. Semantically correct mutants, or equivalent mutants, are considered problematic in traditional mutation testing because, by definition, they cannot fail the test suite, so they should be avoided to increase the effectiveness of mutation testing. In contrast, our approach explores semantically correct mutations, or a weaker form of mutations that successfully pass correctness tests, to identify mutations that increase performance, thus indicating performance defects.

Mutation Example
Here, we present a synthetic matrix-multiplication example, shown in Listing 1, that demonstrates Muppet's capabilities: when we apply Muppet, it finds a set of mutations that yields faster code execution.
Listing 1. Example code with a mutation found by Muppet that improves performance.

    #pragma omp tile sizes(16,16,16)
    for (int i = 0; i < ARRAY_SIZE; ++i)
      for (int j = 0; j < ARRAY_SIZE; ++j)
        ...

Originally, the code has only the OpenMP parallel for directive to parallelize the loop. Then, Muppet applies mutations to the existing OpenMP directives found in the code. Note that while Muppet only considers semantically correct mutations (which are likely to produce a correct program), it relies on the program's existing correctness checks, as shown in Section 5.1.1 for the evaluated programs. When we run Muppet on this example with delta debugging, after 20 tryouts Muppet reports a mutation that, when applied to the program, improves performance. With BO, it takes 66 tryouts to finish the optimization process, but the mutation is reported after 11 tryouts. The identified mutation is highlighted in the source code. In this simple example, the mutation is the addition of the OpenMP tile construct, which tiles one or more loops. In the end, Muppet reports to the developer that adding this construct to the loop yields an 18.84x speedup, from 7.116801 seconds to 0.377674 seconds.

Problem Statement
Given an OpenMP program P with running time T, Muppet analyzes the program and generates a set of mutations, M = {m1, m2, ..., mn}, which could potentially induce program speedup. We define the running time of the original program as f(P, ∅) = T. We define the running time of a variant program as f(P, M') = T', where M' ⊆ M.
We define the ideal minimum program running time as f(P, M*) = Tmin, where M* ⊆ M is the subset that minimizes f. The goal of Muppet is to find a subset M' of M with T' as close to Tmin as possible.
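The search space defined above is the power set of M, so exhaustive search is only feasible for tiny mutation sets. As a minimal sketch of the problem (our illustration, with made-up mutation names and a toy timing function; not Muppet's code):

```python
from itertools import combinations

def best_subset(mutations, run_time):
    """Exhaustively search all subsets M' of the mutation set M for the
    one minimizing the measured running time f(P, M').
    Exponential in |M|; Muppet uses BO or delta debugging instead."""
    best, best_t = frozenset(), run_time(frozenset())
    for r in range(1, len(mutations) + 1):
        for subset in combinations(mutations, r):
            t = run_time(frozenset(subset))
            if t < best_t:
                best, best_t = frozenset(subset), t
    return best, best_t

# Toy timing model: "tile" saves 2 s, "simd" saves 1 s,
# and combining them adds 0.5 s of overhead.
def toy_time(ms):
    t = 10.0
    if "tile" in ms: t -= 2.0
    if "simd" in ms: t -= 1.0
    if {"tile", "simd"} <= ms: t += 0.5
    return t

print(best_subset(["tile", "simd"], toy_time))  # best subset applies both (7.5 s)
```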

Tool Workflow.
The overall workflow of Muppet is illustrated in Figure 1.
The purposes of these modules are described below:
• Mutation generator analyzes the program and finds a set of source code mutations that can potentially be applied to change the OpenMP parallelism of the program.
• Transformer generates a program variant with a subset of the mutations found by the Mutation generator module.
• Tester runs the mutated programs from the Transformer and tests the performance speedup and correctness of the mutated variant.
• Optimizer applies a user-specified optimization algorithm to find the minimum of the function T' = f(M').
Next, we delve into the details of these modules, in the order in which they appear in Figure 1.

Mutation Generator
The Mutation Generator module traverses the abstract syntax tree (AST) of the program, looking for source code locations that can potentially be mutated so that program parallelism is changed. The time complexity of this step is O(n), where n is the number of statement nodes in the AST. The mutators in Muppet focus on mutating parallel/loop OpenMP constructs such as the parallel directive, the for directive, or the parallel for directive. All of these directives specify a source code region to be executed in parallel, but the parallelism may not be high enough to utilize all cores available to the OpenMP program. The module also looks for the beginnings of for loops for SIMD mutations. Once such language constructs (parallel for, for, etc.) are detected, the Mutation Generator module checks the source code around the current language construct. If the surrounding source code satisfies certain statically defined criteria (see below), then unique information about the current mutation, such as the source location, the way the source code is modified (insert before, insert after, modify), and the mutation type, is added to the list of mutations. The algorithm for this process is shown in Algorithm 1.
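A much simplified sketch of this scan, ours rather than Muppet's actual Clang plugin (the node kinds and mutation record fields are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Node:
    kind: str              # e.g. "omp_parallel_for" or "for_loop"
    line: int
    has_jump: bool = False # break/continue/return inside the body

def generate_mutations(nodes):
    """Single O(n) pass over statement nodes, recording candidate
    mutations as (line, placement, mutation_type) records."""
    mutations = []
    for node in nodes:
        if node.kind in ("omp_parallel", "omp_for", "omp_parallel_for"):
            # Existing directive: candidate for modification (e.g. collapse).
            mutations.append((node.line, "modify", "collapse"))
        if node.kind == "for_loop" and not node.has_jump:
            # Plain loop without jumps: candidate for a SIMD directive.
            mutations.append((node.line, "insert_before", "simd"))
    return mutations

nodes = [Node("omp_parallel_for", 10), Node("for_loop", 12),
         Node("for_loop", 20, has_jump=True)]
print(generate_mutations(nodes))  # [(10, 'modify', 'collapse'), (12, 'insert_before', 'simd')]
```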

Criteria Selection.
The criteria for each type of mutation simply follow the syntax of the OpenMP language specification. These criteria can be customized for any new type of mutation added. Here are some examples: "collapse" mutations are identified by an OpenMP directive followed by a rectangular nested loop within which there are no jump statements such as break, continue, or return; "simd" and "tiling" mutations are identified by a serial or parallel loop statement without OpenMP parallel constructs or jump statements inside; lastly, every variable inside an OpenMP parallel region is checked for eligibility to become a firstprivate variable. Some of the OpenMP mutations that can be applied in the previously shown matmul example are shown in Figure 2. The one that shows the highest speedup in matmul is the tiling mutation.

Optimizer
Once the list of mutations is generated, it is exported to the Optimizer. This module runs an optimization algorithm specified by the end user to find the minimum point of T' = f(M').
During the optimization process, it samples specific points of the T' = f(M') function by selecting/deselecting a subset of mutations, sending these mutations to the Transformer and Tester modules, and receiving T' from them once the mutated program has finished execution and running time statistics have been collected. Muppet supports two optimization algorithms: Bayesian optimization (BO) [34] and delta debugging [56]. The goal of these algorithms, albeit vastly different in implementation, is the same: find the subset of source mutations that introduces the maximum speedup. We selected BO because it is a common optimization algorithm that makes no assumptions about the form of the function, which makes it appropriate for Muppet. Delta debugging, on the other hand, was originally developed as a software testing algorithm to isolate bugs inside a program; it was adapted to finding speedup in program variants in previous work such as Precimonious [44] in the context of precision tuning. The inclusion of both algorithms in Muppet shows how algorithms with vastly different original purposes can solve the same problem in different ways. Muppet can also be extended to support other optimization algorithms such as genetic algorithms or simulated annealing.
For BO, since the input parameter of the function to be optimized, T' = f(M'), is a subset of mutations, which does not fit the input format of BO, we instead optimize T' = g(v1, v2, ..., vn), where vi = 1 if mi ∈ M' and vi = 0 otherwise. In this way, we convert the subset parameter into a list of binary parameters signaling whether each mutation is included in the subset, so that BO can accept this list as the input parameters of the function it optimizes.
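The subset-to-vector conversion can be sketched as follows (our minimal illustration; the mutation names are hypothetical):

```python
def subset_to_binary(mutations, subset):
    """Encode a subset M' of the ordered mutation list M as a 0/1 vector,
    the input format expected by Bayesian optimization libraries such as
    scikit-optimize."""
    return [1 if m in subset else 0 for m in mutations]

def binary_to_subset(mutations, bits):
    """Decode the optimizer's 0/1 vector back into a mutation subset."""
    return {m for m, b in zip(mutations, bits) if b == 1}

M = ["collapse@L10", "simd@L14", "tile@L20"]
bits = subset_to_binary(M, {"collapse@L10", "tile@L20"})
print(bits)                               # [1, 0, 1]
print(sorted(binary_to_subset(M, bits)))  # ['collapse@L10', 'tile@L20']
```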
As for delta debugging, we follow the LCCSearch algorithm in [44], where a change set in our adaptation of the algorithm is defined as the set of mutations applied to the original program, and the output is a minimal change set that causes speedup.
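A simplified delta-debugging-style minimization over mutation sets might look like this (our sketch; it follows the general ddmin partitioning scheme rather than the exact LCCSearch implementation):

```python
def ddmin_speedup(mutations, causes_speedup):
    """Starting from the full change set, repeatedly try dropping the
    complement of each partition while the remaining set still causes
    speedup; refine the partitioning when no reduction is possible."""
    cs = list(mutations)
    n = 2
    while len(cs) >= 2:
        parts = [cs[i::n] for i in range(n)]
        reduced = False
        for p in parts:
            if not p:
                continue
            complement = [m for m in cs if m not in p]
            if complement and causes_speedup(complement):
                cs = complement          # keep the smaller passing set
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(cs):
                break                    # already at finest granularity
            n = min(len(cs), n * 2)
    return cs

# Toy oracle: only the "tile" mutation is needed for speedup.
print(ddmin_speedup(["collapse", "simd", "tile", "firstprivate"],
                    lambda s: "tile" in s))  # ['tile']
```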

Transformer and Tester
The Transformer and Tester modules read the list of mutations from the Optimizer module, mutate the program into a variant, and run the variant to see if there is any speedup while maintaining the correctness of the program.

Compilation and Conflicts Checks.
Even though the Mutation Generator module already applies criteria for each mutation type to ensure that all generated mutations are syntactically correct, there are still situations where different mutations, when applied to the same program at the same time, conflict with each other. If Muppet let these conflicts pass unchecked during the transformation phase, a large number of mutated program variants would fail to compile.
To save execution time, when the module transforms the program, it also statically checks for and circumvents certain conflicts. These conflict checks can also be customized when new types of mutations are implemented or new conflicts are discovered during testing. Currently, the conflict checks include: no tiling directives inside a SIMD region, and no SIMD directives or clauses inside a tile or collapse region.
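A minimal sketch of such a conflict check (ours; the region identifiers and the exact conflict table are simplifying assumptions):

```python
# Pairs of mutation types that must not apply to the same loop nest,
# mirroring the checks described above: SIMD cannot be combined with
# tile or collapse on the same region.
CONFLICTS = {frozenset(("tile", "simd")), frozenset(("collapse", "simd"))}

def has_conflict(applied):
    """Each applied mutation is (type, region_id); report whether any two
    mutations on the same region form a known conflicting pair."""
    by_region = {}
    for mtype, region in applied:
        by_region.setdefault(region, set()).add(mtype)
    return any(frozenset((a, b)) in CONFLICTS
               for types in by_region.values()
               for a in types for b in types if a != b)

print(has_conflict([("tile", "loop1"), ("simd", "loop1")]))  # True
print(has_conflict([("tile", "loop1"), ("simd", "loop2")]))  # False
```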

Implementation Details
Muppet is implemented with a variety of programming languages and toolsets. The Mutation Generator and Transformer modules are implemented as Clang plugins. The Clang plugin system is one of several mechanisms in the Clang compiler architecture capable of performing source-to-source code transformation, along with libtooling and libclang. Clang plugins are used so that our code transformation runs alongside the build environment of the evaluated programs, with the same dependency checks. Muppet only requires minimal changes to the build scripts to work on new programs, as described in Section 4.3.
The Optimizer and Tester modules, and the overarching framework managing the communication between modules, are implemented in Python. This lets us leverage a mature set of Python numerical optimization packages such as scikit-optimize [24].

Language Support
Muppet uses a modular approach; each of the three modules can be replaced to implement analogous functionality. Currently, Muppet targets C/C++ programs with OpenMP language constructs, though it is possible to target FORTRAN programs by rewriting the Mutation Generator and Transformer modules with a source-to-source FORTRAN compiler such as ROSE [40].

Customizing Muppet Runtime Parameters
Muppet supports BO and delta debugging in our implementation. BO is implemented with scikit-optimize, while delta debugging is implemented from scratch following the algorithm described in Precimonious [44], since it has no publicly available Python implementation.
Since the running time of each program run may exhibit variations that should not be counted as speedup, users can customize Muppet parameters to change how running time is measured and suppress such variations. The times parameter instructs Muppet to run each variant a given number of times and collect the running time of each run; the shuffle switch, only available for delta debugging, randomly shuffles the order of mutations so that the delta debugging algorithm partitions the mutations differently each time (users can still specify the same random seed to reproduce the same shuffle). Lastly, users can choose between using the minimum running time across all repetitions as the program running time, or using the average running time.
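The reduction from repeated runs to a single running time can be sketched as follows (our illustration of the times and minimum/average options, with made-up measurements; not Muppet's code):

```python
def measure(run_once, times=5, use_min=True):
    """Run a program variant `times` times and reduce the samples to one
    running time: the minimum (default; best at rejecting environment
    noise) or the average."""
    samples = [run_once() for _ in range(times)]
    return min(samples) if use_min else sum(samples) / len(samples)

# Replayed toy measurements for one variant (seconds).
vals = iter([2.31, 2.05, 2.44, 2.12, 2.27])
print(measure(lambda: next(vals)))  # 2.05
```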

Integrating New Programs with Muppet
For better management of the evaluated programs, Muppet calls a customized version of the FAROS build system [22]. Muppet uses a variety of functionalities offered by FAROS to analyze, transform, build, and run the specified program. With FAROS, new programs can be added for mutation by simply adding new entries to the YAML config file.

Experimental Evaluation
This experimental evaluation answers the following research questions:
RQ1 Does Muppet discover source code mutations that induce speedup for OpenMP programs?
RQ2 What factors may determine the efficacy of Muppet in finding these source code mutations?
Evaluation Setup

Benchmarks. We use a set of 10 C/C++ OpenMP programs to evaluate Muppet. The programs include benchmark programs such as NPB-CPP [31] and HPCG [14], and proxy applications such as LULESH [26] and CoMD [18]. We use these programs to evaluate the efficacy of Muppet in finding speedup in different programs, whether a reference implementation or manually optimized code. On the benchmark side, NPB-CPP is the C++ version of the NAS Parallel Benchmarks, ported to various programming frameworks for shared-memory architectures, including OpenMP. We use 7 benchmark programs with varying problem sizes for evaluation: BT.A, CG.B, EP.B, FT.A, LU.A, MG.A, and SP.A. HPCG is a benchmark program that performs multigrid-preconditioned conjugate gradient iterations; we run it with a grid size of 96*96*96. All benchmarks contain result verification routines in their source code, which we use to determine program correctness.
On the proxy application side, LULESH is a proxy application simulating the Shock Hydrodynamics Challenge Problem, while CoMD is a proxy application implementing classical molecular dynamics algorithms and workloads as used in materials science. Evaluating these programs shows the efficacy of Muppet in helping software developers in scientific computing optimize the parallel performance of their programs. LULESH is run with the parameters -i 1500 -s 35, and CoMD with -e -i 1 -j 1 -k 1 -x 20 -y 20 -z 20. We use the approach presented in [28] to determine program correctness. For LULESH, we consider the iteration count, final origin energy, and TotalAbsDiff as the output; for CoMD, we use the final energy as the output.
Algorithm Parameters. We use both BO and delta debugging in our experimental evaluation. Because program running time varies across the evaluated programs, we place a tryout limit of 100 on both algorithms instead of using a total time limit. Our parameters for BO are n_calls = 100, n_initial_points = 10, and xi = 0.01.

Evaluation Environment.
We use a workstation with two 14-core Intel Xeon E5-2694v3 CPUs and 32 GiB of RAM, running Ubuntu 22.04. We use Clang 16.0.6 with OpenMP 5.1 support as the compiler both for source-to-source code transformation and for building and running the evaluated programs. Using OpenMP 5.1 also enables us to build programs with collapse clauses.
We also ensure that performance variation between program runs is minimized. We avoid CPU context switching by limiting the programs to the hardware threads of the second CPU, forcing the taskset -c 14-27 command in FAROS. Hardware quiescing, as defined in [1], is also performed to reduce performance fluctuations, such as turning off both simultaneous multithreading and dynamic frequency scaling.
For running time statistics collection, we run each mutated variant 5 times and use the minimum running time as the program running time T. For comparison, we also record the average running time for each tryout and evaluate whether there is any discrepancy between average and minimum running time, but this statistic is not used as the fitness function output for the optimization algorithms. We use the minimum running time for the fitness function because, as stated in [1], it is best at rejecting noise introduced by the evaluation environment, since any running time higher than the minimum must be due to such noise. However, we still calculate speedup for the average running time to see how performance variability affects running time.

Evaluation Results
Even though we have taken various measures to reduce performance variability between program runs, it is a factor that cannot be completely removed. Therefore, to determine whether a mutated program shows speedup, we use a 1% threshold: if, across the 5 runs, the speedup in the minimum running time or in the average running time is lower than 1%, the current subset of mutations is discarded.
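The threshold test can be sketched as follows (our illustration, with made-up timing values):

```python
def accepted_speedup(t_orig, t_mut, threshold=0.01):
    """Speedup check with a noise threshold: the mutated variant counts
    as faster only if it beats the original by more than `threshold`
    (1% by default); borderline subsets are discarded (None)."""
    speedup = (t_orig - t_mut) / t_orig
    return speedup if speedup > threshold else None

print(accepted_speedup(10.0, 9.0))   # 0.1  (10% speedup, accepted)
print(accepted_speedup(10.0, 9.95))  # None (0.5%, within the noise threshold)
```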
The results of both algorithms can be found in Table 1. We compare the minimum running time of the mutated program against the minimum of the original (columns 2-4), and its average running time against the average running time of the original program (columns 5-7). Our evaluation shows that for 7 out of 10 evaluated programs, delta debugging can find a subset of mutations that, when applied, causes speedup while maintaining the correctness of the program. The other 3 programs, below the horizontal line in Table 1, show no speedup.
The speedup with respect to the minimum time ranges from 1.35% in LU to 14.99% in MG. Meanwhile, the speedup with respect to the average time is generally about the same as or smaller than the speedup with respect to the minimum time, especially in HPCG, where the speedup in average time is only 2.78% compared to the 10.85% speedup in minimum time. Such a drop in speedup likely stems from increased inherent performance variability introduced by the sole mutation discovered. Running these programs more than 5 times, or further static program analysis, may be needed to determine a more robust speedup result. BO, on the other hand, finds mutation subsets that cause speedup in only 6 out of 10 programs, as it cannot find such a subset for SP. Furthermore, the speedup discovered by BO does not exceed that of delta debugging except in HPCG, where it shows a 7.61% speedup in average running time and 14.54% in minimum running time.
We have also recorded the number of mutations applied in the subsets that cause the highest speedup with both algorithms (column 9 in Table 1), compared to the total number of possible mutations (column 8). The best subsets found by BO all contain more mutations, except for EP, for which neither algorithm shows speedup. When we investigate all tryouts and their running times in programs such as SP, we deduce that many mutations in these programs cause negative speedup, while only a few cause positive speedup. BO works worse than delta debugging on such programs because it takes more tryouts to remove mutations with negative speedup from consideration. On the other hand, programs like HPCG likely have a few mutations that cause large speedup, while most others do not cause negative speedup. In these cases, BO works better than delta debugging and can find a subset containing mutations with both large and small speedups.

Related Work
Mutation Testing. Mutation testing has been proposed to identify correctness defects [25]. The assumption in mutation testing is that a syntactic change (a mutant) can help discover program defects. Mutation testing, however, has not been widely applied to HPC programs or to performance defects. Some attempts to build mutation testing for cloud systems have been reported [8]. Mutation operators (i.e., syntactic changes) have been proposed to reveal faults in small MPI programs [46]. With the increased use of LLVM, researchers are exploring support for mutation testing in LLVM [11]. To the best of our knowledge, the only work that considers mutation testing for performance is [10]. However, this work does not consider parallelism or mutations in numerical (floating-point) code, two aspects that are critical to HPC applications. To the best of our knowledge, we are the first to explore using mutation testing for performance in OpenMP scientific codes.
General Auto-tuning. There is a significant corpus of past work on auto-tuning techniques. Typical examples include ATLAS [50], Active Harmony [48], FFTW [19], POET [54], CHILL [9], GEIST [49], OpenTuner [4], CLTune [38], Apollo [6,52], and Dutta et al. [15,16]. Their common theme is that they tune compile-time parameters, such as tiling, or runtime parameters, such as the number of threads, presupposing a given source code representation of the program. Typical search algorithms they propose for tuning include random, grid, or Bayesian search, or various machine learning-based search models. By contrast, Muppet mutates the source code of the program, which exposes a large combined set of both source code modifications, as compile-time parameters, and their possible configurations, as runtime parameters, to tune. Furthermore, Muppet automates the generation of tuned source code variants without user intervention, and it is the first to propose the delta debugging search algorithm for tuning. Integrating machine learning techniques for fast searching in Muppet is an interesting future extension.
A number of papers research domain-specific tuning using code generation, alternate data layouts, or algorithmic parameters, such as [2,13,20,27,37] for linear algebra kernels and [33,41,42,55] for stencils. Those approaches require users to express their programs in specialized domain-specific languages amenable to tuning, which limits their generality. Muppet tunes unaltered, user-provided, general OpenMP code, generating tuned source code variants and optimizing runtime parameters.
Auto-tuning OpenMP. Specifically for OpenMP, Adaptive OpenMP [23,29] and Sreenivasan et al. [47] propose OpenMP language extensions to support auto-tuning of OpenMP regions, such as the scheduling policies of parallel loops and the number of threads or teams. Those approaches require significant refactoring of the code and domain-specific knowledge from the programmer to successfully integrate tuning extensions and their possible configuration parameters into their OpenMP code. Instead, Muppet treats source code modifications as a tunable parameter and independently explores the runtime configuration space.
Bliss [43] proposes probabilistic Bayesian optimization to tune hardware parameters (core frequency, hyperthreading) and software execution parameters (OpenMP threads, algorithmic alternatives) for the whole application, specified by the user. Bliss does not modify the program's source code and tunes all regions in unison; by contrast, Muppet both enables source code modifications and specializes tuning to each region, since mutations are region-specific.
Scalable Record-Replay [39] is a mechanism that extracts the LLVM IR of OpenMP GPU target region kernels to tune, for each kernel in parallel, the GPU launch bounds as compile-time parameters, by modifying the IR and re-compiling, and the number of threads/teams as runtime parameters. Performing the kinds of mutations Muppet applies on LLVM IR is challenging compared to source code, which motivates our choice of a source code mutation tool. Nevertheless, the idea of extracting OpenMP regions and tuning them independently is a possible extension to Muppet to speed up search time.

Conclusion
We presented Muppet, a novel application of mutation testing aimed at improving the performance of OpenMP programs. Muppet uses different search algorithms to apply and compose program mutations that reduce application execution time. Because program transformations are performed at the source level, Muppet's mutations are transferable across different OpenMP implementations and compilers. We demonstrate that Muppet is capable of identifying mutations that improve performance in 70% of the evaluated programs, achieving a maximum speedup in average running time of 15.64%.
In the future, we plan to extend Muppet to automatically update OpenMP code bases with the latest OpenMP features that improve performance while maintaining correctness. Currently, it is the responsibility of the code maintainer to manually update their code base to use newly available OpenMP features, which requires significant manual effort. The source code and data of Muppet are publicly available at https://github.com/LLNL/MUPPET/.

Figure 1 .
Figure 1. The workflow of Muppet. Red italic text indicates that a mutation is applied or the source file is changed.
Mutation Classes. There are four types of mutations that can be applied at certain source code locations: ... tile the loops, and thus may also introduce performance speedup. Due to the difficulty of determining loop sizes at compile time, Muppet only supports a fixed set of differently sized tiles as different mutations; for example, the tile size can only be set to 8, 16, or 32. Given this limitation, users can still see from the optimization results whether a smaller or larger tile yields a higher speedup. 4. Firstprivate mutations put read-only shared variables into a firstprivate clause of an OpenMP parallel region, in order to reduce dependencies between parallel threads.
Correctness. An example entry for a locally stored simple matrix multiplication program is shown in Listing 2. It sets up commands for each step used in Muppet, such as building, calling plugins for mutations, running the program, extracting running time statistics from program output, and cleaning. The only required changes to the matmul source code are (a) modifying the build scripts (the Makefile in this case) so that they accept parameters for calling the Clang plugins, and (b) adding correctness check code that parses program output to determine whether the mutated program still runs correctly.

Listing 2. YAML config file for matmul (fragment).

    ... 'make CC=clang++ OPT_LEVEL=3 OMP=1']
    ... 'Work consumed (\d+\.\d+) seconds'
    clean: 'rm -r *.*; cp ../../../../extra/matmul/*.* .'

Table 1 .
Mutation speedup discovered by delta debugging and Bayesian Optimization.