Guiding Greybox Fuzzing with Mutation Testing

Greybox fuzzing and mutation testing are two popular but mostly independent fields of software testing research that have so far had limited overlap. Greybox fuzzing, generally geared towards searching for new bugs, predominantly uses code coverage for selecting inputs to save. Mutation testing is primarily used as a stronger alternative to code coverage in assessing the quality of regression tests; the idea is to evaluate tests for their ability to identify artificially injected faults in the target program. But what if we wanted to use greybox fuzzing to synthesize high-quality regression tests? In this paper, we develop and evaluate Mu2, a Java-based framework for incorporating mutation analysis in the greybox fuzzing loop, with the goal of producing a test-input corpus with a high mutation score. Mu2 makes use of a differential oracle for identifying inputs that exercise interesting program behavior without causing crashes. This paper describes several dynamic optimizations implemented in Mu2 to overcome the high cost of performing mutation analysis with every fuzzer-generated input. These optimizations introduce trade-offs in fuzzing throughput and mutation killing ability, which we evaluate empirically on five real-world Java benchmarks. Overall, variants of Mu2 are able to synthesize test-input corpora with a higher mutation score than state-of-the-art Java fuzzer Zest.

The key idea of coverage-guided greybox fuzzers is to evolve a corpus of test inputs via an evolutionary search that maximizes code coverage: in each iteration, a new input is synthesized by randomly mutating an existing input from the corpus. The mutated input is added to the corpus if the corresponding execution of the test program increases code coverage.
Fuzzing is traditionally used to discover inputs that crash programs and reveal security vulnerabilities [5,11,14,20,25,42,50,54,59,68]. In the absence of new bugs, fuzzers are evaluated based on code coverage achieved during the fuzzing campaign [10,48]. However, in the vast majority of fuzzing research, the end goal is to find bugs in the moment [42]; not much attention is paid to the inputs saved along the way.
In this paper, we explicitly focus on the quality of the test-input corpus produced at the end of a fuzzing campaign. Such a corpus can be used for continuous regression testing during subsequent program development. This practice is recommended by Google's OSS-Fuzz [28], and is already adopted by some mature projects. For example, in SQLite, "Historical test cases from AFL, OSS Fuzz, and dbsqlfuzz are collected [...] and then rerun by the fuzzcheck utility program whenever one runs make test" [71]. Similarly, OpenSSL uses several distinct fuzzer-generated corpora and their corresponding fuzz drivers for continuous testing [72]. Even though these test corpora are used for regression testing, the only metric being targeted by conventional greybox fuzzers is code coverage. However, coverage alone is not necessarily the strongest predictor of fault detection ability [15,36]. Now, the technique of mutation testing [19] (a.k.a. mutation analysis), which evaluates the ability of tests to catch artificially injected bugs, has shown promise as an adequacy criterion for improving test-suite effectiveness [15,39,64]. A test is said to kill a program mutant if it fails when executed on the mutant, whereas mutants on which no test fails are said to survive. A goal of mutation testing is to produce a test corpus that has a high mutation score, defined as the fraction of all mutants that are killed by the test suite. A natural question thus arises: can we use mutation scores to guide the fuzzer?

Figure 1: A mutation-analysis-guided fuzzing loop. Each fuzzer-generated input is run through a set of program mutants to compute a mutation score. Inputs are saved to the corpus if they improve mutation score.
In this paper, we develop and evaluate a framework for incorporating mutation analysis in the fuzzing loop, building on our previous work which first proposed the approach [46]. The idea is as follows (see Fig. 1): after a new input is synthesized by a fuzzer via random mutation of a previously saved input, it is evaluated by executing a set of mutants of the program under test. If the new input kills any previously surviving program mutant, then it is added to the corpus. In this process, we distinguish between input mutations (e.g., randomly setting input bits or fields to zero) and program mutations (e.g., replacing the expression a+b with a-b in the target's source code). Our Java-based implementation, called Mu2 (for Mutation-Based Greybox Fuzzing + Mutation Testing), incorporates program mutations from the popular PIT toolkit [17] into a custom guidance in the JQF [58] greybox fuzzing framework. Mu2 is open source and available at: https://github.com/cmu-pasta/mu2.

This paper details two main aspects of Mu2's design. First, with a conventional fuzzing oracle that only identifies program crashes or aborts, many inputs will be discarded for not killing any mutant even though they exercise interesting program functionality. For mutation testing to be useful, we need a stronger test oracle. Mu2 incorporates the idea of differential mutation testing, which validates the output of program execution. Second, evaluating each fuzzer-generated input on the set of all program mutants is prohibitively expensive and sharply reduces fuzzing throughput. Mu2 prunes the set of mutants to run at each fuzzing iteration using dynamic analysis of the original program's execution in two ways: (a) sound optimizations that prune mutants which cannot be killed by a given input, and (b) aggressive optimizations that select only a bounded subset of candidate mutants to run in each iteration.
We evaluate Mu2 on five real-world Java targets, using the state-of-the-art greybox fuzzer Zest [59], which is also built on top of the JQF framework, as a baseline. We also empirically evaluate 7 variants of Mu2 that employ different strategies for improving performance. Our combined evaluation represents 21,600 CPU-hours (2.5 CPU-years) of fuzzing campaigns.
Our results indicate: (1) an optimized version of Mu2 improves mutation scores by up to 20% across five benchmarks (5% on average); (2) mutation-analysis feedback generates test-input corpora with higher reliability of killing nontrivial mutants compared to coverage-only feedback; and (3) the differential testing oracle is significantly valuable to Mu2, detecting 30% more mutants on average than a conventional fuzzing oracle.

To summarize, this paper makes the following contributions: (1) We investigate the various challenges of combining mutation testing and greybox fuzzing, and propose solution approaches to include in our framework. (2) We incorporate differential testing as an oracle for mutation testing in the fuzzing loop and find that it significantly improves the strength of the fuzzing oracle.

BACKGROUND

2.1 Greybox Fuzzing and Corpus Generation
Coverage-guided greybox fuzzing (CGF) is a technique for automatic test-input generation using lightweight program instrumentation. It was first popularized by open-source tools such as AFL [81] and libFuzzer [49], but has since been heavily studied and variously extended in academic research [5,9,14,20,25,44,50,52,58,59]. Algorithm 1 describes the basic greybox fuzzing algorithm, with many details elided. First, a corpus of test inputs is initialized with a set of one or more seed inputs (Line 2), which could be user-provided or randomly generated. Then, in each iteration of the fuzzing loop (Line 3), a new input is synthesized by first picking an existing input x from the corpus (Line 4) and then performing random mutations to produce x′ (Line 5). The heuristics to sample an input (PickInput) vary, and often use some sort of energy schedule [9]. Some inputs may also be marked as favored, and receive higher energy than other inputs. The random mutations performed on x to get x′ (MutateInput) also vary depending on the known format of inputs (e.g., bitflips for binary data or random keyword insertion for text files). Structure-aware fuzzing tools [4,44,59,65,76] perform mutations that preserve the syntax or type safety of inputs, e.g., by mutating parse trees using a grammar or by mutating pseudorandom choices backing a QuickCheck-like [16] generator function. The program under test is then executed with the new input x′, using lightweight instrumentation to collect code coverage during execution. The function coverage referenced in Algorithm 1 returns the set of program locations executed when processing an input. If the run of x′ causes new code to be covered (Line 8), then x′ is saved to the corpus (Line 9); thus, x′ may be used as the basis for further input mutation in subsequent iterations of the fuzzing loop.
Algorithm 1 Coverage-guided greybox fuzzing
 1: procedure CGF(Program P, Set of inputs seeds, Budget B)
 2:   corpus ← seeds                          ⊲ Initialize saved inputs
 3:   repeat                                  ⊲ Fuzzing loop
 4:     x ← PickInput(corpus)                 ⊲ Sample using heuristics
 5:     x′ ← MutateInput(x)
 6:     if running P(x′) leads to a crash then
 7:       raise x′                            ⊲ Bug found!
 8:     if coverage(P, x′) ⊈ ⋃_{x ∈ corpus} coverage(P, x) then
 9:       corpus ← corpus ∪ {x′}
10:   until budget B
11:   return corpus                           ⊲ Final corpus

If the execution of any synthesized input x′ causes the program to crash, then a bug is reported (Line 7). The fuzzing loop continues until a user-provided resource budget runs out (Line 10), where this budget may be in terms of the number of fuzzing trials (i.e., iterations of the fuzzing loop) or in terms of wall-clock time. The corpus of fuzzer-synthesized test inputs is finally returned (Line 11) and may be used either as a regression test suite, for seeding future fuzzing campaigns, or for other applications [28,55,71,72,75]. The quality of the final test-input corpus is often evaluated using code coverage [10,42], though mutation scores, which we describe in the next section, have also been used [75].
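The loop of Algorithm 1 can be illustrated with a minimal Java sketch. Everything in it is an assumption made for illustration: the toy `coverage` function stands in for an instrumented program under test, `mutateInput` overwrites a single random byte, PickInput is uniform sampling rather than an energy schedule, and the crash check (Lines 6-7) is omitted because the toy program never crashes. It is not how AFL, Zest, or JQF are implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Toy coverage-guided fuzzing loop; all names here are illustrative. */
class CgfSketch {
    /** Hypothetical program under test: returns the IDs of the branches it executed. */
    static Set<Integer> coverage(byte[] x) {
        Set<Integer> cov = new HashSet<>();
        cov.add(0);
        if (x.length > 0 && x[0] == 'f') {
            cov.add(1);
            if (x.length > 1 && x[1] == 'u') {
                cov.add(2);
            }
        }
        return cov;
    }

    /** MutateInput: overwrite one random byte (a stand-in for real mutation heuristics). */
    static byte[] mutateInput(byte[] x, Random rng) {
        byte[] y = x.clone();
        if (y.length > 0) {
            y[rng.nextInt(y.length)] = (byte) ('a' + rng.nextInt(26));
        }
        return y;
    }

    static List<byte[]> fuzz(List<byte[]> seeds, int trials, long rngSeed) {
        Random rng = new Random(rngSeed);
        List<byte[]> corpus = new ArrayList<>(seeds);           // Line 2
        Set<Integer> seen = new HashSet<>();                     // union of corpus coverage
        for (byte[] s : seeds) {
            seen.addAll(coverage(s));
        }
        for (int t = 0; t < trials; t++) {                       // Lines 3 and 10
            byte[] x = corpus.get(rng.nextInt(corpus.size()));  // Line 4 (uniform PickInput)
            byte[] xPrime = mutateInput(x, rng);                 // Line 5
            Set<Integer> cov = coverage(xPrime);
            if (!seen.containsAll(cov)) {                        // Line 8: new coverage?
                corpus.add(xPrime);                              // Line 9
                seen.addAll(cov);
            }
        }
        return corpus;                                           // Line 11
    }

    public static void main(String[] args) {
        List<byte[]> seeds = Collections.singletonList(new byte[] {'a', 'a'});
        List<byte[]> corpus = fuzz(seeds, 5000, 42L);
        System.out.println("corpus size: " + corpus.size());
    }
}
```

Because an input is saved only when it adds at least one new branch ID, the corpus in this toy can never grow beyond the seed plus two inputs, which is the property that keeps coverage-guided corpora small.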

Mutation Testing
Mutation testing (also known as mutation analysis) is a methodology for assessing the adequacy of a set of tests using artificially injected "bugs", or program mutants [19,37]. In assessing test adequacy [27], we are given a program P and a suite of passing tests T. The goal is to evaluate the quality of T by computing a score that grows monotonically [79] with additions to the set T. Code coverage is an example of a test adequacy criterion.
In mutation testing, a set of program mutants, say Mutants(P), is first generated. Each mutant P′ ∈ Mutants(P) is a program that differs from P in a very small way. Most commonly, mutations are replacements of program expressions. For example, an expression a+b at line 42 in P may be replaced with the expression a-b. We can use the notation ⟨P, a+b, a-b, 42⟩ to refer to this mutation. For purposes of this paper, we use the notation P′ = ⟨P, e, e′, ℓ⟩ to refer to a program mutant P′ as a modification of program P where expression e is replaced with e′ at program location ℓ. The main idea is that a program mutation simulates a simple programmer error or an artificially injected "bug". The test suite T is then run on each mutant P′. If some test t ∈ T fails when run on mutant P′, then the mutant P′ is said to be killed, which we denote as kills(P′, t). If the test suite still passes, then the mutant P′ is said to survive.
Ideally, we want our tests to be able to identify "bugs" and so we hope to have tests that fail on each mutant P′. So, the adequacy of test suite T is defined by the mutation score, which is computed as the fraction of mutants killed:

$$\frac{\left|\{\, P' \in \mathrm{Mutants}(P) \mid \exists t \in T : \mathrm{kills}(P', t) \,\}\right|}{\left|\mathrm{Mutants}(P)\right|}$$

In general, a mutation score of 100% is rarely achievable because some mutants P′ may actually be equivalent to P; that is, ∀x : P(x) = P′(x). Similar to code coverage, where 100% may not be achievable due to unreachable code, the best use of the adequacy score is as a relative measurement rather than an absolute one.
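As a small illustration of the mutation score, the following Java sketch computes it for a toy program P(a, b) = a + b with three hand-written mutants. The class, field, and method names are hypothetical, and modelling a "test" as an input pair that fails whenever the mutant's output differs from P's is an assumption of the sketch. The third mutant (b + a) is equivalent to P, so no test suite can push the score above 2/3.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.IntBinaryOperator;

class MutationScoreSketch {
    /** Fraction of mutants killed by at least one test in the suite. */
    static <M, T> double mutationScore(List<M> mutants, List<T> tests, BiPredicate<M, T> kills) {
        if (mutants.isEmpty()) {
            return 0.0; // assumption: define the score of an empty mutant set as 0
        }
        long killed = mutants.stream()
                .filter(m -> tests.stream().anyMatch(t -> kills.test(m, t)))
                .count();
        return (double) killed / mutants.size();
    }

    // Original program P and three mutants of the expression a+b.
    static final IntBinaryOperator ORIGINAL = (a, b) -> a + b;
    static final List<IntBinaryOperator> MUTANTS = Arrays.asList(
            (a, b) -> a - b,   // a+b replaced with a-b
            (a, b) -> a * b,   // a+b replaced with a*b
            (a, b) -> b + a);  // equivalent mutant: can never be killed

    /** A test is an input pair; it fails on a mutant whose output differs from P's. */
    static final BiPredicate<IntBinaryOperator, int[]> KILLS =
            (m, t) -> m.applyAsInt(t[0], t[1]) != ORIGINAL.applyAsInt(t[0], t[1]);

    public static void main(String[] args) {
        List<int[]> weak = Arrays.asList(new int[] {2, 2});
        List<int[]> strong = Arrays.asList(new int[] {2, 2}, new int[] {1, 2});
        System.out.println(mutationScore(MUTANTS, weak, KILLS));   // only a-b is killed
        System.out.println(mutationScore(MUTANTS, strong, KILLS)); // a-b and a*b are killed
    }
}
```

Adding the input (1, 2) raises the score from 1/3 to 2/3, which shows the monotonic growth expected of an adequacy criterion.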

MUTATION-ANALYSIS-GUIDED FUZZING

3.1 Problem Statement and Scope
In this paper, we focus on the following problem: Can we use mutation analysis to guide greybox fuzzing in order to synthesize a test-input corpus with high mutation score?
Recently, Gopinath et al. [31] have identified and discussed several challenges of combining mutation analysis with fuzzing, including (1) the strength of oracles used by the fuzzer, (2) the computational expense of performing mutation analysis, (3) dealing with equivalent mutants, and (4) the lack of mutation testing frameworks that focus on fuzzers. We directly address such challenges in this paper. Oracles are discussed in Section 3.3 and performance concerns in Section 3.4. Our evaluation is not dependent on identifying equivalent mutants, since we only care about relative mutation scores (higher=better) rather than the exact number of mutants killed by a test-input corpus. Section 3.4.2 deals with reducing the performance impact of equivalent mutations.
Scope. Since there is a vast amount of literature on the many variables involved in mutation analysis, as surveyed by Papadakis et al. [63], we restrict ourselves in this paper to investigating only the aspects of combining mutation analysis with greybox fuzzing. In particular, we (1) work with the assumption that a high mutation score is a desirable property of a test-input corpus used for regression testing, referring the reader to several empirical studies examining the relationship between mutation scores and real faults [2,12,15,32,39,41,64], and (2) directly use the default set of mutation operators provided by PIT (ref. Section 2.2), which have been chosen based on several empirical studies of effectiveness, sufficiency, and to align with developer expectations [1,17,45,57].

The Mu2 Framework
To address our problem statement, we present the mutation-analysis-guided greybox fuzzing technique in Algorithm 2. This is an extension of Alg. 1, with changes highlighted in grey. The key additions of this algorithm are in evaluating whether a fuzzer-generated input x′ should be saved to the corpus. The function ProgMuts2Run (Line 8) returns a set of program mutants to evaluate with input x′. For now, assume it to return Mutants(P) as defined in Section 2.2, though we will refine this in Section 3.4.2.

Algorithm 2
     [...]
 4:     x ← PickInput(corpus)
     [...]
 8:     for all P′ ∈ ProgMuts2Run(P, corpus, x′) do
 9:       if kills(P′, x′) ∧ P′ ∉ killed(P, corpus) then
10:         corpus ← corpus ∪ {x′}
11:   until budget
12:   return corpus
13: function killed(Program P, Set of inputs X)
14:   [...]

We then determine whether the input x′ is the first input to kill some mutant P′. If P′ is killed by x′ and P′ has not previously been killed by any input in the corpus (Lines 9 and 14), then we add x′ to the corpus (Line 10). Broadly, this algorithm saves fuzzer-generated inputs if they increase either code coverage or mutation score. Additionally, inputs that increase mutation score are marked as favored, giving them more energy to be picked for fuzzing (Line 4). As before, the final corpus of fuzzer-generated inputs is returned as the result (Line 12). We have implemented Algorithm 2 for fuzzing Java programs by integrating PIT [17] into JQF [58]. We call this system Mu2, since it combines Mutation-based Greybox Fuzzing with Mutation Testing.
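The mutation-score half of the save decision (Lines 8-10 of Algorithm 2) can be sketched in Java as follows. This is a simplified illustration, not Mu2's actual guidance code: the class name, the generic `kills` predicate, and the choice to cache the killed set incrementally instead of recomputing killed(P, corpus) on every call are all assumptions. The coverage-based save condition is left out for brevity.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

/** Tracks which mutants the corpus has already killed and decides whether to save an input. */
class MutationGuidanceSketch<M, X> {
    private final Set<M> killedSoFar = new HashSet<>(); // cached killed(P, corpus)
    private final BiPredicate<M, X> kills;               // kills(P', x'), supplied by the caller

    MutationGuidanceSketch(BiPredicate<M, X> kills) {
        this.kills = kills;
    }

    /** Returns true if xPrime is the first input to kill at least one of the given mutants. */
    boolean shouldSave(List<M> mutantsToRun, X xPrime) {
        boolean save = false;
        for (M mutant : mutantsToRun) {                   // Line 8
            if (killedSoFar.contains(mutant)) {
                continue;                                 // already killed: cannot trigger a save
            }
            if (kills.test(mutant, xPrime)) {             // Line 9
                killedSoFar.add(mutant);
                save = true;                              // Line 10: caller adds xPrime to the corpus
            }
        }
        return save;
    }

    int killedCount() {
        return killedSoFar.size();
    }

    public static void main(String[] args) {
        // Hypothetical mutants identified by name; "m1" dies on positive inputs, "m2" on inputs > 10.
        MutationGuidanceSketch<String, Integer> guidance = new MutationGuidanceSketch<>(
                (m, x) -> (m.equals("m1") && x > 0) || (m.equals("m2") && x > 10));
        List<String> mutants = Arrays.asList("m1", "m2");
        System.out.println(guidance.shouldSave(mutants, 5));  // kills m1 for the first time
        System.out.println(guidance.shouldSave(mutants, 7));  // kills nothing new
        System.out.println(guidance.shouldSave(mutants, 11)); // kills m2 for the first time
    }
}
```

Skipping mutants that are already killed is what makes the second input (7) uninteresting even though it would still kill m1.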
We chose PIT and JQF because of their maturity, extensibility, and their common target platform. As described in Section 2.2, PIT is an actively developed mutation testing framework that operates on JVM bytecode. The JQF framework [58] was originally designed for coverage-guided property-based testing, which is a structure-aware variant of greybox fuzzing (ref. Section 2.1), and instruments JVM bytecode for collecting code coverage. JQF also has a highly extensible design for creating pluggable guidances, which supports rapid prototyping of new fuzzing algorithms [43,55,56,59,69,75,83].
In Mu2, Mutants(P) includes all of PIT's default expression mutation operators (ref. Sections 2.2 and 3.1). For heuristics such as PickInput and MutateInput, Mu2 reuses the logic and code from Zest [59], which we also use as a baseline for evaluation (Section 4).

Oracle: Differential Mutation Testing
One challenge of mutation-analysis-guided fuzzing is determining whether a program mutant is killed by a particular input. This corresponds to the kills function invoked in line 9 of Algorithm 2.
In mutation testing, a program mutant P′ is considered killed if any test in the test suite fails. The logic that determines whether a test passes or fails is known as the test oracle.
Greybox fuzzing generally relies on implicit oracles, which aim to detect anomalous behavior such as crashes or uncaught exceptions, or property tests, which assert a predicate over the output of some computation. For example, consider the insertion sort method defined in Figure 2 and the following test method, which is written in the property-testing style using JQF's @Fuzz annotation:

    @Fuzz // Inputs generated using greybox fuzzing
    [...]

Figure 2: Java program that implements insertion sort, annotated with four sample program mutants.

For Mu2, we could use this property test as an oracle. Consider the following examples, using the notation introduced in Section 2.2: executing mutant P′₁ = ⟨Sort, i+1, i, 10⟩ with input array x = [3, 2, 1] would result in an uncaught IndexOutOfBoundsException (-1) on line 10, triggering a failure via the implicit oracle. Additionally, executing P′₂ = ⟨Sort, i>=0, i>0, 5⟩ with x would result in an assertion failure in the property test because the result of P′₂(x) would be the array [3, 1, 2], which is not sorted. So, both mutants P′₁ and P′₂ would get killed by the fuzzer if it discovers such an input.
Unfortunately, the property test is not a complete oracle in that it does not fully specify the expected behavior of the sort function. Consider a third mutant P′₃ = ⟨Sort, arr[i], 1, 7⟩, which assigns a constant to every array element at line 7. This is clearly a bug in insertion sort, yet the output is always sorted. For example, when [...]. Such a mutant would incorrectly survive on any input the fuzzer generates.
Writing a complete oracle for testing insertion sort is possible, but quite cumbersome. In general, this is a hard problem [6]. For many applications, a complete oracle would need to be as complex (or in some cases exactly the same) as the original program itself.
In Mu2, we use the well-known concept of differential testing to define our oracle. In differential testing [21,53], different implementations of a program that are expected to satisfy the same specification are executed on a single input, and their results are compared to identify discrepancies. In Mu2, our different "implementations" are the original program and program mutants; any discrepancy between the original program output and a mutant's output leads to that mutant being killed.
To support the comparison of outputs, we create a differential mutation testing framework. This allows for (1) output values to be returned from a fuzzing driver (as opposed to the void returns used by conventional property testing methods) and (2) a user-defined comparison function for specifying how outputs from the original program and a program mutant should be compared. An example of differential mutation testing methods in our framework is shown in Figure 3. The @Diff method runInsertionSort returns an output value of type int[]. The user-defined comparison method checkEq simply determines if the output arrays are equal. If unspecified, the @Compare function defaults to the java.util.Objects.equals() method. Our interface is general enough to support complex differential testing oracles such as the ones used in CSmith [80].

Figure 3: A Mu2 differential mutation test driver and comparison method for the insertionSort method (Fig. 2).
With differential mutation testing, we are able to kill mutants such as P′₃. We can now precisely define kills(P′, x), which was referenced in Algorithm 2. Given a mutant P′ = ⟨P, e, e′, ℓ⟩ and an input x, kills(P′, x) holds if: (1) P(x) and P′(x) both produce outputs and the comparison function reports them as different, where the comparison function is the user-defined @Compare method (e.g., checkEq in Figure 3) or Object.equals() if one is not defined; or (2) P(x) produces an output but executing P′(x) results in an uncaught run-time exception being thrown; or (3) executing P′(x) takes longer than a predefined TIMEOUT. The timeout is required for killing mutants such as P′₄ = ⟨Sort, i-1, i, 8⟩, which effectively removes the decrement of i, leading to an infinite loop on the input [3,1,2].
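The three kill conditions can be made concrete with a self-contained Java sketch. Several things here are assumptions rather than Mu2's implementation: the insertion sort body is a plausible reconstruction (the listing of Figure 2 is not reproduced in this text, so line numbers do not carry over), mutants are selected by an integer switch instead of bytecode rewriting, and the timeout is emulated by a loop-iteration budget in the spirit of instrumenting backward jumps. Mutant IDs 1, 2, and 4 correspond to the mutations i+1 → i, i>=0 → i>0, and i-1 → i discussed above.

```java
import java.util.Arrays;
import java.util.function.BiPredicate;

class DiffOracleSketch {
    /** Thrown when a run exceeds its loop budget; a stand-in for the TIMEOUT condition. */
    static class LoopBudgetExceeded extends RuntimeException {}

    /** Insertion sort; mutant == 0 is the original program, 1, 2 and 4 select a mutation. */
    static int[] insertionSort(int[] input, int mutant, int budget) {
        int[] arr = input.clone();
        int steps = 0;
        for (int j = 1; j < arr.length; j++) {
            int key = arr[j];
            int i = j - 1;
            while ((mutant == 2 ? i > 0 : i >= 0) && arr[i] > key) { // mutant 2: i>=0 -> i>0
                if (++steps > budget) {
                    throw new LoopBudgetExceeded();
                }
                arr[i + 1] = arr[i];
                i = (mutant == 4) ? i : i - 1;                        // mutant 4: i-1 -> i
            }
            arr[(mutant == 1) ? i : i + 1] = key;                     // mutant 1: i+1 -> i
        }
        return arr;
    }

    /** kills(P', x): differing output, uncaught exception, or exceeded budget. */
    static boolean kills(int mutant, int[] x, BiPredicate<int[], int[]> compare, int budget) {
        int[] expected = insertionSort(x, 0, budget);       // ground truth P(x)
        try {
            int[] actual = insertionSort(x, mutant, budget);
            return !compare.test(expected, actual);         // condition (1): outputs differ
        } catch (LoopBudgetExceeded e) {
            return true;                                    // condition (3): timeout
        } catch (RuntimeException e) {
            return true;                                    // condition (2): uncaught exception
        }
    }

    public static void main(String[] args) {
        BiPredicate<int[], int[]> checkEq = Arrays::equals; // comparison function
        System.out.println(kills(1, new int[] {3, 2, 1}, checkEq, 1000)); // exception
        System.out.println(kills(2, new int[] {3, 2, 1}, checkEq, 1000)); // wrong output
        System.out.println(kills(4, new int[] {3, 1, 2}, checkEq, 1000)); // loop budget
        System.out.println(kills(2, new int[] {1, 3, 2}, checkEq, 1000)); // survives
    }
}
```

The last call shows why the choice of input matters: on [1, 3, 2] the mutated loop condition never changes the result, so that input cannot kill mutant 2 under any oracle.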
We evaluate the improvement in completeness using the differential oracle over the greybox fuzzing implicit oracle in Section 4.4.

Performance
The biggest challenge with incorporating mutation testing inside a fuzzing loop is performance. Given its need to execute many mutants on each iteration, mutation testing is in general a very expensive technique [63], so scaling Mu2 to real-world software is a non-trivial task. Two aspects of improving scalability are: (1) reducing the average time required to execute each program mutant, and (2) reducing the number of program mutants that must be evaluated at each iteration of the fuzzing loop.
3.4.1 Improving Performance of Mutant Execution. When running a mutation testing tool such as PIT [17], each mutant and test is run in a different JVM. For general mutation testing, this is ideal because it simplifies managing multiple copies of the same program (sans mutations), and prevents global state changes from one program mutant affecting the state of another program mutant. However, this is not necessary for Mu2. For in-process fuzzing, test driver methods are expected to be self-contained and not depend on global state. Like JQF and Zest, Mu2 is designed to work in a single JVM. Mu2 thus adopts a different strategy than PIT and takes advantage of the Java class-loader mechanism to load and run program mutants within the same JVM, essentially by having copies of the entire class hierarchy (one per mutant) in memory at the same time. First, a CoverageClassLoader (CCL) is responsible for loading the original target program and collecting code coverage using on-the-fly instrumentation. For differential testing, the CCL-loaded classes compute the ground-truth outcome P(x). Second, a family of MutationClassLoaders (MCL) are used to load program mutants; one MCL per mutant P′ = ⟨P, e, e′, ℓ⟩. When a mutant test program is loaded by the MCL, it performs on-the-fly bytecode instrumentation exactly at location ℓ, replacing expression e with e′ and loading the rest of the program without changing semantics. The MCL adds instrumentation at backward jumps (i.e., loops) in order to detect timeouts and exit test execution cleanly if necessary. Further, assuming that fuzz tests do not affect global state, Mu2 loads only one copy of each library class (defined as classes outside a specified package identifying the target application as long as they and their transitive dependencies do not reference any application class) using a common SharedClassLoader; this dramatically reduces memory pressure when mutating large programs.

Algorithm 3
 1: [...]
 2:   surviving ← mutants(P) \ killed(P, corpus)
 3:   [...]
 4:   [...]
 5:   if AGGRESSIVE_OPT is configured then
 6:     return filter(killable, AGGRESSIVE_OPT)
 7:   return killable
To validate our design, we ran an informal preliminary experiment of performing mutation analysis with PIT and with Mu2's in-memory setup on a fixed corpus of seed inputs for the Google Closure Compiler [30]. In the steady state (after the first 8 inputs), Mu2's in-memory analysis runs with a 9.6× speed-up over PIT.

3.4.2 Reducing the Number of Mutants to Run in the Fuzzing Loop.
For each trial (i.e., iteration of the fuzzing loop), (1) the input must be executed once by the original program and (2) the input must be executed by each mutant. Thus, we can model the time required to execute each trial as the following:

$$\mathrm{trialTime} = \mathrm{time}_{orig} + N \cdot \mathrm{avgTime}_{mut} \qquad (1)$$

where N = |ProgMuts2Run(P, corpus, x)| as per Algorithm 2.
Observe that the time per trial scales linearly with N. We can improve the fuzzing throughput (i.e., the number of trials executed per unit time) directly by reducing N. From Algorithm 2 (Lines 9-10), we can see that we only care about executing a program mutant if it will help us determine if a given input is the first input to kill it. We can therefore reduce N by dynamically pruning mutants whose execution will necessarily lead to Line 9 evaluating to false.
So, we begin by applying the following conditions for a given P′ = ⟨P, e, e′, ℓ⟩, which are shown in Algorithm 3, lines 2-4: (1) If P′ ∈ killed(P, corpus), then P′ does not need to be executed for any future inputs. (2) If the program mutant P′ applies a mutation to a program location ℓ, but ℓ is not covered when executing the original program P on x, then P′ cannot be killed by x. This corresponds to execution-based pruning in the PIE model [38]. (3) If we can guarantee that all dynamic evaluations of e during the execution of P on x are equivalent to the corresponding evaluations of the mutated expression e′, then P′ cannot be killed by x. This corresponds to infection-based pruning in the PIE model [38], which we implemented as a dynamic analysis of the execution of the original program P(x).
The last two strategies from the PIE model require additional overhead when executing P(x): (1) execution-based pruning depends on coverage instrumentation, and (2) infection-based pruning requires evaluating and comparing the mutated expression each time the original expression is executed by P(x). Referring to Equation 1, the optimization results in a trade-off for trialTime due to the increase in time_orig and the decrease in the number of mutants to run, N. However, we find this is quite beneficial overall. Table 1 shows the results of preliminary experiments on 4 benchmarks included in our evaluations in Section 4 to validate these optimizations; clearly, they improve performance significantly. We note that all the pruning methods mentioned above are sound optimizations: a mutant is pruned only if it is guaranteed to survive when executed. Effectively, we are pruning mutants that are equivalent modulo inputs [47].
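The three sound pruning conditions, together with the cost model of Equation 1, can be sketched in Java as below. The data model is an assumption made for illustration: each mutant is reduced to an ID and a location, and the dynamic analysis of the original run is summarized as a set of covered locations plus a set of "infected" mutant IDs (mutants whose mutated expression evaluated differently from the original at least once). The numbers in `main` are hypothetical, not measurements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SoundPruningSketch {
    /** Hypothetical mutant descriptor: an ID plus the program location it mutates. */
    static final class Mutant {
        final int id;
        final int location;

        Mutant(int id, int location) {
            this.id = id;
            this.location = location;
        }
    }

    /** Keeps only the mutants that the current input could still kill for the first time. */
    static List<Mutant> killable(List<Mutant> all, Set<Integer> killedIds,
                                 Set<Integer> coveredLocations, Set<Integer> infectedIds) {
        List<Mutant> result = new ArrayList<>();
        for (Mutant m : all) {
            if (killedIds.contains(m.id)) {
                continue; // (1) already killed by some input in the corpus
            }
            if (!coveredLocations.contains(m.location)) {
                continue; // (2) execution-based pruning: mutated location never reached
            }
            if (!infectedIds.contains(m.id)) {
                continue; // (3) infection-based pruning: e and e' always evaluated the same
            }
            result.add(m);
        }
        return result;
    }

    /** Equation 1: trialTime = time_orig + N * avgTime_mut. */
    static double trialTime(double timeOrig, int numMutants, double avgTimeMut) {
        return timeOrig + numMutants * avgTimeMut;
    }

    public static void main(String[] args) {
        List<Mutant> all = Arrays.asList(
                new Mutant(1, 10), new Mutant(2, 10), new Mutant(3, 20), new Mutant(4, 30));
        Set<Integer> killed = new HashSet<>(Arrays.asList(1));
        Set<Integer> covered = new HashSet<>(Arrays.asList(10, 20));
        Set<Integer> infected = new HashSet<>(Arrays.asList(2));
        List<Mutant> toRun = killable(all, killed, covered, infected);
        System.out.println("mutants to run: " + toRun.size());
        System.out.println("unpruned trial time: " + trialTime(2.0, all.size(), 1.5));
        System.out.println("pruned trial time: " + trialTime(2.0, toRun.size(), 1.5));
    }
}
```

In this example only mutant 2 remains: mutant 1 is already killed, mutant 4 sits at an uncovered location, and mutant 3 is covered but never infected.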

3.4.3 Aggressive Mutant Selection Optimizations. While the execution and infection optimizations significantly improve the overall throughput of Mu2, the factor N in Equation 1 still grows linearly with the size of the program (more code = more mutants). We can be aggressive about reducing N by attempting to bound it by a constant k, at the risk of potentially missing out on analyzing some mutants that could have been killed by a given input. We call these aggressive optimizations. We use the function filter in Algorithm 3 (Line 6) to optionally apply a selection strategy [66,73] that returns a bounded subset of the killable mutants. We have implemented two types of filters in Mu2: (1) k-Random Mutant Filter: For each generated input, k mutants are randomly sampled from the killable set in Alg. 3. (2) k-Least-Executed Mutant Filter: For each generated input, the killable mutants are sorted by the number of times they have been executed on previous inputs. The first k mutants are then selected. The goal is to prioritize executing mutants that have not been tested as frequently during the fuzzing campaign. This is a novel reduction strategy designed specifically for the fuzzing loop.
Section 4.2 evaluates the impact of these aggressive optimizations.
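The two aggressive filters can be sketched in Java as follows. This is an illustrative sketch under stated assumptions, not Mu2's code: mutants are represented by integer IDs, execution counts live in a plain map, and ties in the least-executed filter are broken by the order of the killable list (a consequence of `List.sort` being stable).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class AggressiveFilterSketch {
    /** How many times each mutant ID has been selected for execution so far. */
    private final Map<Integer, Integer> execCounts = new HashMap<>();

    /** k-Random filter: sample up to k killable mutants uniformly at random. */
    static List<Integer> kRandom(List<Integer> killable, int k, Random rng) {
        List<Integer> copy = new ArrayList<>(killable);
        Collections.shuffle(copy, rng);
        return new ArrayList<>(copy.subList(0, Math.min(k, copy.size())));
    }

    /** k-Least-Executed filter: pick the k killable mutants that have run least often. */
    List<Integer> kLeastExecuted(List<Integer> killable, int k) {
        List<Integer> sorted = new ArrayList<>(killable);
        sorted.sort(Comparator.comparingInt((Integer id) -> execCounts.getOrDefault(id, 0)));
        List<Integer> chosen = new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
        for (int id : chosen) {
            execCounts.merge(id, 1, Integer::sum); // the chosen mutants are about to be executed
        }
        return chosen;
    }

    public static void main(String[] args) {
        AggressiveFilterSketch filter = new AggressiveFilterSketch();
        List<Integer> killable = Arrays.asList(1, 2, 3);
        System.out.println(filter.kLeastExecuted(killable, 2)); // nothing executed yet
        System.out.println(filter.kLeastExecuted(killable, 2)); // mutant 3 is now least executed
        System.out.println(kRandom(killable, 2, new Random(0)).size());
    }
}
```

Over repeated calls the least-executed filter rotates through the killable set, so no surviving mutant is starved even though at most k mutants run per input.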

EVALUATION
We evaluate Mu2 on 5 different Java program benchmarks, using state-of-the-art coverage-guided fuzzer Zest [59] as the baseline. We structure our evaluation around four research questions:
RQ1: Does mutation-analysis guidance produce a higher quality test-input corpus than coverage-only feedback in greybox fuzzing?
RQ2: How do the performance optimizations impact the quality of the test-input corpus produced by mutation-analysis guidance?
RQ3: How does the reliability of killing nontrivial mutants differ between mutation-analysis guidance and coverage guidance?
RQ4: How much stronger is the differential mutation testing oracle than the implicit oracle?
Benchmarks. We consider five real-world Java programs: (1) ChocoPy [7,61] reference compiler (~6K LoC): The test driver (reused from [75]) reads in a program in ChocoPy (a statically typed dialect of Python) and runs the semantic analysis stage of the ChocoPy reference compiler to return a type-checked AST object. [...] The test driver (reused from [59] and [75]) takes in a JavaScript program and performs source-to-source optimizations. It then returns the optimized JavaScript code.
Mutation selection. Following previous work on semantic fuzzing [59,75], we filter on package names to identify classes relating to the core logic of the program under test. The mutation operators are then applied on these classes. We use the same generators, oracles, and filters for both Zest and Mu2. All of the test drivers return objects that override Object.equals, and were thus properly compared by the differential oracle.
Duration. Following best practices [42], we use a time bound of 24 hours for each experiment.
Repetitions. To account for the randomness in fuzzing, we run each experiment 20 times and report statistics.
Metrics. For our evaluations, we compute the branch coverage and mutation scores across each fuzzer-generated test-input corpus. We report mutation scores as the absolute number of mutants killed instead of as a fraction (ref. Section 2.2), since we only care about comparing these numbers across fuzzing variants, and since the denominator is meaningless when considering a single test entry point. We additionally compute the kill frequency of each of the [...]

Reproducibility and Data Availability. We have published a replication package and evaluation data at: https://doi.org/10.5281/zenodo.7647828. The evaluation data contains logs of fuzzing campaigns used to generate all evaluation figures and tables [74].

RQ1: Test-Input Corpus Quality
Does mutation-analysis guidance produce a higher quality test-input corpus than coverage-only feedback in greybox fuzzing? RQ1 focuses on evaluating mutation-analysis-guided fuzzing with a fixed time budget. A higher mutation score from the Mu2-produced corpus, together with comparable coverage results, would demonstrate that mutation analysis can be used as an off-the-shelf replacement for coverage-only guidance. We first discuss results for Mu2-Default, then evaluate two variants against the Zest baseline. Figure 4 visualizes the mutation scores for each fuzzer-generated corpus. The default mutation-analysis guidance (Mu2-Default) is able to produce a corpus with higher mutation scores than coverage-only feedback in the first three benchmarks, achieving statistically significant increases in all three. Additionally, Figure 5 shows equivalent branch coverage between Zest and Mu2 for these benchmarks. For the Tomcat WebXML parser, the number of killed mutants saturated at 239 in almost all of the repetitions of the fuzzing campaigns. For the Closure Compiler, our largest benchmark, the Mu2-Default corpora achieve, on average, approximately 17% less branch coverage than Zest (shown in Figure 5). This is likely due to the performance overhead of running mutation analysis for a large benchmark, and also likely accounts for the Zest corpora on average killing 12 more mutants than Mu2-Default, as covering code is a necessary condition for killing mutants in that part of the code. This suggests Mu2-Default may not scale well to very large programs.
One way to mitigate this slowdown is to add mutation-analysis feedback to coverage-guided fuzzing later in the campaign. The Mu2-Split variant utilizes coverage-only feedback for the first half of the campaign (which is very efficient) and then introduces expensive mutation-analysis feedback for the second half. This is based on an idea by Gopinath et al. [31], who suggested saturating coverage before adding mutation analysis to the fuzzing loop. The Mu2-Split-generated corpora show statistically significant increases in mutation score over Zest for the first 4 benchmarks (Fig. 4), although the effect for Tomcat is very small. There is also a major improvement over Mu2-Default in the Closure benchmark; Mu2-Split is able to bridge the gap in coverage (Fig. 5) and mutation scores (Fig. 4) that Mu2-Default had with the Zest baseline.
Another method of scaling Mu2 is to apply the aggressive optimizations detailed in Section 3.4.3. Mu2-OPT is a particular variant we chose that applies the k-Least-Executed filter with k = 10 mutants. The Mu2-OPT-generated corpus similarly achieves statistically significant increases in mutation scores across the first four benchmarks over Zest, with up to 20% increase in the Jackson JSON parser (Fig. 4). There is no significant difference between the mutation scores of Mu2-OPT and Zest on the Closure Compiler. Mu2-OPT achieves slightly less coverage than Zest on two benchmarks (ChocoPy and Closure) and more on one (Jackson); however, the differences are fairly small (below 2%).
We are also curious about whether the additional saving of mutant-killing inputs in Mu2 may bloat the size of the generated test-input corpus, impacting its use in regression testing. Table 2 displays the average sizes and runtimes for each fuzzer-generated corpus and shows that no such bloat occurs in Mu2. While there are some differences in the number of test inputs, the runtimes of the Mu2-produced corpora are not significantly higher than those produced by Zest. Thus, mutation-analysis-guided fuzzing is able to produce a higher quality test-input corpus and can be feasibly used for regression testing. We believe that an aggressively optimized version of mutation-analysis-guided fuzzing can be used as a replacement for coverage-guided fuzzing if the goal is to produce a test-input corpus with high mutation score. Mu2-OPT provides an improvement for 4 benchmarks and scales to the largest target without paying a performance penalty.

RQ2: Aggressive Optimizations
How do the performance optimizations impact the quality of the test-input corpus produced by mutation-analysis guidance?
This RQ focuses on understanding the benefit of the aggressive optimizations in mitigating the scalability concerns of Mu2-Default. We created variants Mu2-LeastExecuted-k and Mu2-Random-k, each applying the corresponding filter described in Section 3.4.3, and chose three different values of k ∈ {5, 10, 20}.
First, we measure just the performance benefit. Table 3 shows the speedups achieved by each variant over Mu2-Default, in terms of the number of inputs evaluated over a 24-hour period. The improvement for the benchmarks Gson and Jackson is relatively minor due to the already small number of mutants executed for each input after applying the execution and infection optimizations (ref. Section 3.4.2 and Table 1). However, the aggressive optimizations provide significant improvement for the larger benchmarks, with almost 25× speedup for the Mu2-LeastExecuted-5 variant on the Closure benchmark. This makes sense, as the main purpose of aggressive optimizations is to enable scaling to large programs.
Due to the aggressive nature of the mutant filtering, it is possible that input candidates that do kill mutants are not saved simply because those killable mutants were filtered. To determine whether the speedup actually results in a test-input corpus with higher mutation score, we must also measure the impact of these optimizations on the mutation score of the generated corpus. Figure 6 displays the mutation scores of all of the variants for each of the 5 benchmarks. At least one optimized variant was better than the default in all benchmarks. Somewhat surprisingly, we observe similar mutation scores between the Mu2-LeastExecuted-k and Mu2-Random-k variants for the same value of k in the first four benchmarks. The one exception is Closure Compiler, where Mu2-LeastExecuted-10 achieves a statistically significantly higher mutation score than Mu2-Random-10. Again, the effect of aggressive optimizations is most pronounced in the largest target.
The Jackson benchmark also illustrates the trade-off between execution speed and mutation score: although the Mu2-Random-5 variant has a faster execution speed than Mu2-Random-10 (Tab. 3) due to the smaller number of mutants, the mutation score slightly decreases (Fig. 6) since the optimization might skip some mutants at the wrong time. Nonetheless, the speedup displayed by the variants for the Closure Compiler results in better test-input corpus quality. On this benchmark, all of the optimized variants achieve statistically significantly higher mutation scores than Mu2-Default. Specifically, Mu2-LeastExecuted-10, Mu2-Random-5, and Mu2-LeastExecuted-5 kill ∼15 more mutants on average than Mu2-Default.
We found that Mu2-LeastExecuted-10 and Mu2-LeastExecuted-5 were the strongest variants: each showed a statistically significant increase in mutation score over Mu2-Default in 3 out of 5 benchmarks, more than any other variant. There was no significant difference in mutation scores between these two variants in any benchmark, so we arbitrarily picked Mu2-LeastExecuted-10 as the optimized version of mutation-analysis-guided fuzzing (Mu2-OPT) in our evaluation of RQ1 and RQ4. We do however note for future practitioners that the best aggressively optimized variant of Mu2 may change depending on the target program.
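To make the two filters concrete, the following sketch shows one plausible way to select k mutants per input. It is an illustration under assumptions rather than the Mu2 code: the class `MutantFilters`, the integer mutant ids, and the tie-breaking rule (candidate order is preserved among equally executed mutants) are all hypothetical. We assume "least executed" means the mutants that have been run the fewest times so far in the campaign.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: names and data layout are illustrative, not taken from Mu2.
final class MutantFilters {
    /** How many times each mutant (identified by an integer id) has been run so far. */
    private final Map<Integer, Integer> runCounts = new HashMap<>();
    private final Random random;

    MutantFilters(long seed) {
        this.random = new Random(seed);
    }

    /** Keep the k candidates with the fewest recorded runs (stable on ties). */
    List<Integer> leastExecuted(List<Integer> candidates, int k) {
        List<Integer> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((Integer id) -> runCounts.getOrDefault(id, 0)));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    /** Keep k candidates chosen uniformly at random. */
    List<Integer> randomSample(List<Integer> candidates, int k) {
        List<Integer> shuffled = new ArrayList<>(candidates);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, Math.min(k, shuffled.size())));
    }

    /** Record that the given mutants were executed on the current input. */
    void recordRun(List<Integer> executed) {
        for (int id : executed) {
            runCounts.merge(id, 1, Integer::sum);
        }
    }
}
```

In this sketch, a fuzzing loop would call `leastExecuted` (or `randomSample`) on the mutants that survive the sound pruning steps, run only the returned mutants, and then call `recordRun` on them so that rarely executed mutants are favored on later inputs.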

RQ3: Nontrivial Mutants
How does the reliability of killing nontrivial mutants differ between mutation-analysis guidance and coverage guidance?
Not all mutants are equal: some mutants are easier to kill than others. We define a mutant as trivial if, in every experiment, it is killed by the first input that executes the mutated expression (this is the dynamic version of Kaufman et al.'s definition [40]). Since trivial mutants are killed as soon as the corresponding code is covered, conventional coverage-guided fuzzing like Zest suffices to capture them. On the other hand, since nontrivial mutants may or may not be killed even after the mutated expression is covered, we are interested to know whether these get killed based on pure luck or whether these get killed reliably across repetitions, potentially due to the guidance in the fuzzing algorithm. We measure reliability by counting the number of repetitions in which each mutant is killed. In particular, we study the difference in reliability of killing nontrivial mutants between Zest and the best variant of Mu2. Figure 7 is a histogram showing the difference in kill rate of nontrivial mutants between Mu2-OPT and Zest. The values on the right side (green) correspond to mutants killed more reliably by Mu2-OPT than Zest. For the sake of visualization, the mutants with no difference in kill rate (X-axis value 0) are excluded from the charts.
Mu2-OPT is able to achieve a significantly higher kill frequency of nontrivial mutants in ChocoPy and Jackson. In fact, there are 29 mutants in Jackson that are killed during all repetitions of Mu2-OPT and zero repetitions of Zest. This is a strong indication that mutation-analysis feedback can consistently discover mutant-killing inputs that coverage-only feedback is incapable of finding. For the Gson parser, there are 22 vs. 24 nontrivial mutants killed more reliably by Zest and Mu2-OPT respectively, though the X-axis values are generally higher for Mu2-OPT. For Closure, there are over 60 mutants killed by at least one more repetition of Mu2-OPT compared to the 4 by Zest. Overall, Mu2-OPT is able to kill nontrivial mutants more reliably than Zest.
We also note that Figure 7 provides some insight into the diversity of mutants, particularly redundant mutants. By definition, redundant mutants are grouped together in the same bars since they are always killed at the same frequency. Flattening the size of each bar to 1 removes at least all redundant mutants and acts as a lower bound on the number of nonredundant mutants.
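The reliability measure can be illustrated with a short sketch. The class `KillReliability` and its methods are hypothetical, and we assume here that the X-axis of Figure 7 is the difference in the number of repetitions in which a mutant is killed; the three-repetition arrays in `main` are toy data, not experimental results.

```java
// Hypothetical sketch: illustrative names and toy data only.
final class KillReliability {
    /** Number of repetitions in which the mutant was killed. */
    static int killCount(boolean[] killedInRepetition) {
        int count = 0;
        for (boolean killed : killedInRepetition) {
            if (killed) {
                count++;
            }
        }
        return count;
    }

    /**
     * Positive values mean the mutant is killed in more repetitions by the
     * first fuzzer than by the second; zero means no difference.
     */
    static int difference(boolean[] first, boolean[] second) {
        return killCount(first) - killCount(second);
    }

    public static void main(String[] args) {
        boolean[] mu2Opt = {true, true, true};   // killed in every repetition
        boolean[] zest = {false, true, false};   // killed in one repetition
        System.out.println(difference(mu2Opt, zest)); // 2
    }
}
```

A mutant whose difference is 0 would be dropped from the histogram, and mutants sharing the same difference fall into the same bar.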

RQ4: Differential Mutation Testing
How much stronger is the differential mutation testing oracle than the implicit oracle?
Described in Section 3.3, the differential mutation testing oracle is responsible for determining whether an input kills a mutant by comparing the outputs of the executions. We contrast it with the incomplete greybox fuzzing implicit oracle, which only detects uncaught exceptions or failed property checks. To study the strength of the differential oracle, we evaluate the improvement in the number of killed mutants over the implicit oracle. Figure 8 shows the difference in mutant kills across the benchmarks with the two types of oracles. The differential oracle is able to kill a significantly higher number of mutants across all 5 benchmarks, with an average increase of 25%. In the ChocoPy benchmark, over 85 more mutants are caught! This is because certain mutants are unkillable by the implicit oracle due to their effect on program behavior. We describe one of these mutants below. For brevity, we describe the code functionality, omitting the actual code snippet.
The ChocoPy type-checker has a function to check that the left and right operand types of an expression match when using the "+" operator. If so, the type is returned and assigned to the corresponding expression node in the output AST; otherwise, an   [3]). The differential oracle kills the mutant since the output AST produced by the mutant contains a type error, whereas the output AST of the original program does not. The implicit oracle fails to kill this mutant since no exceptions are triggered.
We conclude that the differential oracle is substantially stronger than a traditional implicit oracle and is valuable for capturing a larger set of mutant program execution behaviors.
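The difference between the two oracles can be sketched on a toy program. Everything below is hypothetical: the class `Oracles`, the way outcomes are observed (a return value, or the class of an uncaught `RuntimeException`), and the example mutant are illustrations and are unrelated to the ChocoPy mutant discussed above. The toy mutant changes a boundary condition, which alters the output on one input without raising any exception.

```java
import java.util.Objects;
import java.util.function.Function;

// Hypothetical sketch: illustrative names, not the oracle implementation in Mu2.
final class Oracles {
    /** Observe a run as either its return value or the class of the uncaught exception. */
    static <I, O> Object observe(Function<I, O> program, I input) {
        try {
            return program.apply(input);
        } catch (RuntimeException e) {
            return e.getClass();
        }
    }

    /** Implicit oracle (simplified): the mutant is killed only if it throws. */
    static <I, O> boolean implicitKills(Function<I, O> mutant, I input) {
        return observe(mutant, input) instanceof Class;
    }

    /** Differential oracle: the mutant is killed if its observed outcome differs from the original's. */
    static <I, O> boolean differentialKills(Function<I, O> original, Function<I, O> mutant, I input) {
        return !Objects.equals(observe(original, input), observe(mutant, input));
    }

    public static void main(String[] args) {
        Function<Integer, String> original = x -> x >= 0 ? "nonneg" : "neg";
        Function<Integer, String> mutant = x -> x > 0 ? "nonneg" : "neg"; // boundary changed
        System.out.println(implicitKills(mutant, 0));               // false
        System.out.println(differentialKills(original, mutant, 0)); // true
    }
}
```

On input 0 the mutant silently returns a different string, so only the differential comparison detects it; on inputs away from the boundary neither oracle kills the mutant.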

THREATS TO VALIDITY
Threats to construct validity. First, the measurement of mutation score is of course dependent on the set of mutation operators being applied to generate program mutants [62]. We aim to mitigate this threat by using the default set of operators in the widely used PIT framework, as justified in Section 3.1. Second, our test oracles (ref. Section 3.3) report an outcome of TIMEOUT if a mutant execution does not terminate within a predefined limit. Such a bound is necessary to catch infinite loops (e.g., for mutants that negate loop conditions). However, if this bound is too small, then it is possible in theory that some mutants could be marked as "killed" by a fuzzer-generated input even if their execution would eventually produce a correct output. To mitigate this threat, we compute the mutation scores for the final test-input corpus by re-running saved inputs on all program mutants using a larger timeout. We also manually analyzed a sample of reported timeouts to confirm correspondence to infinite loops; we found no false kills.
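The false-kill risk described above can be reproduced with a small sketch. The class `TimedRun`, the `Outcome` values other than TIMEOUT, and the use of a wall-clock limit enforced through an executor are assumptions made for illustration; the source does not specify how Mu2 enforces its limit. The example shows a slow but terminating run being classified as TIMEOUT under a small bound and completing under a larger one.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a wall-clock bound is assumed; names are illustrative.
final class TimedRun {
    enum Outcome { COMPLETED, TIMEOUT, EXCEPTION }

    static Outcome run(Runnable mutantExecution, long timeoutMillis) {
        // Daemon thread so a non-terminating mutant cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        Future<?> future = executor.submit(mutantExecution);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.COMPLETED;
        } catch (TimeoutException e) {
            future.cancel(true); // request interruption of the running mutant
            return Outcome.TIMEOUT;
        } catch (ExecutionException e) {
            return Outcome.EXCEPTION; // the mutant threw an uncaught exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.EXCEPTION;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Runnable slowButCorrect = () -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println(run(slowButCorrect, 20));   // TIMEOUT: would be a false kill
        System.out.println(run(slowButCorrect, 5000)); // COMPLETED under the larger bound
    }
}
```

Re-running the saved inputs with the larger bound, as in the second call, is what separates mutants that are merely slow from mutants that genuinely fail to terminate.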
Threats to internal validity. Our evaluation uses mutation score when comparing the quality of the generated test-input corpora since our goal was to synthesize a test-input corpus with high mutation score (ref. Section 3.1). We assume that a high mutation score is a valuable objective for fuzzers. However, there is a potential bias from using mutation score as an evaluation metric, as Mu2 benefits from incorporating mutation testing in the fuzzing loop. Our results nevertheless capture the performance overhead impact of mutation-analysis-guided fuzzing on mutation score and code coverage.
Our implementation simply reused all the fuzzing hyperparameters (e.g., PickInput and MutateInput in Algorithms 1 and 2) that were set by the baseline Zest fuzzer. Tuning these heuristics could affect our results, but the size of this search space is too large for us to explore systematically. We stick with the baseline-provided defaults for simplicity and make sure to use the same hyperparameters for both Zest and Mu2 so that our conclusions are exclusively based on the inclusion of mutation-analysis guidance in Mu2 only.
Threats to external validity. Since our implementation is based on JQF [58] and PIT [17], which both target JVM bytecode, we used Zest as the baseline. We do not know if our conclusions will generalize to other programming languages or fuzzing platforms, such as the family of tools based on AFL [81] and libFuzzer [49]. The available mutation testing infrastructure for C/C++ appears to be less mature than that for Java/JVM. Another threat to external validity arises from selection bias in our choice of benchmark programs. Our targets have input and output formats which make them amenable to differential mutation testing. This is not true for all applications that can be fuzzed, e.g., PDF viewers and other programs whose output is graphical. The study of the general test oracle problem [6] is outside the scope of this paper.

RELATED WORK
Greybox fuzzing. The field of coverage-guided greybox fuzzing has a vast literature, as surveyed by Manès et al. [52]; a more recent and evolving publication list is maintained by Wen [78]. The majority of fuzzing research focuses on improving heuristics such as seed-picking power schedules [9], input mutations [5,48,50], and coverage feedback [14,25]. FuzzFactory [60] generalizes the feedback of greybox fuzzing beyond code coverage to domain-specific metrics that satisfy certain conditions. Our proposed mutation-analysis guidance fits into this framework.
Greybox fuzzing for regression testing. A family of techniques has been developed for directing fuzz testing towards specific code locations [8,13,77] or code commits [84], which can be used for identifying regressions. However, this still requires running a full fuzzing campaign, which can take hours or days. In contrast, we focus on synthesizing a high-quality test-input corpus which can be quickly executed in CI (usually taking a few seconds or minutes), as is often already practiced (ref. Section 1).
Guiding fuzzing with mutation testing. We first proposed the idea of using mutation testing to augment greybox fuzzing in a student research competition [46]; independently, Qian et al. [67] published a similar idea at a regional symposium. However, we believe the current paper is the first to thoroughly evaluate the performance and scalability of incorporating mutation testing in the fuzzing loop. In particular, we identified that the evaluation in Qian et al.'s paper [67] uses an unsound comparison to the baseline Zest; they use mutation analysis with multiple threads but run Zest only single threaded for the same time bound, hence giving higher CPU time to their technique and obscuring the effects of the increased overhead of performing mutation testing. Additionally, they use a selection strategy to choose 10 mutants at random, but do not measure the impact on the overall mutation score, since they never run all killable mutants. We were unable to perform a head-to-head evaluation between Mu2 and their technique since their implementation is not open source.
Using mutation testing in automated test generation. In a registered report, Groce et al. [33] propose fuzzing specially mutated targets to find inputs triggering interesting control flow not in the original program, and then use those inputs as seeds for coverage-guided fuzzing. However, they do not target maximizing mutant kills; for example, a mutant which only changes return codes gets low fuzzing priority in their approach because it won't affect control flow [33]. In contrast, Mu2 aims to find inputs that differentiate program output on potentially semantics-altering mutants, which often change data values but not necessarily control flow. Our approach is therefore orthogonal to Groce et al.'s and could potentially even be combined.
μ-test [24] and EvoSuite [23] are evolutionary test-generation techniques that can use mutation scores as an objective as well as a fitness function. μ-test, which is based on Javalanche [70], uses a form of differential testing to compare the coverage traces of the original program and a mutant. Unlike these tools, which generate unit test methods for exercising program API, greybox fuzzing focuses on the generation of inputs for system testing, given a fixed entry point.
Improving the performance of mutation testing. A lot of research has been conducted to speed up mutation testing [18,37,63,66,73]. The approaches fall into three categories: (1) reducing the number of mutants to generate, (2) pruning mutants to run on a given test, and (3) speeding up mutant evaluation on a given test. For example, many techniques have been developed to avoid generating redundant or equivalent mutants [51]; we do not currently make an attempt to identify these statically. Just et al. [39] introduce the propagation, infection, execution (PIE) model to prune mutants that are test-equivalent using dynamic analysis. Mu2 implements the execution and infection optimizations from this work. MeMu [26] speeds up PIT's mutation analysis by memoizing unmutated methods with long execution time; this is a promising approach that could be integrated into Mu2. Kaufman et al. [40] prioritize mutants to reach test completeness faster. All these optimizations are sound: they do not avoid analyzing mutants that may be killable.
Other research directions aim to reduce mutation-analysis costs while potentially trading off soundness. For example, weak mutation [35] has been proposed to terminate mutant evaluation quickly by observing the intermediate state after executing the mutated program locations. Many techniques have been developed for mutation reduction [63,66,73], where only a subset of mutants is evaluated based on some program-specific criteria. In this paper, we have evaluated the random sampling approach and a novel least-executed approach to mutant selection. Recently, Guizzo et al. [34] have proposed an evolutionary approach to automate the generation of optimal cost reduction strategies. Further, predictive mutation testing [82] uses machine learning to estimate which mutants are most likely to be killed. Incorporating such advanced models into the Mu2 framework is a promising direction for future work.

CONCLUSION
We investigated the challenges of incorporating mutation analysis to guide greybox fuzzing. Our implementation, Mu2, integrates PIT mutation testing into the JQF framework, and is aimed at producing a test-input corpus with high mutation score. In our design, we incorporated differential testing as an oracle for killing mutants and proposed optimizations to improve fuzzing throughput by dynamically pruning the number of mutants to be executed. We applied both sound and aggressive optimizations for Mu2 to help scale it to larger programs. After conducting a thorough evaluation of Mu2 and several variants, we found that mutation-analysis feedback can improve the mutation score of a test-input corpus and more reliably kill nontrivial mutants than coverage-guided fuzzing.
One of the challenges identified by Gopinath et al. [31] was to "improve visibility of mutation analysis among fuzzing researchers." We hope our work increases awareness of mutation analysis techniques in the fuzzing community and encourages other researchers to develop more advanced hybrid techniques.