Green Fuzzer Benchmarking

Over the last decade, fuzzing has been increasingly gaining traction due to its effectiveness in finding bugs. Nevertheless, fuzzer evaluations have been challenging during this time, mainly due to a lack of standardized benchmarking. Aiming to alleviate this issue, in 2020, Google released FuzzBench, an open-source benchmarking platform that is widely used for accurate fuzzer benchmarking. However, a typical FuzzBench experiment takes CPU years to run. If we additionally consider that fuzzers under active development evaluate any changes empirically, benchmarking becomes prohibitive both in terms of computational resources and time. In this paper, we propose GreenBench, a greener benchmarking platform that, compared to FuzzBench, significantly speeds up fuzzer evaluations while maintaining very high accuracy. In contrast to FuzzBench, GreenBench drastically increases the number of benchmarks while drastically decreasing the duration of fuzzing campaigns. As a result, the fuzzer rankings generated by GreenBench are almost as accurate as those produced by FuzzBench (with very high correlation), but GreenBench is 18 to 61 times faster. We discuss the implications of these findings for the fuzzing community.


INTRODUCTION
Greybox fuzzing [4,6] has been shown to be effective in finding bugs, thereby improving software quality. In the last decade, it has seen wide industrial adoption and significant research advancements. For instance, Google's OSS-Fuzz service [5] has found 28,000 bugs across 850 open-source projects using various fuzzers [3,4,6,16], and, in 2022 alone, the four top software-engineering conferences (ASE, FSE, ICSE, ISSTA) published 36 papers containing "fuzz" in their title.
Over the years, however, the evaluation of fuzzing techniques has been challenging, mainly due to the lack of standard benchmarking platforms, metrics, and benchmarks. In early 2020, FuzzBench [26], an open-source benchmarking service, was released by Google to alleviate such issues. A FuzzBench experiment typically compares about 11 fuzzers; each fuzzer is run on about 20 real-world benchmark programs; each run involves 20 fuzzing campaigns (i.e., trials) of 23 hours. All raw data is made available to the user together with a result report showing statistically significant comparisons among fuzzers, i.e., fuzzer rankings. FuzzBench is currently widely used for accurate fuzzer benchmarking.
However, FuzzBench experiments are extremely costly, in terms of both computational resources and time. It is certainly prohibitive to regularly run such experiments on "academic-scale" infrastructure to evaluate improvements to a fuzzer under development. And even though researchers may request FuzzBench experiments to be run on Google's infrastructure for free, these still take days to complete. Overall, a research project in this area could require CPU centuries and tens of thousands of dollars [26], regardless of whether this money is spent by Google.
Setting researcher time aside for the moment, the computational time needed for fuzzer benchmarking raises significant environmental and financial concerns. The global energy crisis has increased the cost of such experiments, and many industrial and academic resources are already under severe scrutiny due to cost-saving measures. So, on the one hand, more comprehensive and statistically significant fuzzer evaluations advance the state of the art. On the other hand, however, we cannot afford them, especially since fuzzer improvements are the result of many iterations and intermediate experiments that guide research and development efforts.
Our approach. In this paper, we propose GreenBench, a green benchmarking platform that aims to run comprehensive fuzzer evaluations in a fraction of the resources and time required by FuzzBench. The purpose of GreenBench is to compute quick and inexpensive fuzzer rankings while maintaining high accuracy with respect to FuzzBench; in fact, GreenBench's results are comparable to those of FuzzBench (with 0.82 correlation in our experiments).
We are exploring a trade-off here, between speed and accuracy in ranking fuzzers. With GreenBench, users can obtain highly accurate results without spending the prohibitive amounts of time required until now. However, as expected, GreenBench's results are not perfectly accurate with respect to FuzzBench, which is why GreenBench is not meant to replace large-scale experiments; there is still potential gain from running them.
The key idea behind GreenBench is to run on a large number of benchmarks (i.e., thousands) for only a short period of time (i.e., minutes). This is in contrast to FuzzBench, which runs on a few programs for many hours. As a result, GreenBench generates almost the same ranking of fuzzers as FuzzBench (with a correlation of 0.82) while being 18 to 61 times faster. But how do we obtain thousands of benchmarks?
GreenBench creates benchmarks by using the existing FuzzBench programs with diverse seed inputs (i.e., 100 seed inputs per program). As a result, each GreenBench benchmark is significantly different from the others since seed inputs are known to have a major impact on fuzzer effectiveness [20,21,29]. Intuitively, providing diverse seed inputs for a particular program is analogous to exploring the same maze from different starting positions. In contrast, FuzzBench always uses the same seed inputs for each program. We argue that this is a poor design decision independently of our approach: it evaluates fuzzers only on a specific part of the input state space for each program and, therefore, may lead to over-fitting fuzzers under active development to these particular seeds.
Hence, GreenBench evaluates the effectiveness of fuzzers on benchmarks of the form (p, s), where p is a program and s a diverse seed input for p. As with FuzzBench, the fuzzer that performs best (with statistical significance) according to a given metric, i.e., achieved coverage or detected bugs, within the time limit wins. Given the challenges of bug-based benchmarking [12], achieved coverage is typically preferred and is the default metric in FuzzBench.
When using a coverage-based metric, GreenBench implements the following optimization for larger cost savings. During fuzzing, we cannot, of course, know when all possible coverage of a given benchmark has been achieved. Consequently, fuzzers run until the time limit even when there are no more branches to cover. However, to save even more time and energy, we can further specify our benchmarks with a target coverage, i.e., the coverage that a fuzzer should achieve for the benchmark to be considered complete. If the fuzzer reaches the target coverage before the time limit, it stops.
We define target coverage as a set T of target (control-flow) edges that are covered by n target inputs but not by the seed input s. In particular, each benchmark now becomes (p, s, T), where T is the set of target edges. Parameter n is important in controlling the difficulty of each benchmark; that is, it guarantees that n inputs suffice to cover all edges in T, independently of the size of this set.
When using target coverage to rank fuzzers, GreenBench ranks first the fuzzer that covers the most target edges. To break any ties, we additionally use the time it takes for fuzzers to cover the target edges as well as the total number of covered edges.

Contributions. Our paper makes the following contributions:
• We propose GreenBench, a novel benchmarking platform that speeds up fuzzer benchmarking by orders of magnitude, thereby saving significant time and energy.
• We implement GreenBench as an open-source extension of FuzzBench, a widely used platform for accurate fuzzer benchmarking.
• We evaluate GreenBench against FuzzBench in terms of speed and accuracy; our results show that GreenBench can generate a fuzzer ranking with very high correlation with the ranking generated by FuzzBench but from 18 to 61 times faster.
• We discuss the implications of our findings for the fuzzing community.

Outline. The rest of this paper is organized as follows. Section 2 explains and motivates our fuzzer-benchmarking approach. In Section 3, we describe our implementation of GreenBench, and in Section 4, we present our experimental evaluation comparing GreenBench with FuzzBench in terms of speed and accuracy. Section 5 elaborates on the importance of greener fuzzer benchmarking. We discuss related work in Section 6 and conclude in Section 7.

APPROACH
Our GreenBench approach incorporates three key changes in the FuzzBench platform: (1) Randomizing the initial seed inputs of benchmark programs, thereby creating benchmarks of the form (p, s); (2) Drastically reducing the duration of fuzzing campaigns (from 23 hours to 15 minutes in our default configuration); (3) Drastically increasing the number of campaigns per benchmark program (from 20 campaigns, each running on the same p with the same, fixed seed input s, to 100 campaigns in our default configuration, each running on the same p but with diverse seed inputs s_i, where i = 1...100).
Figure 1 illustrates these changes visually. The outer black oval represents the input state space of a given benchmark program p. For simplicity, since the input state space is typically unbounded, a point in this oval represents many inputs that exercise the same program path. FuzzBench runs many fuzzing campaigns from the same seed s. The inputs discovered during those campaigns are depicted by the red-shaded area. As fuzzers are non-deterministic, different campaigns naturally discover different sets of inputs. In the figure, the inputs within the red line represent the intersection of these sets, i.e., the inputs discovered by all campaigns.
The first two changes in GreenBench are visualized in green in the figure. First, there are many diverse initial seeds s_i, and second, each campaign is shorter, thus covering a smaller part of the input state space. The third change is not explicitly visualized, but intuitively, many campaigns starting from diverse seeds aim to evenly cover the input state space with green areas. Note that starting a campaign with s_5 allows evaluating a fuzzer's effectiveness on a region of the input space that would hardly be reached when only starting campaigns with s.
To further reduce costs and optimize experiment running time, we additionally propose a fourth change. Instead of comparing fuzzer effectiveness with respect to the total achieved edge coverage, we suggest basing the comparison on the achieved coverage of certain randomly selected target edges. By bounding the number of edges to be covered, a fuzzing campaign can be terminated as soon as all target edges are covered instead of allowing it to run until the time limit. As shown in our experiments, this optimization indeed results in larger time savings without sacrificing the accuracy of the benchmarking platform.
In the following, we describe all four changes in more detail and discuss the motivation behind these design choices.

Randomizing Initial Seed Inputs
Seed inputs can have a significant impact on the effectiveness of a fuzzing campaign [20,21,29], and as a result, the fuzzing community has been debating what seeds to use for benchmarking [21]. There are two extreme options, namely, using an empty seed corpus or a large, almost saturated corpus.
In practice, an empty corpus is mainly relevant for fuzzing a new program with no existing seed inputs. During the first few hours of a campaign, most fuzzers will discover many new edges. Using an empty corpus is, therefore, not suitable for comparing fuzzers with a relatively short time limit, as they will all quickly increase the achieved coverage and the variance of their effectiveness will be high.
In contrast, a large corpus can be used for programs that have already been extensively fuzzed. In such cases, increasing coverage is difficult, and any newly discovered edges provide a good indication of a fuzzer's effectiveness. However, it may also bias comparisons in favor of fuzzers that specialize in finding new coverage using more expensive techniques. In addition, there may not be any noticeable coverage increase over a long period of time, thus rendering a very large corpus less suitable for benchmarking.
FuzzBench balances these two extremes by using a range of corpus sizes across different benchmark programs. However, the corpus is fixed for each program, and consequently, there is a risk of over-fitting fuzzers, as we will see in our experimental evaluation. More specifically, a fixed corpus may bias comparisons in favor of fuzzers that are under active development and use FuzzBench to validate algorithmic choices or tune hyper-parameters.
To reduce such bias, we randomize the initial seed corpus for different campaigns run on the same benchmark program. By default, GreenBench uses a single random seed input, but it may also be configured to use a corpus of any size. We pick seed inputs from an existing large corpus uniformly at random. The large corpus can be obtained by running a long fuzzing campaign once per benchmark program and using the generated pool as the corpus, or by directly using Google's OSS-Fuzz corpus [5]. Since seed selection is performed randomly, each fuzzing campaign will (most likely) sample a different part of the input state space; by construction, each seed is guaranteed to cover a different program path since this is the criterion that AFL-like fuzzers use for adding seeds to their corpus. This is most noticeable at the beginning of fuzzing (i.e., the first minutes or hours) when there is typically little coverage overlap between two campaigns started from diverse inputs.
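As a rough illustration, the following Python sketch shows how such uniform random seed selection could look; the directory layout and function name are our own and not GreenBench's actual API.

```python
import random
from pathlib import Path

def sample_initial_corpus(large_corpus_dir, corpus_size=1, rng=random):
    """Pick corpus_size seed inputs uniformly at random from a large corpus.

    With an AFL-style corpus, each file covers a distinct program path,
    so every sampled seed starts a campaign in a different region of the
    input state space.
    """
    candidates = sorted(Path(large_corpus_dir).iterdir())  # deterministic order before sampling
    return rng.sample(candidates, corpus_size)

# e.g., one random seed per campaign (GreenBench's default corpus size):
# seeds = sample_initial_corpus("corpora/libpng-1.2", corpus_size=1)
```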
As shown in Figure 1, using a diverse initial corpus for each campaign explores the input state space more broadly right from the start. Not only does this change reduce the risk of over-fitting, but it also allows for drastically reducing the campaign duration: no time needs to be spent on discovering the same inputs over and over again across campaigns.

Drastically Reducing Campaign Duration
Over the years, it has become established practice to run relatively long fuzzing campaigns, i.e., lasting one or more days. FuzzBench is no exception: its campaigns are 23 hours long (they are not 24-hour campaigns in order to reduce costs by running on less expensive cloud instances). This practice is mainly motivated by the fact that coverage variance tends to decrease with time, and of course, less variance tends to provide more reliable comparisons among fuzzers. However, the concrete choice of the time limit is not well motivated since variance depends on the benchmark program, i.e., variance may decrease more quickly for some benchmarks than for others. So, one way to reduce costs would be to better calibrate the time limit for each benchmark program.
Our approach is even bolder: we use a very short time limit (15 minutes by default) across all benchmark programs. Viewed in isolation, this change seems like a poor design decision, and our experimental results also substantiate it as such. However, it should be considered in combination with change 1 (randomizing initial seed inputs) and change 3 (drastically increasing the campaign number). As shown in Figure 1, by using a large number of short campaigns, each evaluating fuzzers on a different part of the input state space, we do not waste time re-discovering the same inputs. Instead, fuzzers are evaluated on diverse seed corpora and may reach parts of the input space that they would not with FuzzBench.

Drastically Increasing Campaign Number
In fuzzer benchmarking, it is customary to run many campaigns per benchmark program since fuzzers are non-deterministic tools. In other words, the final coverage achieved by two campaigns of the same fuzzer (with the same seed inputs and benchmark program) may vary. By default, FuzzBench runs 20 campaigns per benchmark program, thus allowing us to compute statistical measures, such as variance, and to compare the statistical significance of any differences in effectiveness among fuzzers.
In practice, however, the final coverage achieved with FuzzBench has very low variance for most benchmark programs and fuzzers. This is due to the long campaign duration as well as the use of the same seed corpus for all campaigns on a given benchmark program. Consequently, running 20 such campaigns is a waste of resources, thereby creating an opportunity for cost savings.
Even though GreenBench only runs a single campaign for each benchmark, i.e., a benchmark program and its randomized seed corpus, it runs significantly more campaigns for the same benchmark program. The key idea is to rely on more campaigns to ensure that, despite the short campaign duration (change 2), the fuzzer is still evaluated on a large space of interesting inputs. Intuitively, many randomly distributed, small, green areas in Figure 1 cover the black oval more evenly than a single, large, red area.

Bounding Target Edge Coverage
Fuzzer benchmarking typically uses two measures of effectiveness, namely achieved code coverage and detected bugs. Due to a number of challenges with using the number of detected bugs [12], achieved coverage is the more established metric. (We use code coverage here even though GreenBench could easily be adapted to use the number of detected bugs instead.) However, it is not tractable to determine the maximum possible coverage for real-world benchmarks. Otherwise, fuzzing campaigns could be terminated once this maximum was reached, allowing costs to be reduced further.
In GreenBench, we design an approximate solution that enables terminating campaigns early. On a high level, we randomly select a subset of all feasible edges that are not already covered by the initial seed corpus of a benchmark program, i.e., given a benchmark (p, s), we select edges in p that are not covered by s. We refer to these edges as target edges and terminate a campaign as soon as it covers all target edges. This change of randomly selecting a subset of all edges resembles bug-based benchmarking since bugs also occur sparsely in a program.
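To illustrate the termination condition, here is a minimal Python sketch of the check a campaign runner could perform at every coverage-measurement interval; `measure_covered_edges` and the surrounding loop are hypothetical, not GreenBench's actual code.

```python
def should_stop_early(covered_edges, target_edges):
    """A campaign is complete once all target edges have been covered."""
    return target_edges <= covered_edges  # subset test on edge sets

# Hypothetical use at every coverage-measurement interval:
#   covered = measure_covered_edges(campaign_corpus)  # assumed helper
#   if should_stop_early(covered, benchmark.target_edges):
#       stop_campaign()  # terminate before the 15-minute limit
```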
Under the hood, we first need to approximate the set E of feasible edges that are not already covered by the initial seed corpus. Such a set can be determined using the existing large corpus, which we also needed for randomizing initial seed inputs (change 1). E is then the set of edges covered by the large corpus excluding those covered by the initial corpus. Even though the target edges T are a subset of E, we cannot simply sample uniformly from E to form T. This is because we may include edges that are very difficult to cover within the short time limit of fuzzing campaigns, consequently defeating the purpose of saving costs and producing a meaningful comparison among fuzzers.
Instead, GreenBench implements the following alternative. We randomly select n inputs from the large corpus that are not already in the initial corpus. T is then composed of those edges that are not covered by the initial corpus but are covered by the n random inputs. Algorithm 1 shows our approach for generating a benchmark of the form (p, s, T) for a benchmark program p when given an existing large corpus and parameter n.
Lines 2-3 randomly select an initial seed input s from the large corpus and determine the coverage it achieves in the benchmark program. Next, lines 4-7 randomly select n target inputs and determine their coverage. The final target coverage T is computed by removing any edges that are already covered by the initial seed input (line 8). If the final target coverage is non-empty (line 9), the benchmark (p, s, T) is emitted.

Algorithm 1: GreenBench's benchmark-generation algorithm for a given benchmark program p, a large set of potential initial seed inputs (corpus), and a number n of target inputs. On a high level, we first randomly select an initial seed s and n target inputs from the corpus. The benchmark then consists of the program p, the initial seed s, and the set T of edges that are covered by the target inputs but are not covered by s.

This approach of selecting target edges bounds the difficulty of the generated benchmarks, as n inputs suffice for covering all target edges. It also allows for a smooth increase in difficulty for a given benchmark. In particular, GreenBench gives partial credit to fuzzers for covering only some of the target edges, e.g., the shallower (and therefore easier) ones.
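For concreteness, the following Python sketch mirrors the steps of Algorithm 1 as described above; `edge_coverage` is an assumed helper (stubbed here), and the retry loop for empty target sets is our own reading of the algorithm rather than its verbatim pseudocode.

```python
import random

def edge_coverage(program, inp):
    """Assumed helper: replay inp on a coverage-instrumented build of
    program and return the set of covered control-flow edges (stub)."""
    raise NotImplementedError

def generate_benchmark(program, large_corpus, n, rng=random):
    """Sketch of Algorithm 1: produce a benchmark (program, seed, target_edges)."""
    while True:
        # Lines 2-3: random initial seed and its coverage.
        seed = rng.choice(large_corpus)
        seed_cov = edge_coverage(program, seed)

        # Lines 4-7: n random target inputs and their combined coverage.
        targets = rng.sample([i for i in large_corpus if i != seed], n)
        target_cov = set()
        for inp in targets:
            target_cov |= edge_coverage(program, inp)

        # Line 8: keep only edges not already covered by the seed.
        target_edges = target_cov - seed_cov

        # Line 9: emit the benchmark only if something remains to cover
        # (retrying on an empty set is our assumption).
        if target_edges:
            return program, seed, target_edges
```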
When the same number of target edges is covered by multiple fuzzers, GreenBench breaks the tie by first using the time to find the target edges and, finally, the total number of covered edges for each fuzzer.
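This tie-breaking order can be captured as a sort key; a minimal sketch with illustrative field names, not GreenBench's actual data model:

```python
from dataclasses import dataclass

@dataclass
class FuzzerResult:
    covered_targets: int    # target edges covered (higher is better)
    time_to_targets: float  # seconds until the last target edge was covered (lower is better)
    total_edges: int        # all edges covered (higher is better)

def ranking_key(r: FuzzerResult):
    """Primary: most target edges; tie-breakers: faster, then more total edges."""
    return (-r.covered_targets, r.time_to_targets, -r.total_edges)

# ranking = sorted(results.items(), key=lambda kv: ranking_key(kv[1]))
```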

IMPLEMENTATION
Our implementation reuses and extends the existing FuzzBench infrastructure (e.g., benchmark programs and fuzzer runners) as much as possible. This section provides a short overview of the most important implementation changes, some of which have been incorporated into the mainline FuzzBench project.
First, we made it possible to compare fuzzers based on edge coverage; previously, FuzzBench used region coverage. Edge coverage is a more common and better-understood metric, and it is now the default in FuzzBench.
Second, we added a feature to FuzzBench to provide custom initial seed inputs (instead of the fixed seed corpus) for different benchmark programs and campaigns. In GreenBench, we use this feature to start campaigns with randomized seed inputs.
Third, we extended the coverage-measurement module to keep track of target-edge coverage in addition to measuring the overall edge coverage.
Moreover, in our implementation, we adjusted several FuzzBench settings. First, we use a coverage-measurement interval of 1 minute instead of 15 minutes. Second, we reduce the time limit for campaigns from 23 hours to 15 minutes. Third, we increase the number of campaigns per benchmark program from 20 to 100. These are our default settings, but we also consider several variants in our experimental evaluation.
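In code form, these adjustments amount to three parameter changes; the constant names below are illustrative, not FuzzBench's actual identifiers.

```python
# Default experiment settings, as described above.
FUZZBENCH_DEFAULTS  = dict(measure_interval_s=15 * 60, time_limit_s=23 * 3600, campaigns=20)
GREENBENCH_DEFAULTS = dict(measure_interval_s=60,      time_limit_s=15 * 60,   campaigns=100)
```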

EXPERIMENTAL EVALUATION
In this section, we address the following research questions:
RQ1: How long does it take to generate random benchmarks of the form (p, s, T) from existing benchmark programs and a large seed corpus?
RQ2: How much time does GreenBench save with respect to FuzzBench?
RQ3: How accurate is GreenBench in comparison with FuzzBench?
RQ4: Can GreenBench run fewer campaigns without sacrificing accuracy?
RQ5: Can GreenBench run shorter campaigns without sacrificing accuracy?
RQ6: Are GreenBench results stable?

Setup
Large corpora. Recall that large corpora are needed in changes 1 and 4 of GreenBench. We used AFL++ [16] (commit 45668bb) to generate a large corpus for each benchmark program. For generating a corpus, we used the default settings from the AFL++ FuzzBench setup. We selected AFL++ for this purpose since it was the winning fuzzer in the FuzzBench paper [26], but we could also have used another fuzzer.
Configurations. We used commit e816b71 of FuzzBench for our comparisons. Our configurations for GreenBench are described in detail in the rest of this section.
Machine. We performed all experiments on a 32-core Intel Xeon E5-2667 v2 CPU (3.30GHz) machine with 256GB of memory, running Debian GNU/Linux 11.

Results
We now discuss our findings for each research question.
RQ1: How long does it take to generate random benchmarks of the form (p, s, T) from existing benchmark programs and a large seed corpus? Figure 2 shows how long it takes to generate 100 benchmarks of the form (p, s, T) for each of the benchmark programs p in FuzzBench and a corresponding large seed corpus. As shown in the figure, for most benchmark programs, the time is less than 5 minutes, whereas for two programs, more time is spent than for all others together. For these two outliers, the average running time per input is much higher than for the other programs. The total time for all programs is 120.28 minutes, and the majority of this time is spent on executing inputs to compute their achieved edge coverage.
Note that the time for benchmark generation does not have to be spent when reusing benchmarks across experiments, which is the most common use case. In addition, note that we do not consider the time to obtain the large corpus needed for changes 1 and 4 in this research question. For each benchmark program, we built a large corpus by running a single fuzzing campaign (with AFL++); however, such a corpus may also be obtained differently.
Given a benchmark program and a corresponding large corpus, it typically takes only a few minutes to generate 100 benchmarks of the form (p, s, T).

RQ2: How much time does GreenBench save with respect to FuzzBench? A regular FuzzBench experiment with 6 fuzzers takes 55,200 CPU hours (6 fuzzers x 20 benchmark programs x 20 campaigns x 23 CPU hours), which is approximately 6.4 CPU years. In contrast, the corresponding GreenBench experiment without change 4 takes only 3,000 CPU hours (6 fuzzers x 20 benchmark programs x 100 benchmarks x 0.25 CPU hours), which is approximately 4.2 CPU months. This constitutes a speedup of 18.4x.
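The arithmetic behind these numbers is straightforward; as a sanity check:

```python
# CPU-time comparison from RQ2 (fuzzers x programs x campaigns x hours).
fuzzbench_cpu_h  = 6 * 20 * 20 * 23     # = 55,200 CPU hours (~6.4 CPU years)
greenbench_cpu_h = 6 * 20 * 100 * 0.25  # =  3,000 CPU hours (~4.2 CPU months)
print(fuzzbench_cpu_h / greenbench_cpu_h)  # -> 18.4
```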
We have also investigated the additional savings of GreenBench when enabling change 4. In general, the savings depend on the number of target edges. Fewer target edges should result in more savings, possibly at the expense of accuracy (see RQ3 below). The number of target edges can be controlled by changing the number of target inputs n, i.e., larger values of n should result in more target edges. We have compared different values of n to explore how the savings decrease as n increases.
For n = 2, the running time of a GreenBench experiment is further reduced by 31.92%. However, this comes at the cost of significantly reduced accuracy (see RQ3). For n = 3, the reduction is 12.70%, and for n = 5, the time is reduced by 8.89%. Both of these settings have good accuracy. When setting n to a much higher value (n = 50), the reduction is only 3.35% without notably increasing accuracy. Our default configuration with n = 5 and change 4 enabled provides a speedup of 20.2x over FuzzBench.
As we discuss in the following research questions, there are a number of important hyper-parameters that can further affect the speedup. For instance, GreenBench could even achieve a speedup of 61.3x over FuzzBench by reducing the number of benchmarks from 100 to 30 without significantly sacrificing accuracy (see RQ4).

The default GreenBench configuration runs in 3.8 CPU months, in contrast to FuzzBench, which takes 6.4 CPU years. There are GreenBench configurations that can even bring its running time down to 37.5 CPU days without significantly sacrificing accuracy.
RQ3: How accurate is GreenBench in comparison with FuzzBench? In this research question, we investigate the accuracy of GreenBench (and each of its design choices) by comparing the correlation of its fuzzer ranking with the FuzzBench ranking; we use the standard ranking function from FuzzBench [26]. We consider the following configurations:
FB: The vanilla FuzzBench configuration that runs 20 campaigns per benchmark program, each of 24 hours and with the same initial seed corpus;
R: A variant of FB that applies change 1 of randomizing initial seed inputs, i.e., it runs 20 campaigns per benchmark program, each of 24 hours but with a randomized initial seed input;
RD: A variant of R that additionally applies change 2 of drastically reducing the campaign duration, i.e., it runs 20 campaigns per benchmark program, each of 15 minutes and with a randomized initial seed input;
RDN: A variant of RD that additionally applies change 3 of drastically increasing the campaign number, i.e., it runs 100 campaigns per benchmark program, each of 15 minutes and with a randomized initial seed input;
RDN-2: A variant of RDN that additionally applies change 4 of bounding the target edge coverage with n = 2;
RDN-3: A variant of RDN that additionally applies change 4 with n = 3;
RDN-5: A variant of RDN that additionally applies change 4 with n = 5;
RDN-50: A variant of RDN that additionally applies change 4 with n = 50.
Table 1 shows the fuzzer rankings that are produced by these different configurations, and Table 2 shows the correlation between the fuzzer rankings of all configurations.
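For reference, the correlation between two such rankings can be computed with the standard Pearson coefficient; a minimal sketch (the ranking function itself is FuzzBench's and not reproduced here, and the example ranks are illustrative, not our results):

```python
from scipy.stats import pearsonr

def ranking_correlation(ranking_a, ranking_b):
    """Pearson correlation between two rankings of the same fuzzers,
    where each ranking maps a fuzzer name to its rank (1 = best)."""
    fuzzers = sorted(ranking_a)  # fixed, common order of fuzzers
    r, _p = pearsonr([ranking_a[f] for f in fuzzers],
                     [ranking_b[f] for f in fuzzers])
    return r

# ranking_correlation({"afl": 3, "aflpp": 1, "libfuzzer": 2},
#                     {"afl": 2, "aflpp": 1, "libfuzzer": 3})
```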
When comparing FB and R, the correlation (Table 2) drops to 0.89, confirming the substantial effect of initial seeds on benchmarking results. Certain fuzzers, such as AFL++ and Eclipser, seem to significantly benefit from the fixed seed corpus (Table 1). In fact, AFL++ seems to have been extensively tuned using FuzzBench experiments, which could explain why there is over-fitting to these specific seeds. To reduce such potential bias, we will use R as our main baseline. A very recent study on explainable fuzzer evaluation [28] independently makes a similar observation and tries to explain a fuzzer ranking through properties, such as size or coverage, of the initial corpus or of the benchmark programs.
When comparing R and RD, the correlation drops significantly, to 0.60. This is, of course, not surprising and confirms that 20 short campaigns per benchmark program are not able to reliably cover the input state space. By increasing the number of campaigns to 100 (configuration RDN), the correlation with R recovers. Let us now evaluate the effect of change 4 by considering different values for parameter n, namely, 2, 3, 5, and 50. We observe that the correlation with R drops for n = 2, but it increases again as we increase n. For n = 5, the correlation even surpasses the RDN configuration, thereby improving both accuracy and time savings. Notice that n = 50 does not increase accuracy with respect to R while also saving less time in comparison with smaller values (see RQ2). We, therefore, consider RDN-5 the default GreenBench configuration.
The fixed initial seeds of FuzzBench may lead to over-fitting.
The fuzzer ranking generated by the default GreenBench configuration has a 0.83 correlation with the ranking generated by FuzzBench when randomizing the initial seeds, but it is computed 20 times faster.
RQ4: Can GreenBench run fewer campaigns without sacrificing accuracy? In our default configuration, RDN-5, we use 100 campaigns per benchmark program. However, this number could potentially be reduced further without sacrificing accuracy. In this research question, we investigate how the choice of this parameter affects accuracy. Figure 3 plots the correlation in fuzzer ranking for different campaign numbers (N = 1, ..., 100) with respect to baseline R. Recall that, for each benchmark program p, GreenBench has generated 100 benchmarks of the form (p, s, T). In this experiment, for every value of N, we shuffle these benchmarks 100 times, and each time, we select the first N benchmarks for fuzzing. In the figure, we compute the median correlation with R (dark line) and determine 95%-confidence intervals (shaded area).
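A minimal sketch of this resampling procedure, assuming a helper (stubbed here) that ranks fuzzers on a benchmark subset and returns the correlation with baseline R:

```python
import random
import statistics

def correlation_with_R(selected_benchmarks):
    """Assumed helper: rank fuzzers on the given benchmark subset and
    return the Pearson correlation with baseline R (stub)."""
    raise NotImplementedError

def subsample_correlations(benchmarks, n_campaigns, repetitions=100, rng=random):
    """Shuffle the 100 generated benchmarks `repetitions` times; each time,
    keep only the first n_campaigns of them and record the correlation."""
    correlations = []
    for _ in range(repetitions):
        shuffled = list(benchmarks)  # copy, then shuffle in place
        rng.shuffle(shuffled)
        correlations.append(correlation_with_R(shuffled[:n_campaigns]))
    return statistics.median(correlations), correlations
```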
Even with lower values of N, such as 30, we already obtain a similar median correlation as with our default configuration. In fact, for N = 28, the correlation is already 0.83, the same as for RDN-5. This demonstrates that GreenBench could, in principle, save even more time: when changing RDN-5 to set N = 30, GreenBench is over 61x faster than FuzzBench. Overall, the number of campaigns provides a knob for controlling the accuracy-vs-speed trade-off.
GreenBench could run as few as 28 campaigns per benchmark program without sacrificing its accuracy while being 61 times faster than FuzzBench.
RQ5: Can GreenBench run shorter campaigns without sacrificing accuracy? In our default configuration, RDN-5, we use a time limit of 15 minutes (or 900 seconds) for each campaign. In this research question, we investigate how this time limit affects accuracy, and in particular, by how much it could be shortened without sacrificing accuracy. Figure 4 plots the correlation in fuzzer ranking for different time limits (t = 60, ..., 900 seconds) with respect to baseline R. As shown in the figure, the correlation converges quickly, and after about 10 minutes (or 600 seconds), it barely changes. In fact, at 600 seconds, the correlation is already 0.83, the same as for RDN-5. Again, GreenBench could save even more time: when changing RDN-5 to set t = 600, GreenBench is over 27x faster than FuzzBench. We also experimented with time limits of up to 30 minutes, and the correlation did not increase further. In general, the campaign duration also controls the accuracy-vs-speed trade-off and could be dynamically adjusted for different benchmark programs.

Figure 4: Correlation with baseline configuration R for different values of t (time limit for campaigns). After about 10 minutes, the correlation only improves minimally.
GreenBench could reduce the campaign duration down to 10 minutes without sacrificing its accuracy while being 27 times faster than FuzzBench.

RQ6: Are GreenBench results stable? Since fuzzers are non-deterministic, a benchmarking experiment may generate slightly different results from another. To investigate the stability of GreenBench results across different benchmarking experiments, we performed three independent runs of our default configuration, RDN-5, on three different machines with the same hardware configuration.
Table 3 shows the fuzzer rankings that are produced by these three runs and Table 4 their correlation. We observe that all runs have very high correlation. In fact, runs 1 and 3 generate the same fuzzer ranking. Multiple repetitions of each independent run could obviously help to obtain even more stable results. However, such a design choice comes at a cost and, as we argued using Figure 1, the time might be better spent on running with different seed inputs, e.g., 100 campaigns by default in GreenBench.

The results of independent GreenBench experiments have very high correlation.

Threats to Validity
We have identified the following threats to the validity of our experiments.
Benchmark programs. The choice of benchmark programs is important when evaluating fuzzers [21] as well as benchmarking platforms. For our experiments, we used 20 benchmark programs from the FuzzBench platform. These are diverse, well-established programs from various application domains and have already been used in multiple fuzzer evaluations. However, our results may not generalize to a different selection of benchmark programs.
Fuzzers. Due to the large computational cost, we used a subset of six fuzzers for our experiments (namely, AFL [6], AFL++ [16], Eclipser [14], Entropic [11], Honggfuzz [3], and libFuzzer [4]) instead of all eleven fuzzers from the FuzzBench paper. We tried to select a diverse subset, but our results may not generalize to a different selection of fuzzers. Moreover, by building on FuzzBench, GreenBench is as applicable to different fuzzers as FuzzBench.
Large corpus. Our approach uses a large, but fixed, corpus of seed inputs per benchmark program. These inputs are used both for randomizing the initial corpus for our benchmarks and for selecting target edges. Therefore, they may influence the accuracy of our approach. We used the winning fuzzer from the FuzzBench paper (AFL++) to generate these corpora. AFL++ only adds inputs to its corpus when they increase coverage, thereby guaranteeing a diverse set of inputs. However, our results may not generalize to a different seed corpus.
Choice of GreenBench parameters. The choice of n (number of target inputs), N (number of campaigns per benchmark program), and t (time limit per campaign) can influence the accuracy and performance of our approach. To mitigate this threat, we have evaluated our approach using a range of values for all these parameters (see RQ2-5). However, our results may not generalize to different choices of GreenBench parameters.
Fuzzer non-determinism. Since fuzzers are non-deterministic, one benchmarking run may produce slightly different results from another. To mitigate this threat, we ran our default-configuration experiment three times (see RQ6).

Effectiveness metric. We use code coverage as our main effectiveness metric for fuzzers. An alternative would be to use the number of detected bugs; in fact, changes 1-3 are directly applicable to bug-based evaluations when appropriately adjusting the campaign duration such that bugs are found. However, coverage is used more often in practice since bugs are rare in real-world code. A recent study [12] discusses some of the challenges with using bugs as an effectiveness metric and, in any case, found that there is very high correlation between the two metrics. Nevertheless, our results may not generalize to a different effectiveness metric.

DISCUSSION
Is GreenBench green? GreenBench brings us a significant step forward in mitigating environmental concerns with fuzzer benchmarking. It allows fuzzer developers to speed up evaluations by orders of magnitude, and in addition, it provides knobs to adjust the accuracy guarantees depending on the stage of fuzzer development (e.g., when evaluating a pull request, or when preparing a new release). On the other hand, GreenBench does not completely eliminate environmental concerns, and more research is needed.
Implications for the fuzzing community. While we have highlighted environmental concerns, the main issue with current fuzzer benchmarking is multi-faceted and raises economic, social, and methodological questions, such as:
• Will fuzzer developers with limited financial resources still be able to publish their research at top venues?
• Will the success of fuzzers hinge on the ability to run huge numbers of experiments?
• How will a fuzzer be able to beat the state of the art without years of expensive hyper-parameter tuning?
• Will we end up with a fuzzer mono-culture consisting of minor tweaks to AFL++?
• How can artifact-evaluation committees reproduce fuzzer evaluations within short review periods and on a tight budget?
• How can we prevent fuzzer developers from over-fitting to specific, well-established benchmarks?
To continue to thrive, the fuzzing community should try to engage with these questions.
Going forward. Looking beyond our community, we can observe similar trends in machine learning, where it seems to have become an arms race to build larger and larger models using more and more resources. This puts academic institutions and small companies at a competitive disadvantage. However, there are also practices in the machine-learning community that could inspire and benefit us. For instance, machine-learning models are typically trained and evaluated on separate datasets. Assuming that the datasets are sufficiently different, this mitigates the risk of over-fitting.
Perhaps fuzzer developers should use different sets of benchmarks during development and when running final experiments for scientific publications. A potential first step could be to use GreenBench during development and FuzzBench for preparing a paper. However, more thought is needed to develop rigorous methodologies.
In GreenBench, we have proposed to randomize the initial seed corpus for different campaigns. A logical next step would be to also randomize the benchmark programs themselves. Of course, this would require a larger selection of suitable benchmark programs. Perhaps Google's OSS-Fuzz corpus [5] (containing about 850 open-source programs) could serve as a diverse set of such programs, and GreenBench or FuzzBench could randomly select programs from this set. Such a larger space of possible benchmarks could reduce the feasibility of systematic hyper-parameter tuning. During artifact evaluations, a different (possibly smaller) set of benchmarks could easily be generated (i.e., by providing a different random seed to GreenBench or FuzzBench) to validate that the key claims generalize beyond the benchmarks that were used by the authors.
In GreenBench, we have also proposed to run shorter campaigns. We have already discussed the reasoning behind this choice. However, there is one additional benefit we would briefly like to discuss here. Currently, benchmarking platforms focus on the final results (often after 24 or more hours of fuzzing) but tend to de-emphasize how these results are achieved. Two fuzzers that both achieve coverage X after 24 hours are considered to perform equally well, even if fuzzer A already reaches coverage X much earlier. In other words, fuzzer A may be superior, but due to the long running time, the advantage becomes less significant.
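To make this concrete, one could compare fuzzers by the first time they reach a given coverage level; the following sketch is purely illustrative and not a metric GreenBench implements.

```python
def time_to_coverage(snapshots, x):
    """First time (in seconds) at which coverage reaches x, or None.

    snapshots is a list of (time_s, covered_edges) pairs, e.g., one per
    coverage-measurement interval.
    """
    for time_s, covered in snapshots:
        if covered >= x:
            return time_s
    return None

# Fuzzer A reaches 1,000 edges after 20 minutes; fuzzer B only at the
# 24-hour mark. Both "tie" under a final-coverage comparison:
# time_to_coverage([(600, 800), (1200, 1000)], 1000)   # -> 1200
# time_to_coverage([(600, 300), (86400, 1000)], 1000)  # -> 86400
```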
Shorter campaign durations may not be the only solution to address this issue, but they do put more emphasis on the early stages of fuzzing campaigns, when most inputs tend to be discovered. While we tried to make sure that the fuzzer ranking generated by GreenBench is similar to the ranking generated by FuzzBench, the latter should not be considered ground truth but only part of the current state of the art, which may evolve over time.

RELATED WORK
There is a large body of work on fuzzing [18,24,27]. In this section, we focus only on fuzzer benchmarking, which is most closely related to our approach.
Guidelines for fuzzer evaluations. Over the years, several guidelines for fuzzer evaluations and benchmarking have been created; from very general guidelines for empirical evaluations [1], to more specific guidelines for randomized algorithms (including fuzzers) [7,8], and, most recently, even specific guidelines for fuzzers [21]. The latter, for instance, highlight the importance of initial seed inputs, time limits, effectiveness metrics, and benchmark programs.
Benchmarks. The last two concerns, effectiveness metrics and benchmark programs, have also motivated the creation of several different benchmark sets for fuzzers. On the one hand, there are synthetic benchmark sets, such as LAVA [15] and Fuzzle [22]. The former is based on real-world programs into which hard-to-reach bugs are added, while the latter synthesizes maze-like programs where transitions from one position to another are guarded by conditions of varying difficulty.
On the other hand, there are benchmarks, such as those in FuzzBench [2,26], Magma [19], and UNIFUZZ [23], that are based on real-world programs, where either real bugs or coverage can be used for comparing fuzzer effectiveness. Two recent studies [13,17] identified significant differences between artificial/synthetic benchmarks and ones based on real-world bugs. Another recent study [12] compared the two effectiveness metrics, namely code coverage and bugs. It found that there is very high correlation between achieved coverage and found bugs, although, surprisingly, the best fuzzer in terms of coverage may not be the best fuzzer in terms of found bugs. For all of the above benchmarks (independently of the effectiveness metric), the default campaign duration is at least 23 hours. For bug-based benchmarks, such as Magma, the actual duration may be shorter if the fuzzer finds all target bugs earlier (similar to change 4). However, the worst-case resource usage is still high.
Finally, there are also efforts to port benchmarks from the program-verification and model-checking community [9] to testing tools (including fuzzers) [10]. This may allow for comparisons beyond fuzzers; for instance, with software model checkers [25].
Properties of fuzzer rankings. A very recent study on explainable fuzzer evaluation [28] tries to explain a fuzzer ranking through properties, such as size or coverage, of the initial corpus or of the benchmark programs. Like us, they independently point out the risk of over-fitting fuzzers to specific benchmark sets, such as FuzzBench. While they aim to quantify the risk of using specific initial corpora or benchmark programs, our proposed changes aim to mitigate some of this risk.

CONCLUSION
We have presented GreenBench, the first benchmarking platform that aims to reduce the exceedingly large computational cost of fuzzer benchmarking. The default configuration of GreenBench offers a speedup of 20.2x over FuzzBench, thereby enabling much faster turnaround times. GreenBench also provides a number of knobs to tune the accuracy-vs-speed trade-off, making it possible to favor speed for incremental changes (e.g., for merging a pull request) and accuracy for larger changes (e.g., before a new fuzzer release).
In future work, we plan to investigate if and how GreenBench could be used to provide efficient regression testing for fuzzers and to pinpoint fuzzer weaknesses or even bugs. A first step toward achieving the latter could be to suggest tailored fuzzer "challenges", that is, benchmarks for which a fuzzer's effectiveness is significantly below average for short campaigns.
We also hope that the community will start using GreenBench. This would allow us to gather additional empirical and anecdotal evidence about usage scenarios (such as regression testing) where GreenBench can reliably be used as a substitute for FuzzBench or other benchmarking tools.

Figure 1: Visual comparison of GreenBench (in green) and FuzzBench (in red) with respect to the input state space of a given benchmark program (in black). FuzzBench uses the same set of initial seeds for all campaigns. In contrast, GreenBench randomly uses diverse initial corpora to run significantly more short campaigns.
Figure 2: Benchmark-generation time for different benchmark programs (x-axis: the FuzzBench programs, e.g., jsoncpp_jsoncpp_fuzzer, libpng-1.2, pcap_fuzz_both, systemd_fuzz-link-parser, proj4-2017-08-14, libjpeg-turbo-07-2017, harfbuzz-1.3.2, mbedtls_fuzz_dtlsclient, sqlite3_ossfuzz, freetype2-2017, openthread-2019-12-23, libxslt_xpath, libxml2-v2.9.2, bloaty_fuzz_target, curl_curl_fuzzer_http, openssl_x509, php_php-fuzz-parser). The bar chart shows how long it takes to generate 100 benchmarks for the given programs. The majority of time is spent on executing inputs to obtain edge-coverage information.

Table 2: Correlation between the fuzzer rankings that are produced by different benchmarking configurations. The Pearson correlation coefficient ranges from -1 (perfect negative correlation), through 0 (no correlation), to 1 (perfect positive correlation). Values in the [0.8, 1] interval are commonly considered to indicate very strong correlation, and values in the [0.6, 0.8) interval are commonly considered to indicate strong correlation.

Figure 3: Correlation with baseline configuration R for different values of N (number of campaigns per benchmark program). After about 30 campaigns, the correlation only improves minimally.

Table 4: Correlation between the fuzzer rankings for three independent benchmarking runs with our default configuration (RDN-5). We also show the correlation with baseline configuration R for comparison.