SpecBCFuzz: Fuzzing LTL Solvers with Boundary Conditions

LTL solvers check the satisfiability of Linear-time Temporal Logic (LTL) formulas and are widely used for verifying and testing critical software systems. Thus, potential bugs in the solvers' implementations can have a significant impact. We present SpecBCFuzz, a fuzzing method for finding bugs in LTL solvers that is guided by boundary conditions (BCs), corner cases whose (un)satisfiability depends on rare traces. SpecBCFuzz implements a search-based algorithm that fuzzes LTL formulas giving relevance to BCs. It integrates syntactic and semantic similarity metrics to explore the vicinity of the seeded formulas with BCs. We evaluate SpecBCFuzz on 21 different configurations (including the latest and past releases) of four mature and state-of-the-art LTL solvers (NuSMV, Black, Aalta, and PLTL) that implement a diverse set of satisfiability algorithms. SpecBCFuzz produces 368,716 bug-triggering formulas, detecting bugs in 18 out of the 21 solver configurations we study. Overall, SpecBCFuzz reveals: soundness issues (wrong answers given by a solver) in Aalta and PLTL; crashes, e.g., segmentation faults, in NuSMV, Black and Aalta; flaky behaviors (different responses across re-runs of the solver on the same formula) in NuSMV and Aalta; performance bugs (large time performance degradation between successive versions of the solver on the same formula) in Black, Aalta and PLTL; and no bug in NuSMV BDD (all versions), suggesting that the latter is currently the most robust solver.

Given that LTL solvers are used in the development of critical systems, their correctness and reliability are of paramount importance. For instance, when solvers are used as part of verification tools, incorrect solver results can leave large portions of programs under analysis unverified, with the unfortunate effect of increasing the risk of missing potential bugs and vulnerabilities. Thus, solvers are constantly engineered to fix bugs.
Testing LTL solvers is particularly challenging. The reason is that LTL solvers base their satisfiability analyses on the intrinsically complex temporal semantics of LTL formulas, which involves reasoning about infinite traces. Moreover, this complexity is increased since, in order to efficiently produce solutions, solvers are continuously optimized via analytical and heuristic-based algorithms, e.g., via machine learning-based LTL SAT prediction [39,53,55] and other techniques. This gives rise to various types of bugs, such as erroneous results (wrong satisfiability responses), system crashes (solver aborts exceptionally, e.g., due to segmentation faults), flaky behaviors (different conclusions across different re-runs of the solver on the same formula), and performance issues (large performance divergences between successive versions of the solver on the same formula).
Moreover, checking the satisfiability of non-trivial LTL formulas often depends on corner cases (rare and subtle traces) which, if solvers fail to explore them, may lead to incorrect results. This makes LTL solver development challenging, and calls for the development of effective testing techniques, aiming to uncover the different types of bugs that can occur in LTL solving software. While fuzzing approaches exist for testing SAT and SMT solvers [31,42], the problem of testing LTL solvers remains largely unexplored. To the best of our knowledge, there is no approach aiming at automatically testing LTL solvers beyond some benchmark sets of formulas [68]. Thus, there is a lack of principled methods to purposely test LTL solvers.
We fill this gap by proposing SpecBCFuzz, a fuzzing method for LTL solvers that is guided by the particularities of LTL semantics. The key idea is to generate formulas that are likely to drive the solvers towards corner cases, a general principle that has been applied successfully in other testing areas, e.g., Boundary-Value Analysis [2]. In our context, corner cases are formulas whose (un)satisfiability depends on few and unique traces (cases that require a "global" analysis, forcing the solver to make a complete computation and cross-check of the entire formula). In contrast, formulas for which solvers can conclude their (un)satisfiability from multiple common traces are less interesting, because they do not require an in-depth exploration of the entire formula semantics (in a sense, they offer multiple opportunities based on which one can determine satisfiability).
To generate corner cases, we rely on goal divergences, a concept that originated in goal-oriented requirements engineering [74]. When a specification S is composed of a (satisfiable) conjunction of goals G_1, …, G_n (each G_i is an LTL formula), a divergence, also known as a boundary condition, is a condition BC whose occurrence makes the specification inconsistent, i.e., its conjunction with the whole set of goals is unsatisfiable, while its conjunction with any strict subset of the goals remains satisfiable. As an example, consider a mine pump controller [45] with the following two goals: "the pump shall be on when the water level is above the high threshold", and "the pump shall be off when methane is detected in the mine". These can be formalized as G_1: □(h → p) and G_2: □(m → ¬p), where h, m and p stand for "high water", "methane" and "pump on", respectively. Although these goals are globally consistent, e.g., they are satisfiable in cases where methane is not detected in the mine, they are conditionally inconsistent (unsatisfiable) when the water level is high and methane is present at the same time. Thus, a boundary condition for G_1 and G_2 is BC: ◇(h ∧ m).
Our approach works in two steps. First, it considers a satisfiable (seed) formula S and produces a set of unsatisfiable formulas {S ∧ bc_i} based on the set of boundary conditions bc_i for S, automatically generated from S. Then, it generates mutated versions S′ of the formula S, based on which a set of potentially unsatisfiable formulas {S′ ∧ bc_i} are produced. Thus, our aim, given an original seed formula, is to explore its vicinity, i.e., formulas that are close to the original, together with the vicinity of the possible divergences of the formula, as illustrated in Figure 1. Starting from a satisfiable formula S, SpecBCFuzz produces a set of unsatisfiable formulas {S ∧ bc_i} by computing a set of boundary conditions bc_i for S, and then explores their nearby solutions (shaded areas).
We claim that boundary conditions are good for triggering faults in LTL solvers because they enjoy two important (semantic) properties: i) the conjunction of S with BC moves the specification from the satisfiable plane to the unsatisfiable plane, thereby further challenging the solvers; ii) it forces the solvers to consider all goals together with the boundary condition to prove the unsatisfiability of the formula (the divergence definition property). Overall, boundary conditions force the solvers to perform an in-depth exploration that has a good potential to trigger bugs, as shown by our results.
Proper fuzzing also requires the selection of diverse seeds. This brings two challenges for SpecBCFuzz: 1) the selection of formulas that have many divergences, and 2) the computation of boundary conditions. In our analysis we used 25 specifications from the literature and computed a total of 346 boundary conditions, an average of 13.84 boundary conditions per specification. To efficiently produce tests, SpecBCFuzz relies on a search-based algorithm that explores the vicinity of S and finds other formulas S′ for which the boundary conditions Δ = {bc_i} (previously computed from S) remain relevant. To increase the likelihood that bc_i is also a boundary condition for S′, the search is guided to keep S′ as similar as possible to S.
SpecBCFuzz implements a Non-dominated Sorting Genetic Algorithm (NSGA-III) [26] to evolve and mutate the original specification S, guided by a multi-objective fitness function. Within this function, two similarity metrics, one syntactic and one semantic, compare the seeded formula S and the mutated formula S′. SpecBCFuzz uses the Levenshtein edit distance [52] to measure syntactic similarity and an LTL model counting heuristic [13] as a semantic-related metric. To effectively explore the search space close to S, we also consider an additional objective: to explore as many combinations as possible of the divergences captured by the boundary conditions in Δ. This objective aims at exploring the wide spectrum characterized by the divergences from the sat plane to the unsat one.
Once the mutated formulas have been generated, SpecBCFuzz conjoins them with the previously computed boundary conditions (yielding S′ ∧ bc_i) as inputs for fuzzing the LTL solvers. To bypass the oracle problem, SpecBCFuzz cross-checks the answers given by different solvers (when looking for soundness bugs), different versions of the same solver (for performance bugs), and different runs of the same version on the same formula (for flakiness bugs).
Our experimental analysis uses as seeds 25 requirements specifications collected from the literature [28], for which a set of boundary conditions are automatically computed. In total, SpecBCFuzz generates 368,716 bug-triggering formulas, revealing bugs in 18 out of the 21 solver configurations (all except the 3 configurations of NuSMV BDD). Our results also demonstrate that all search objectives of SpecBCFuzz (the use of boundary conditions, and the syntactic and semantic similarity metrics) contribute to finding bug-triggering LTL formulas. Finally, our analysis shows that SpecBCFuzz outperforms a typical grammar-based fuzzer that was specifically implemented and tuned for fuzzing LTL solvers.

BACKGROUND

LTL, SAT and Model Counting
Linear-time Temporal Logic (LTL) is a formalism used for formal property specification of reactive systems [56]. Several formal methodologies, e.g., KAOS [73], have adopted LTL to express requirements [73] and perform analyses.
LTL assumes a linear topology of time, i.e., each instant is followed by a unique future instant, and LTL formulas are evaluated over infinite traces, which can be interpreted as system executions. Let AP be a set of propositional variables. LTL formulas are inductively defined using the standard logical connectives and the temporal operators ⃝ (next) and U (until), as follows: (i) the constants true and false are LTL formulas; (ii) every p ∈ AP is an LTL formula; and (iii) if φ1 and φ2 are LTL formulas, then so are ¬φ1, φ1 ∨ φ2, φ1 ∧ φ2, ⃝φ1 and φ1 U φ2.
LTL formulas are evaluated over infinite traces of the form σ = s0 s1 …, where each si is a propositional valuation, i.e., si ∈ 2^AP. Formulas with no temporal operators are evaluated in the first valuation of a trace. Given a trace σ, ⃝φ is true in σ if and only if φ is true in σ[1..] (the trace obtained by removing the first valuation from σ), and φ1 U φ2 is true in σ if and only if there is a position j such that φ2 holds in σ[j..], and for all 0 ≤ i < j, φ1 holds in σ[i..]. We consider the typical definitions of the operators □ (always), ◇ (eventually) and W (weak until) in terms of ⃝, U and the logical connectives.
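Since satisfiable LTL formulas always admit ultimately periodic (lasso) models, the semantics above can be checked directly on a lasso trace. The following sketch is our own illustration (it is not part of SpecBCFuzz's implementation, and the tuple-based formula encoding is an assumption for presentation): it evaluates formulas built from ¬, ∧, ∨, ⃝ and U over a lasso given as a finite prefix plus a loop. The key observation is that a lasso with prefix length n and loop length L has only n + L distinct suffixes, so U can be decided by walking suffixes until one repeats.

```python
def holds(phi, trace, i=0):
    """Check LTL formula phi on the infinite lasso trace (prefix, loop) at position i."""
    prefix, loop = trace
    n, L = len(prefix), len(loop)

    def canon(j):                        # canonical suffix index (suffixes repeat with period L)
        return j if j < n else n + (j - n) % L

    def val(j):                          # propositional valuation at position j
        return prefix[j] if j < n else loop[(j - n) % L]

    def ev(f, j):
        j = canon(j)
        op = f[0]
        if op == 'true':  return True
        if op == 'ap':    return f[1] in val(j)
        if op == 'not':   return not ev(f[1], j)
        if op == 'and':   return ev(f[1], j) and ev(f[2], j)
        if op == 'or':    return ev(f[1], j) or ev(f[2], j)
        if op == 'X':     return ev(f[1], j + 1)
        if op == 'U':     # phi1 U phi2: walk the finitely many distinct suffixes
            seen = set()
            while j not in seen:
                seen.add(j)
                if ev(f[2], j):     return True
                if not ev(f[1], j): return False
                j = canon(j + 1)
            return False               # cycled through all suffixes without phi2
        raise ValueError(op)

    return ev(phi, i)

def F(f): return ('U', ('true',), f)        # eventually, as sugar over U
def G(f): return ('not', F(('not', f)))     # always, as sugar over F
```

For instance, `holds(G(F(('ap', 'p'))), ([], [set(), {'p'}]))` evaluates □◇p on the lasso that alternates between a p-free state and a p-state.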
An LTL formula φ is satisfiable (SAT) iff there exists at least one trace satisfying φ; otherwise, it is unsatisfiable (UNSAT).
Model counting computes the number of traces that satisfy a formula. Since LTL formulas are defined over infinite traces, this involves computing the number of canonical finite representations of infinite traces, such as lasso traces, as also done in bounded model checking [8]. Since computing the exact number of lasso traces is expensive [32], Brizzio et al. [13] proposed a bounded model counting approximation to compute the number of lasso traces satisfying an LTL formula, based on a symbolic LTL automata representation and matrix multiplication. Thus, given an LTL formula φ and a bound k, we denote by #Approx(φ, k) the approximated number of lasso traces of length k satisfying φ. SpecBCFuzz employs #Approx model counting to compare the semantics of LTL formulas.
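To make the idea of counting lasso traces concrete, the brute-force sketch below exactly counts the loop-only lassos of a given length k (a simplification we assume here; the prefix is taken to be empty) that satisfy a formula, by exhaustive enumeration over all loops. This is our own illustration of the quantity that #Approx estimates; the actual heuristic of Brizzio et al. [13] uses symbolic automata and matrix multiplication and scales far beyond this naive enumeration.

```python
from itertools import chain, combinations, product

def evaluate(phi, loop, j=0):
    """Evaluate an LTL formula (nested tuples) on the infinite trace loop^omega at position j."""
    L = len(loop)
    op = phi[0]
    if op == 'true':  return True
    if op == 'ap':    return phi[1] in loop[j % L]
    if op == 'not':   return not evaluate(phi[1], loop, j)
    if op == 'and':   return evaluate(phi[1], loop, j) and evaluate(phi[2], loop, j)
    if op == 'or':    return evaluate(phi[1], loop, j) or evaluate(phi[2], loop, j)
    if op == 'X':     return evaluate(phi[1], loop, (j + 1) % L)
    if op == 'U':     # only L distinct suffixes exist: inspect each once
        for step in range(L):
            pos = (j + step) % L
            if evaluate(phi[2], loop, pos):     return True
            if not evaluate(phi[1], loop, pos): return False
        return False
    raise ValueError(op)

def count_models(phi, ap, k):
    """Number of loop-only lassos of length k over vocabulary ap that satisfy phi."""
    valuations = [frozenset(s) for s in chain.from_iterable(
        combinations(ap, r) for r in range(len(ap) + 1))]
    return sum(1 for loop in product(valuations, repeat=k)
               if evaluate(phi, list(loop)))
```

For AP = {p} and k = 2 there are four candidate loops, of which three satisfy ◇p and exactly one (the all-p loop) satisfies □p.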

LTL Solvers
LTL satisfiability is a decidable problem [67], and tools implementing such checking are called LTL (SAT) solvers. Existing LTL solvers are designed to be efficient [68], to support diverse temporal operators [37], and to be expressive [48]. Some of the best-known techniques for LTL solving include those based on bounded model checking (BMC), on binary decision diagrams (BDDs), on tableaux, and on automata-theoretic approaches.
Bounded Model Checking (BMC) encodes LTL formulas as propositional formulas [8,24,49], for a given bound k. Each satisfying valuation of the encoding corresponds to a lasso trace of k states. NuSMV [19] implements a traditional BMC algorithm.
BDD-based Model Checking employs a symbolic representation of LTL formulas as binary decision diagrams (BDDs), based on the formula's elementary sub-formulas [21,23]. The BDD is built by applying rules that capture the semantics of the LTL operators. The satisfying instances are obtained by traversing paths in the BDD. NuSMV [19] and PLTL implement a traditional BDD-based SAT algorithm.
Tableau is a well-known logical satisfiability approach, based on decomposing the formula being assessed according to the semantics of its logical operators, to search for satisfying valuations. While a tableau for a propositional formula is a finite tree capturing its semantics, a tableau for an LTL formula is a graph capturing the semantics of the temporal operators in the formula.
There exist different variants of the process to build the tableau structure, classified as the so-called one-pass and multi-pass tableaux methods. Among these, PLTL [69] implements Schwendimann's method, while BLACK [36,37] implements Reynolds' method [66].
Automata-based approaches encode LTL formulas into automata, and then check for the emptiness of the languages corresponding to such automata [47,75]. Different automata representations are used to improve the performance of solvers, such as alternating automata (more efficient and succinct than other kinds of automata) [47,75]. Additional heuristics are also used. For instance, Aalta (v.1) implements an on-the-fly automaton exploration [50].
Coverage-guided fuzzers, such as the American Fuzzy Lop (AFL) and its variants, implement various fuzzing strategies. For instance, AFLFast implements grey-box fuzzing guided by code coverage [16]; AFLGo implements a simulated annealing search guided by an inter-procedural distance measure [12]; and Zest implements a property-based fuzzing mechanism [63,64].
Grammar-based Fuzzing typically takes a (context-free) grammar that describes the shape of the inputs, and produces random inputs by traversing the grammar production rules. These fuzzers are commonly used for testing compilers. Their objective is to efficiently produce many and diverse inputs, increasing the chances of triggering syntactic or semantic bugs.
Seeds bootstrap the bug-finding process [40]. Good seeds are typically collected from a large number of representative cases or domain-specific scenarios, and have some meaningful semantics for the software under test. For instance, seeds can be built by crawling the internet [40], or can be user-provided [59,61]. Collected seeds are exploited by mutation and search-based strategies for generating semantically meaningful inputs.
Mutation Fuzzing introduces small changes to the given seeds with the aim of triggering additional behaviors in the target system. Preferably, the mutations should maintain the syntactic validity of the input. Common mutation operators remove, insert, and flip single elements of the inputs.
Search-based Fuzzing extends mutation fuzzing by also guiding seed selection and evolution with one or more fitness functions [65]. The optimized meta-heuristics are domain-specific for the target system. Typical fitness metrics focus on code coverage and the structure of the generated inputs.
In this work, we present SpecBCFuzz, a search-based fuzzing approach that takes LTL specifications with boundary conditions as seeds. SpecBCFuzz implements a set of evolutionary operators (mutation and crossover) to evolve LTL formulas, and two similarity metrics (one syntactic and one semantic) to search in the vicinity of the given seeds, with the aim of finding critical bugs, e.g., soundness issues, crashes, and flakiness problems, in LTL solvers.

THE SPECBCFUZZ APPROACH
Figure 2 shows an overview of SpecBCFuzz. It follows four steps to generate and evolve LTL formulas for testing LTL solvers. Firstly, given a seeded formula S (a goal-oriented requirements specification), SpecBCFuzz computes a set Δ = {bc_1, …, bc_n} of boundary conditions that capture divergences in S.
The second step focuses on evolving S into other LTL formulas S′ that are likely to also be divergent with respect to the boundary conditions in Δ. To do so, SpecBCFuzz implements a multi-objective search algorithm (more precisely, NSGA-III [26]) that applies genetic operators (mutation and crossover) to produce new formulas. The multi-objective function SpecBCFuzz relies on is driven by formula similarity metrics; it seeks to increase the likelihood that the new formulas are divergent with respect to the boundary conditions computed from the seeded specification S.
Once a formula S′ has been produced, it is conjoined with each bc_i to form a new set {S′ ∧ bc_i} with which SpecBCFuzz tests each solver (step 3). In our experiments we considered 21 configurations of four mature and state-of-the-art LTL solvers, including their corresponding latest versions, namely NuSMV, Aalta, PLTL and Black. Of course, other solvers and future versions of the considered solvers can easily be integrated in the future.
To bypass the oracle problem, SpecBCFuzz relies on differential testing to cross-check the behavior of the different solvers and runs (step 4). More precisely, SpecBCFuzz looks for soundness bugs (different solvers giving different satisfiability answers), crashes (a solver aborts the execution exceptionally), flakiness (different runs of the same solver version on the same formula yield different answers), and performance bugs (large performance differences between different versions of the same solver on the same given formula).
In what follows, we detail the above-mentioned steps and, in particular, how the evolutionary search process and the solver testing steps intertwine.

Computing Divergences
In goal-oriented methodologies, e.g., KAOS [73], requirements are organized as a set of domain properties (Dom) and a set of goals ({G_i}). Intuitively, the goals are the properties we expect the system to achieve, while domain properties capture assumptions and descriptive statements about the environment, e.g., physical or normative laws. Since requirements descriptions can be ambiguous and incomplete, and different stakeholders may have different expectations of the system, specified goals can contradict one another (i.e., they can be conflicting) [73,74]. If there is a strong conflict between the goals and the domain properties, they cannot be satisfied together (Dom ∧ ⋀_i G_i ⊨ false), and the specification is said to be inconsistent.
For consistent goals, there exists a weaker form of conflict, named divergence [73,74]. It represents a condition whose occurrence makes the goals inconsistent (i.e., they cannot be satisfied under the condition that the divergence occurs). Formally, a set G = {G_1, …, G_n} of goals is divergent with respect to Dom if there exists a boundary condition (a formula) BC such that the following conditions hold together:

(1) Dom ∧ G_1 ∧ … ∧ G_n ∧ BC ⊨ false (inconsistency);
(2) Dom ∧ ⋀_{j≠i} G_j ∧ BC ⊭ false, for each 1 ≤ i ≤ n (minimality);
(3) BC ≠ ¬(G_1 ∧ … ∧ G_n) (non-triviality).

The first condition establishes that, when BC holds, the whole set of goals cannot be simultaneously satisfied. The second condition states that, if any single goal is disregarded, then consistency is recovered. It also prevents BC from being false, since BC has to be consistent with the domain Dom. The third condition prohibits a boundary condition from being simply the negation of the goals. SpecBCFuzz relies on divergences and their corresponding boundary conditions to create LTL formulas that force solvers to make in-depth analyses to conclude about satisfiability, leaning on these three properties of boundary conditions. The rationale is that, since BC semantically connects all the goals of the formula, it can complicate the variable splitting heuristics implemented by the solvers, forcing them to perform a more exhaustive search.
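The three divergence conditions can be checked mechanically with any LTL satisfiability procedure. The sketch below is our own illustration (not SpecBCFuzz's implementation): `is_sat` is an assumed callback that invokes an LTL solver on a formula string, and the `G`/`F`/`&`/`!`/`<->` syntax is a hypothetical solver input format.

```python
def conj(parts):
    """Conjoin a list of formula strings."""
    return ' & '.join(f'({p})' for p in parts)

def is_boundary_condition(dom, goals, bc, is_sat):
    """Check the divergence conditions for the candidate bc.

    dom, goals and bc are formula strings; is_sat(f) -> bool is an LTL
    satisfiability oracle (e.g., a wrapper around a solver's CLI).
    """
    # (1) Inconsistency: dom & all goals & bc must be UNSAT.
    if is_sat(conj([dom] + goals + [bc])):
        return False
    # (2) Minimality: dropping any single goal must restore satisfiability.
    for i in range(len(goals)):
        rest = goals[:i] + goals[i + 1:]
        if not is_sat(conj([dom] + rest + [bc])):
            return False
    # (3) Non-triviality: bc must not be equivalent to the negated goals,
    #     i.e., bc <-> !(g1 & ... & gn) must not be valid.
    if not is_sat(f'!(({bc}) <-> !({conj(goals)}))'):
        return False
    return True
```

For the mine pump example, `is_boundary_condition('true', ['G(h -> p)', 'G(m -> !p)'], 'F(h & m)', is_sat)` would return true with any sound oracle, matching the boundary condition ◇(h ∧ m) discussed earlier.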
To have meaningful formulas, SpecBCFuzz uses requirements specifications from the literature as seeds, and computes their boundary conditions using the automated approach by Degiovanni et al. [28]. However, because this approach and its alternatives [29,54] are computationally expensive, their application throughout the search process (i.e., on all formulas S′ that SpecBCFuzz generates in the vicinity of S) would require prohibitively expensive computation time (in our experiments, the approach in [28] requires 2,183 seconds for computing boundary conditions, on average). This is a clear obstacle for fuzzing, which relies on fast input generation.

Search Process
Algorithm 1 describes how SpecBCFuzz combines and intertwines the evolutionary search steps (NSGA-III) and the on-the-fly testing of the LTL solvers.
Given a seeded formula S = (Dom, G), SpecBCFuzz starts by computing a set Δ of boundary conditions capturing S's divergences (Line 2). In Line 3, it initializes the sets where the different kinds of bug-triggering inputs will be saved, and in Line 4 it initializes the set of candidate inputs (population) P, i.e., the NSGA-III set of non-dominated individuals, with the seeded formula S.
From Lines 5-16, the algorithm relies on specific features of NSGA-III, such as the prioritization, selection and evolution of individuals (formulas) for fuzzing the LTL solvers, until a termination condition is met, e.g., a specific execution time or number of generations is reached. Particularly, in Line 6, the algorithm picks I, one of the non-dominated candidate formulas produced so far. Then, in Line 7, SpecBCFuzz applies the implemented genetic operators to I, i.e., the mutation and crossover operators especially designed for manipulating LTL formulas (see Section 3.3), and iterates over each generated formula S′.
In this inner loop, for each boundary condition bc ∈ Δ, SpecBCFuzz generates a candidate bug-triggering formula t by conjoining it with the mutated formula, yielding t = S′ ∧ bc (Line 9). Then, it invokes each solver under test and gathers their corresponding outputs, i.e., o_i = solver_i(t), 1 ≤ i ≤ n (Line 10). SpecBCFuzz analyses the outputs and, whenever a bug is detected, it saves relevant information useful to reproduce it later, e.g., the solver's name, version and configuration, as well as the bug-triggering formula t. In Line 13, SpecBCFuzz computes the multi-objective fitness value for the mutated formula S′ (see Section 3.4). Finally, SpecBCFuzz updates the set P of non-dominated individuals according to the just-computed fitness values for S′, which can survive into the next generations. In the remainder of this section, we present the details regarding the genetic operators and the multi-objective fitness computation.
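The final update of the non-dominated population follows the standard Pareto-dominance rule used by NSGA-III: an individual survives only if no other individual is at least as good on every objective and strictly better on at least one. A minimal sketch of this bookkeeping (our illustration; SpecBCFuzz itself delegates this to an NSGA-III implementation, and we assume here that all objectives are maximized) is:

```python
def dominates(a, b):
    """a dominates b if a is >= on every objective and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def update_non_dominated(front, candidate):
    """Insert candidate (a tuple of fitness values) into the Pareto front."""
    if any(dominates(f, candidate) for f in front):
        return front                           # candidate is dominated: discard it
    # drop the members the candidate dominates, then add the candidate
    return [f for f in front if not dominates(candidate, f)] + [candidate]
```

With SpecBCFuzz's three objectives, a fitness tuple could look like (semantic similarity, syntactic similarity, surviving boundary conditions); incomparable tuples coexist on the front.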

LTL Genetic Operators
SpecBCFuzz implements the standard mutation and cross-over operators for LTL formulas [28,54].
To illustrate these operators, consider the formula F: □(p → ¬r ∧ q). Figure 3 shows four examples of mutations that SpecBCFuzz applies. For instance, mutant m1 replaces the unary operator □ in F by ◇; mutant m2 replaces one atomic proposition by another; in m3, the binary operator → is replaced by ∨; while mutant m4 removes part of the formula. All mutations are guaranteed to respect the LTL syntax, and can produce any LTL formula over the vocabulary used in the seeded formula S, i.e., no new atomic proposition is added. The crossover operator takes two goals F1 and F2 from the same individual I (or from a different individual previously generated), and two sub-formulas randomly selected from these goals, yielding two new formulas O1 and O2. For example, given F1: □(p → q) and F2: ◇(r ∧ ¬p) as in Figure 3, the crossover operator might select sub-formula ψ: p → q from F1 and sub-formula γ: ¬p from F2. Then, it proceeds to swap the sub-formulas by replacing ψ by γ in F1, and vice versa in F2, leading to two new formulas O1: □(¬p) and O2: ◇(r ∧ (p → q)). This operator guarantees by construction to produce syntactically valid LTL formulas.
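Representing formulas as trees makes the crossover swap straightforward. The sketch below is our own illustrative encoding (nested tuples with the operator tag first; not SpecBCFuzz's actual data structures), and reproduces the □(p → q) / ◇(r ∧ ¬p) example with deterministic sub-formula paths instead of random selection:

```python
def get(f, path):
    """Return the sub-formula of f at path (a sequence of child indices)."""
    for i in path:
        f = f[i + 1]          # child i is stored right after the operator tag
    return f

def put(f, path, g):
    """Return a copy of f with the sub-formula at path replaced by g."""
    if not path:
        return g
    i = path[0]
    return f[:i + 1] + (put(f[i + 1], path[1:], g),) + f[i + 2:]

def crossover(f1, p1, f2, p2):
    """Swap the sub-formulas of f1 at p1 and f2 at p2, yielding two offspring."""
    return put(f1, p1, get(f2, p2)), put(f2, p2, get(f1, p1))
```

Because `put` only rewrites along a path of an already well-formed tree, the offspring are syntactically valid by construction, mirroring the guarantee stated above.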

Multi-objective Fitness
For each candidate variant S′ generated by the application of the genetic operators, SpecBCFuzz computes the fitness value for the three objectives that guide the search (Line 13 in Algorithm 1). Since computing a new set of boundary conditions for each candidate is practically infeasible (requiring on average 2,183 seconds [28]), these three objectives aim at driving SpecBCFuzz's search process towards formulas S′ for which the boundary conditions in Δ capture divergences in S′ as well. The three objectives are: the semantic similarity between S′ and S, their syntactic similarity, and the approximated number of boundary conditions that remain relevant for S′.
The rationale behind semantic similarity is that two semantically close formulas have a higher likelihood of having boundary conditions in common. Thus, our semantic similarity function, denoted by Sem(S, S′), computes the ratio between the number of behaviors common to S′ and S over the union of their behaviors. Since computing the set of lasso traces of an LTL formula is computationally prohibitively expensive, SpecBCFuzz instead relies on an efficient model counting heuristic [13], which approximates the number of accepted lasso traces of any formula (cf. Section 2.1). Hence, given a bound k for the lasso traces, the semantic similarity between S and S′ is computed as:

Sem(S, S′) = #Approx(S ∧ S′, k) / #Approx(S ∨ S′, k)

where #Approx(S ∧ S′, k) is the approximated number of accepted lasso traces for S ∧ S′ (intersection) and #Approx(S ∨ S′, k) is the approximated number for S ∨ S′ (union). Small values of Sem(S, S′) indicate that the behaviors described by S′ deviate too much from those described by the seeded formula S, as they have few behaviors in common. A value close to 1 means that the two formulas share most of their corresponding behaviors.
In the case where S ′ is unsat or contradicts S, we set the semantic similarity to 0 to discard this unsatisfiable formula in subsequent iterations.
Syntactic similarity, denoted by Syn(S, S′), is another objective we use to further support semantic similarity. To measure it, SpecBCFuzz uses the Levenshtein distance [52] to compute the distance between the textual representations of the formulas. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. Hence, Syn(S, S′) is computed as the ratio between the number of tokens changed from S to obtain S′ and the number of tokens of the largest formula (maxLength = max(length(S), length(S′))). Specifically:

Syn(S, S′) = Levenshtein(S, S′) / maxLength

Since SpecBCFuzz aims at generating formulas for which the boundary conditions in Δ remain relevant, it uses a simple heuristic to count the number of boundary conditions bc_i ∈ Δ that remain unsatisfiable in conjunction with S′:

#BCs(S′, Δ) = Σ_{i=1}^{n} majUNSAT(S′, bc_i)

where majUNSAT(S′, bc_i) checks whether the majority of the solvers' outputs o_1, …, o_m indicate that formula S′ ∧ bc_i is unsat. Taking the majority as the correct answer helps SpecBCFuzz to be robust in cases where formula S′ ∧ bc_i triggers a bug in a solver.
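The syntactic objective reduces to a token-level edit distance normalized by the longer formula. A minimal sketch (our illustration; we assume formulas are tokenized by whitespace, whereas the actual implementation may tokenize differently):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution (free on match)
        prev = cur
    return prev[-1]

def syn(s, s_prime):
    """Normalized syntactic distance between two formula strings."""
    a, b = s.split(), s_prime.split()
    return levenshtein(a, b) / max(len(a), len(b))
```

For example, mutating only the outermost temporal operator of a 6-token formula yields a distance of 1/6, so the search treats it as very close to the seed.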
Overall, the three objectives guide SpecBCFuzz to search in the vicinity of S for formulas that have high chances of being divergent with respect to the boundary conditions in Δ, making them good candidates for triggering bugs in LTL solvers.

Differential Testing
SpecBCFuzz, inspired by differential testing, defines simple oracles to detect four kinds of bugs. Given the input formula t = S′ ∧ bc_i and the solvers' outputs o_1, …, o_m, SpecBCFuzz first computes the number of sat responses (#sat = #{o ∈ o_1, …, o_m | o = sat}) and unsat responses (#unsat = #{o ∈ o_1, …, o_m | o = unsat}). Then, the expected outcome expected(t) is defined as the majority answer among the solvers' outputs. For example, let us assume that o_1 ≠ expected(S′ ∧ bc_i), i.e., solver_1 produces an output different from the expected one. SpecBCFuzz will first re-run solver_1(S′ ∧ bc_i) 100 times, in order to confirm that the solver consistently produces the same output for the given input. If the solver always produces the same unexpected output, then the bug is confirmed, and information regarding the solver and the bug-triggering formula is added to the corresponding set. Precisely, if o_1 is sat/unsat, i.e., the solver produces an incorrect output, this is a soundness bug and the bug info is added to the set B_sound; otherwise, if o_1 is unknown because the solver always crashes on the same input, the bug info is added to B_crash.
In the case that, after re-running the same solver multiple times with the same input, it produces an output different from the first observed o_1, e.g., o_1 was sat but the re-runs often produced unsat, we consider this behavior flaky and add the bug info to the set B_flaky.
Finally, for each input formula, we compute the average execution time avg taken by the solvers to produce a valid (sat/unsat) outcome. Then, when performing the re-runs, if a solver takes more than 300 times the average execution time to produce the output (i.e., the execution time is greater than 300 × avg), or it reaches a predefined timeout of 24 hours (used only for the re-runs), then SpecBCFuzz considers this a potential performance bug and warns the tester by adding the corresponding bug info to the set B_perf. After finalizing the fuzzing campaign, SpecBCFuzz returns the sets of the identified bug-triggering inputs.
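The oracle described above can be summarized as a small classifier over a solver's initial answer, its re-run answers, and its timing. The sketch below is our own condensation (answer labels, bug names, and the ordering of the checks are presentation choices; only the majority-vote expectation, the re-run confirmation, and the 300x threshold come from the text):

```python
def classify(first, reruns, expected, rerun_time=0.0, avg_time=1.0):
    """Classify a suspicious solver behavior.

    first: the initially observed answer ('sat', 'unsat' or 'crash');
    reruns: answers observed when re-running the solver on the same input;
    expected: the majority answer across all solvers on this input;
    rerun_time / avg_time: timing data for the 300x performance threshold.
    """
    if rerun_time > 300 * avg_time:
        return 'performance'
    if any(r != first for r in reruns):
        return 'flaky'            # non-deterministic answers across re-runs
    if first == expected:
        return 'ok'
    if first == 'crash':
        return 'crash'            # consistently aborts on this input
    return 'soundness'            # consistently wrong sat/unsat answer
```

For instance, a solver that answers sat on every re-run while the majority says unsat is flagged as a soundness bug, whereas mixed answers across re-runs are flagged as flaky regardless of the majority.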

EXPERIMENTAL SETUP

Research Questions
We investigate the following research questions:
• RQ1: How effective is SpecBCFuzz in revealing bugs in LTL solvers? How robust are the solvers?
• RQ2: How does each objective of the fitness function contribute to SpecBCFuzz's effectiveness?
• RQ3: Do boundary conditions and vicinity exploration provide effective guidance to reveal LTL solver bugs?
RQ1 evaluates the effectiveness of our approach in revealing bugs in the studied LTL solvers. Thus, we execute SpecBCFuzz against 21 configurations of four solvers (see Section 4.3). First, we report the number of soundness, crash, flakiness, and performance issues that were triggered by the formulas produced by our fuzzing campaigns. We then perform an analysis of the bug-triggering inputs to categorize the inconsistent executions into more general buggy patterns, i.e., symptoms we observe on one or more specific versions of a solver. Intuitively, two formulas t1 and t2 are in the same cluster if they trigger the same buggy symptom, e.g., a crash with the same error message, in the same version of the same solver. Thus, a buggy pattern corresponds to one or more concrete bugs to fix in the concerned solver versions.
To answer RQ2, we conduct an ablation study to quantify the contribution of each objective of SpecBCFuzz's fitness function to its effectiveness. Thus, we disable each objective (separately) and compare the number of bug-triggering formulas (for each previously identified buggy pattern) that SpecBCFuzz reports.
RQ3 aims to validate the two main hypotheses that SpecBCFuzz relies on: 1) specifications with boundary conditions are good seeds for revealing bugs, and 2) exploring the local vicinity of the original formula is effective in triggering faults. A superior effectiveness of SpecBCFuzz would validate our hypotheses and the principles our approach relies on. Hence, in addition to SpecBCFuzz (which uses the set {S ∧ bc} ∪ {S′ ∧ bc} as seeds), we repeat our experiments on multiple baselines:
• SpecBCFuzz using {S ∧ bc} as seeds (boundary conditions without vicinity exploration);
• SpecBCFuzz using {S} ∪ {S′} as seeds (vicinity exploration without boundary conditions);
• a probabilistic grammar-based fuzzing approach we fine-tuned to generate random LTL formulas, i.e., broad exploration without boundary conditions;
• a large benchmark of 3,723 LTL formulas, of different complexities, used for studying the performance of LTL solvers [68].

Seeds
Our evaluation considers a total of 25 LTL formulas collected from the literature and different benchmarks. These formulas were previously used by several approaches for the identification and resolution of divergences [1, 17, 27-29, 54, 74]. Table 1 summarizes the number of LTL formulas in each seeded specification (#S) and the number of boundary conditions (#Δ) computed with the approach by Degiovanni et al. [28].

Considered Solvers
We aim to assess the effectiveness of SpecBCFuzz in finding bugs in LTL solvers that implement a diverse set of algorithms and heuristics, where the state representation is symbolic or concrete, and the search is automata-based, tableau-based, or guided by propositional solving. For this reason, in addition to the latest version of each solver, we also consider some past versions that can potentially reveal different kinds of buggy behaviors.
In total, we use 21 solver configurations in the evaluation, summarized in Table 2.
In particular, NuSMV is one of the most widely adopted solvers for analyzing LTL requirements [19]. In total, we consider 6 configurations for NuSMV, including two SAT algorithms (BMC and BDD) for the latest (2.6.0) and past versions (2.5.4, 2.4.3).
Aalta is a satisfiability algorithm with a concrete-state representation that builds the automata capturing LTL formulas on the fly. We consider Aalta version 2 [51] and version 1 [50], which implement different search algorithms for building the automata.
PLTL is a traditional solver [68] with three versions, each implementing a different algorithm. The latest version is a symbolic approach based on BDDs [34,58], while the two previous versions are based on two different algorithms for computing the tableau [69,78].

Setting and Evaluation
SpecBCFuzz is implemented in Java using the JMetal framework [62], which instantiates the NSGA-III algorithm and integrates all the solvers we test. It also uses the Owl library [46] to parse and manipulate the LTL formulas. We make our tool, seeded formulas and divergences, as well as the bug-triggering formulas, publicly available at: https://github.com/SpecBCFuzz/repo. We configure several parameters of the NSGA-III algorithm based on the findings of other studies and exploratory analyses. In particular, at every generation we preserve a population of 100 individuals; the mutation operator is always applied to selected individuals, while the crossover is applied with a probability of 0.1. Moreover, for computing the semantic similarity objective, we set a timeout of 30 seconds for the model counting approach. We run the fuzzing campaigns for 48 hours or until 1,000 individuals (S′) are generated, whichever happens first, with no re-starts. Notice that, assuming we have, for instance, 10 boundary conditions in Δ, the fuzzing campaign will produce 1,000 × 10 input formulas to test the 21 solver versions, i.e., a total of 210,000 sat invocations. We set a timeout of 300 seconds per solver invocation.
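The campaign-size arithmetic above can be made explicit with a small helper (the function name is ours, introduced only for illustration):

```python
def sat_invocations(individuals_per_bc, n_bcs, n_solver_configs):
    # One formula is generated per individual, per boundary condition, and
    # every generated formula is submitted to every solver configuration.
    return individuals_per_bc * n_bcs * n_solver_configs
```

With 1,000 individuals, 10 boundary conditions, and 21 solver configurations, this yields the 210,000 sat invocations mentioned above.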
To answer RQ2, we run SpecBCFuzz with some of the objectives disabled, and to answer RQ3, we fine-tune a (probabilistic) grammar-based fuzzer [79]. The grammar-based fuzzer produces random LTL formulas by randomly visiting non-terminal and terminal nodes. We varied several parameters to produce more diverse LTL formulas: the maximum number of literals (ranging from 1 to 9), the maximum number of non-terminals reached (ranging from 2 to 10), the terminal choice probability (from 0.20 to 0.44), and the Boolean constant probability (ranging from 0.05 to 0.29). Moreover, the maximum number of formulas generated is 20,000 (20 times more than with SpecBCFuzz).
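A probabilistic grammar-based generator of the kind described above can be sketched as follows; the operator set, the branching probability, and the parameter names are illustrative, not the exact grammar of the baseline in [79]:

```python
import random

UNARY = ["!", "X", "F", "G"]          # negation and temporal operators
BINARY = ["&", "|", "->", "U", "W"]   # Boolean and temporal connectives

def random_ltl(max_depth, literals, p_terminal=0.3, p_constant=0.1,
               rng=random):
    # Stop at a terminal either when the depth budget runs out or with
    # probability p_terminal; terminals are Boolean constants (with
    # probability p_constant) or literals.
    if max_depth <= 0 or rng.random() < p_terminal:
        if rng.random() < p_constant:
            return rng.choice(["true", "false"])
        return rng.choice(literals)
    if rng.random() < 0.4:
        sub = random_ltl(max_depth - 1, literals, p_terminal, p_constant, rng)
        return f"{rng.choice(UNARY)}({sub})"
    left = random_ltl(max_depth - 1, literals, p_terminal, p_constant, rng)
    right = random_ltl(max_depth - 1, literals, p_terminal, p_constant, rng)
    return f"({left} {rng.choice(BINARY)} {right})"
```

Varying `p_terminal` and `p_constant` over ranges like the ones listed above biases the generator toward shorter or longer formulas.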
We ran SpecBCFuzz and the grammar-based fuzzer on an HPC cluster. Each node has a Xeon E5 2.4GHz processor, with 16 CPUs per node and 4GB of memory per CPU. The operating system is CentOS Linux, version 7. Since both tools are stochastic, we repeat the experimental process 10 times to avoid random bias.

EXPERIMENTAL RESULTS
RQ1: Robustness and Failure Patterns

SpecBCFuzz Effectiveness. Table 3 summarizes the number of formulas generated by SpecBCFuzz that trigger a bug in each solver configuration, amounting to a total of 368,716 unique bug-triggering formulas (the same formula can reveal different symptoms in other solvers and versions). Noticeably, SpecBCFuzz revealed diverse kinds of faults in 18 out of the 21 studied solver configurations. The only exceptions were all versions of NuSMV BDD, suggesting that this is the most robust solver (according to our experiments).
Overall, crashes and flakiness bugs are the predominant observed symptoms. It is worth remarking that flakiness is essentially a soundness issue too, since it implies the generation of incorrect results, affecting the robustness and correctness of solvers. Regarding performance, we observed very clear issues in cases in which some solvers do not produce an output within 24 hours of execution, while others do so very efficiently. This possibly points to hanging loops in the implemented algorithms.
NuSMV. Cluster 1 includes a set of LTL formulas that trigger a crash ("segmentation fault") in all the versions of NuSMV BMC. The second cluster characterizes a set of LTL formulas that reveal a flaky behavior in BMC versions 2.6.0 and 2.5.4, since from time to time these solver configurations crash when run on the same input formula. We reported these two bugs to the corresponding development team, who replied that they are currently investigating the issues. Clusters 3 and 4 characterize crashes and flaky executions (often crashes without error messages) in past BMC versions of the solver (2.5.4 and 2.4.3). Notice that the bugs triggered by Clusters 3 and 4 have been fixed by the developers in the latest version of the solver.
Black. LTL formulas in Cluster 7 are of the form "((φ1 W φ2))". These formulas trigger a parser failure in Black version 0.5.2, which has been fixed in recent versions. Cluster 5 contains formulas that trigger a crash (with a "killed" message) in all versions of Black. We reported this bug to the developers, and it was confirmed, although it has not yet been fixed. For formulas in Cluster 6, no version of Black can produce an outcome within 24 hours of execution. SpecBCFuzz's oracles classified this symptom as a performance bug, but it can also point to bugs that lead to a hanging loop. We reported this bug, and the developers answered that they are investigating the issue.
Aalta. SpecBCFuzz generated LTL formulas that trigger all kinds of failures in Aalta (both versions). Cluster 8 includes formulas that induce Aalta to return an incorrect satisfiability answer, i.e., soundness issues. Cluster 9 contains formulas triggering different kinds of crashes. Clusters 10 and 11 capture different sets of formulas that induce flaky behavior in the two versions of Aalta, producing different satisfiability results from time to time. Clusters 12, 13 and 14 capture different sets of LTL formulas that reveal different symptoms in Aalta v1 and v2, leading to crashes, flakiness, and performance bugs (as cataloged according to SpecBCFuzz's oracles). Again, formulas for which Aalta does not produce an output within 24 hours can potentially point to hanging loops in the code. We reported the 6 bugs from Clusters 8-9 and 11-14, affecting the latest version (v2) of the solver, to the developers (the issues corresponding to Cluster 10, affecting v1, were solved in v2). We received confirmation of 2 out of these 6 bugs (Clusters 8 and 11), and the developers are investigating the remaining four issues.
PLTL. Cluster 15 contains LTL formulas of the form "□¬(f)", for which PLTL BDD always produces an incorrect satisfiability answer. We reported this bug to the developers, but we have received no response yet. Finally, Cluster 16 contains LTL formulas for which the past versions of the solver, implementing different tableau algorithms, cannot produce an outcome within 24 hours of execution, while other solvers answer in seconds. These performance issues were not observed in the latest version of the solver, which is based on BDDs.
Overall, 3 out of the 16 bugs found have been confirmed by the developers, 7 are currently under analysis by the developers, and 5 were fixed in follow-up versions of the solvers. We have not yet received a response from the corresponding development team for 1 of the reported bugs.
Additionally, we collected 315 clusters of LTL formulas that lead different combinations of solvers to reach the analysis timeout (set to 300 seconds by SpecBCFuzz). Although these clusters may not represent performance bugs, since the solvers can indeed output a satisfiability result in less than 24 hours, they constitute an interesting test-bed, which we classified by version and implemented algorithm, that can challenge the performance of the solvers and be used for regression testing purposes. We make these data available on our accompanying website.

RQ2: Importance of Fitness Objectives
We perform an ablation study to assess SpecBCFuzz's effectiveness with deactivated fitness objectives. In configuration (Sem+#BCs) we deactivate the syntactic similarity objective, making SpecBCFuzz guided by the semantic similarity and the number of boundary conditions that remain unsatisfiable (objective #BCs). In configuration (Syn+#BCs) we deactivate the semantic similarity objective, while in configuration (Syn+Sem) #BCs is deactivated, making SpecBCFuzz guided only by the similarity metrics.
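The ablation configurations amount to toggling entries of the objective vector handed to NSGA-III. The sketch below illustrates this; the callables (`syn_sim`, `sem_sim`, `bc_unresolved`) are assumed hooks, not SpecBCFuzz's real API:

```python
def fitness_vector(candidate, seed, bcs, syn_sim, sem_sim, bc_unresolved,
                   active=("Syn", "Sem", "#BCs")):
    # Build the objective vector; dropping a name from `active` reproduces
    # one of the ablation configurations above, e.g. ("Sem", "#BCs")
    # deactivates the syntactic similarity objective.
    objectives = []
    if "Syn" in active:
        objectives.append(syn_sim(candidate, seed))
    if "Sem" in active:
        objectives.append(sem_sim(candidate, seed))
    if "#BCs" in active:
        objectives.append(sum(1 for bc in bcs if bc_unresolved(candidate, bc)))
    return objectives
```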
Figure 4 summarizes the impact on our results of deactivating the fitness objectives one by one. We see that deactivating the syntactic or the semantic similarity drastically reduces the number of bugs detected. Configurations (Syn+#BCs) and (Sem+#BCs) can only detect the failures of Cluster 1 (crashes in all versions of NuSMV BMC) and Cluster 7 (syntax parsing failures in Black 0.5.2).
On the other hand, deactivating the #BCs objective in configuration (Syn+Sem) has a minor, yet still important, impact on SpecBCFuzz's effectiveness compared to deactivating the similarity metrics. (Syn+Sem) can trigger faults captured by Clusters 1, 3 and 4 (i.e., crashes and flaky behaviors in NuSMV BMC), 9, 10 and 11 (i.e., crashes and flaky behaviors in Aalta v1 and v2), and 15 (i.e., soundness issues in PLTL BDD with formulas of the form □¬f).
Overall, the combination of the three fitness objectives contributes significantly to SpecBCFuzz's effectiveness in revealing more diverse buggy symptoms in LTL solvers.

RQ3: Validation of SpecBCFuzz's Principles
Figure 5 shows that both principles of our approach are relevant to revealing faults in LTL solvers. We observe that, if no boundary condition is used when feeding the solvers, but we still search in the vicinity ({S} ∪ {S′}), we are only able to produce LTL formulas that trigger 2 kinds of failures, captured by Clusters 1 and 7. When instead boundary conditions are directly used, but no search is performed ({S ∧ bc}), we only find faults corresponding to Clusters 1 and 7.
It is worth highlighting that the large set of LTL benchmark formulas did not reveal any fault. We then developed a probabilistic grammar-based fuzzer by carefully setting probabilities and parameters for the grammar's production rules. Since this approach explores a broad spectrum of the search space (20,000 formulas, 20 times more than SpecBCFuzz), it can produce LTL formulas that trigger faults similar to 11 clusters from Table 4. However, we observe that for four clusters, the grammar-based fuzzer only manages to trigger such faults very rarely (3, 4, 5 and 7 times, respectively, out of the 10 runs we performed). Moreover, we also observe that for most of the clusters, this approach produces very few bug-triggering formulas relative to the numbers produced by SpecBCFuzz (10x to 100x fewer), except for the bugs in PLTL and the syntax parser crash in Black, for which it produces a similar amount of triggering formulas. Surprisingly, our fine-tuned grammar-based fuzzer can trigger a crash in PLTL BDD, with the error message "Segmentation fault (core dumped)", that is not triggered by SpecBCFuzz (i.e., the observed symptoms are not covered by the patterns in Table 4). Overall, while it shows better performance than SpecBCFuzz for testing the PLTL solver, it achieves limited performance when testing all the other solvers.
To conclude, our results suggest that the combination of LTL specifications with boundary conditions and vicinity search is an effective heuristic for producing bug-triggering LTL formulas.

THREATS TO VALIDITY
Threats to external validity concern the solvers and seeded formulas we used in our evaluation. We searched for LTL solvers supporting different heuristics, and we gathered formulas with divergences from the literature that are typically used for the evaluation of requirements analysis tools. Results may not generalize to other solvers, logical languages, and domains (for instance, when a language other than LTL is employed). To mitigate this threat, we also studied the application of SpecBCFuzz to other kinds of solvers that take LTL formulas as input, such as model checkers. We included Spin [41] (version 6.5.1) as one of the subjects under analysis, and SpecBCFuzz did not detect any fault. Threats to internal validity concern potential defects in our own implementation. To mitigate this risk, we rely on reliable third-party frameworks and libraries, such as the NSGA-III implementation of the JMetal framework [62]. Since SpecBCFuzz and the baselines are stochastic, we repeat our experiments 10 times, following research guidelines for how such algorithms should be evaluated [3].
Our assessment metric, namely the number of generated formulas that trigger a fault in a solver, is intuitive and reflects the effectiveness of SpecBCFuzz. Since our analysis is black-box, the identified bug-triggering formulas may relate to different bugs to fix in the solver. We therefore performed a follow-up analysis to cluster and classify the sets of formulas showing different faulty symptoms. Regarding our differential oracles, in addition to the re-runs, we performed extensive manual analyses to double-check the outcomes and confirm the identified faults. For correctness bugs, we applied a delta-debugging-like approach to obtain simpler sat-preserving formulas that confirm the original wrong satisfiability outcomes. Crashes were confirmed by simple reproduction, while flakiness bugs were confirmed by extensive re-runs of the likely flaky solvers (in our evaluation, all flaky behaviors were reproduced and confirmed). Performance bugs were detected by adopting a very conservative threshold (24 hours, or 300 times slower behavior compared with other solvers).
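The delta-debugging-like reduction of correctness-bug formulas can be sketched as a greedy descent into subformulas. This is a simplified illustration under our own assumptions (`ast` as nested tuples, `still_buggy` re-running the solver and checking the symptom), not the paper's exact procedure:

```python
def reduce_formula(ast, still_buggy):
    # Repeatedly try replacing the formula with one of its direct
    # subformulas while the wrong satisfiability verdict persists,
    # yielding a smaller formula with the same symptom. `ast` is a
    # nested tuple such as ("U", ("G", "p"), "q").
    changed = True
    while changed:
        changed = False
        for child in (ast[1:] if isinstance(ast, tuple) else ()):
            if still_buggy(child):
                ast = child  # keep the smaller bug-triggering formula
                changed = True
                break
    return ast
```

A real reducer must only accept reductions that preserve the expected satisfiability, so that the simplified formula still witnesses the wrong outcome.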

RELATED WORK
Fuzzing has been used for testing solvers. Brummayer et al. [14] presented black-box grammar-based fuzzing for propositional (SAT) and quantified Boolean formula (QBF) solvers. Among the various fuzzers for testing solvers, CNFuzz generates random CNF formulas based on a given CNF grammar and a set of parameters, e.g., maximum layers, width, and variables. FuzzSAT generates random formulas in the form of random Boolean circuits (RBCs), a kind of directed acyclic graph. QBFuzz works similarly, but produces QBF formulas. Their empirical evaluation shows the corresponding capabilities for finding bugs in SAT and QBF solvers.
SMT solvers have been the target of several testing approaches. StringFuzz implements string generators and transformers to increase the complexity of string constraints [11], by replacing literals and operators and swapping non-leaf nodes with leaf nodes. StringFuzz can also include a seed-based strategy, fed with realistic regular expressions, to improve the fuzzing. Bugariu and Müller aimed at generating more complex string operations [15] that are sat/unsat by construction and thus serve as test oracles. Other approaches, like STORM, also adopt seed-based and mutation fuzzing techniques [57]. SpecBCFuzz also implements string transformations, i.e., mutation and crossover operators, to evolve LTL formulas. It uses requirements specifications with divergences as seeds to improve fault detection in LTL solvers. Other works have used evolutionary algorithms to produce LTL formulas with different goals, e.g., for specification repair [13,17] and, relevant to this paper, to identify boundary conditions [28].
The recently developed HistFuzz [72] proposes to use seed-based buggy skeletons crafted from historical bug reports. Furthermore, DIVER [44] takes an original satisfiable formula for which a model is built, applies unrestricted mutations during the search, and uses the model as an oracle (if the mutated formula is consistent with the model, then it should be satisfiable). These approaches were found effective for finding bugs in popular SMT solvers such as Z3 [25], CVC5 [4], and dReal [35]. SpecBCFuzz is also based on semantic properties and assumes that LTL specifications with divergences, i.e., conflicting specifications in the context of goal-oriented requirements, are likely to trigger faults in LTL solvers.
Schuppan and Darmawan [68] conducted an empirical study to assess the performance of LTL solvers, based on a large benchmark of 3,723 LTL formulas of different complexities (which we used in RQ3) to challenge the solvers. To conclude, our work is the first approach that fuzzes LTL solvers and reveals crashes, correctness bugs, flakiness issues, and performance bugs. SpecBCFuzz was shown to be effective in triggering different kinds of bugs in the latest and past versions of very efficient LTL solvers.

CONCLUSION
We presented SpecBCFuzz, a fuzzing method that combines boundary conditions and vicinity exploration for testing LTL solvers. We showed that SpecBCFuzz is effective in producing formulas (368,716) that trigger different kinds of bugs (soundness issues, crashes, flakiness, and performance issues) in 18 out of the 21 studied solver configurations. In our evaluation, SpecBCFuzz did not trigger any bug in the 3 versions of NuSMV BDD, suggesting that this is currently the most robust LTL solver.

ACKNOWLEDGMENTS
This work is supported by the Luxembourg National Research Funds (FNR) through the CORE project grant C19/IS/13646587/RASoRS. Nazareno Aguirre is also supported by ANPCyT PICTs 2019-2050 and 2020-2896, an Amazon Research Award, and by EU's Marie Sklodowska-Curie grant No. 101008233 (MISSION).

Figure 1 :
Figure 1: Exploring the vicinity of the divergences.

Figure 5 :
Figure 5: Divergences and Vicinity Exploration Matters

Table 2 :
Solvers and versions used.

Table 3 :
Number of soundness, crash, flaky, and performance bug-triggering formulas generated by SpecBCFuzz. The same bug-triggering formula can induce the same or different failures in the solvers. We then perform a more in-depth analysis by clustering the failures according to input/output relations, i.e., between the bug-triggering formulas and the symptoms shown by the solvers' executions. Table 4 summarizes 16 clusters capturing the different buggy symptoms we identified for each solver; later on, we discuss symptoms potentially revealing further performance bugs.

Table 4 :
Cluster of the observed symptoms.