Synthesizing Specifications

Every program should be accompanied by a specification that describes important aspects of the code's behavior, but writing good specifications is often harder than writing the code itself. This paper addresses the problem of synthesizing specifications automatically, guided by user-supplied inputs of two kinds: (i) a query posed about a set of function definitions, and (ii) a domain-specific language L in which the extracted properties are to be expressed (we call properties in the language L-properties). Each synthesized property is a best L-property for the query: there is no other L-property that is strictly more precise. Furthermore, the set of synthesized L-properties is exhaustive: no more L-properties can be added to it to make the conjunction more precise. We implemented our method in a tool, Spyro. The ability to modify both the query and L provides a Spyro user with ways to customize the kind of specification to be synthesized. We use this ability to show that Spyro can be used in a variety of applications, such as mining program specifications, performing abstract-domain operations, and synthesizing algebraic properties of program modules.


INTRODUCTION
Specifications help us understand how code behaves. They also have many uses in testing, verifying, repairing, and synthesizing code. Because programmers iteratively refine their code to meet a desired intent (that often changes along the way), writing and maintaining specifications is often harder than writing and maintaining the code itself. A number of approaches have been proposed for automatically generating specifications, but these approaches are restricted to certain types of specifications and limited types of properties, and are based on dynamic testing; i.e., they yield likely specifications that, though correct on the observed test cases, might be unsound in general.
In this paper, we present the first customizable framework for synthesizing provably sound, most-precise (i.e., "best") specifications from a given set of function definitions. Our framework can be used to mine specifications from code, but also to enable several applications where obtaining precise specifications is crucial, e.g., generating algebraic specifications for modular program synthesis [Mariano et al. 2019], automating sensitivity analysis of programs [D'Antoni et al. 2013], and enabling abstract interpretation for new abstract domains [Yao et al. 2021]. The engine/primitive that drives the framework is an algorithm for the following problem: Given a query Φ posed
[Footnote 1] As explained in §2, the framework requires a definition of the semantics of the function symbols that appear in query Φ (e.g., push, pop, reverse, etc.) and DSL L. The specification obtained with our framework is sound with respect to the supplied semantics, but our implementation sometimes uses bounded or approximate semantics.
[Footnote 2] Readers familiar with symbolic methods for abstract interpretation [Reps et al. 2004] will recognize that our problem is an instance of the strongest-consequence problem. Given a formula Φ in logic $L_1$ (with meaning function $\llbracket\cdot\rrbracket_1$), find the strongest formula Ψ in a different logic $L_2$ (with meaning function $\llbracket\cdot\rrbracket_2$) such that $\llbracket\Phi\rrbracket_1 \subseteq \llbracket\Psi\rrbracket_2$.

Our Framework. The core algorithm is a counterexample-guided inductive synthesis (CEGIS) loop that handles some negative-example classifications as "maybe" constraints, and guarantees progress via monotonic constraint hardening until an L-property is found that is both sound and precise. By repeatedly calling the core algorithm to synthesize incomparable L-properties, a most-precise L-conjunction is created.
The core algorithm relies on three simple primitives. Synthesize: synthesizes an L-property that accepts a set of positive examples and rejects a set of negative examples; it returns ⊥ if no such property exists. CheckSoundness: checks if the current L-property is sound; if it is not, CheckSoundness returns a new positive example that the property fails to accept. CheckPrecision: checks if the current L-property is precise; if it is not, CheckPrecision returns a new L-property that accepts all positive examples, rejects all negative examples, and rejects one more negative example (which is also returned). Our current implementation of each primitive relies on satisfiability modulo theories (SMT) solvers, limiting the scope of our framework. Nevertheless, as long as one has implementations for such primitives, the algorithm is sound. When the DSL L is finite, the algorithm is also complete.
Contributions. Our work makes the following contributions:
• A formal framework for the problem of synthesizing best L-conjunctions (§2).
• A tool that we implemented to support our framework, called spyro. There are two instantiations of spyro: spyro[sketch] and spyro[smt], which have different capabilities (see §4).
• An evaluation of spyro on a variety of benchmarks, showcasing four different applications of spyro (§5): mining specifications [Lo et al. 2017], generating algebraic specifications for modular program synthesis [Mariano et al. 2019], automating sensitivity analysis of programs [D'Antoni et al. 2013], and enabling abstract interpretation for new abstract domains [Yao et al. 2021].
§6 discusses related work. §7 concludes. In the extended paper [Park et al. 2023a], §A contains proofs; §B contains implementation details; and §C contains further details about the evaluation.

PROBLEM DEFINITION
In this section, we define the problem addressed by our framework. Throughout the paper, we use a running example in which the goal is to synthesize interesting consequences of the following query, which allows obtaining properties of up to two calls of the list-reversal function.
$o_1 = \mathit{reverse}(l_1) \wedge o_2 = \mathit{reverse}(l_2)$ (1)
In particular, we are interested in identifying properties that are consequences of query formula (1) and are expressible in the DSL defined by the following grammar $L_{list}$:
An L-property is a property expressible in a DSL L. We say that an L-property is sound if it is a consequence of (i.e., implied by) the given query formula Φ. The goal of our framework is to synthesize a set of incomparable sound and most-precise L-properties (i.e., a conjunctive specification), not just any L-properties. For example, the $L_{list}$-property $len(o_1) \le len(l_1)$ is sound but not most-precise ($len(o_1) = len(l_1)$ is a more precise sound $L_{list}$-property).
In the rest of this section, we describe what a user of the framework has to provide to solve this problem, and what they obtain as output. The user needs to provide the following inputs:
Input 1: Query. The query formula consists of a finite set of atomic formulae $\Phi = \{\sigma_1, \ldots, \sigma_n\}$ (denoting their conjunction). Each atomic formula $\sigma_i$ is of the form $o_i = f_i(x_i^1, \ldots, x_i^{k_i})$, where $o_i$ is an output variable, each $x_i^j$ is an input variable, and $f_i$ is a function symbol. In our running example, the query is given in Eq. (1).
Input 2: Grammar of L-properties. The grammar of the DSL L in which the synthesizer is to express properties. In our example, the DSL $L_{list}$ is defined in Eq. (2).
Input 3: Semantics of function symbols. A specification of the semantics of the function symbols that appear in query Φ (e.g., reverse) and in the DSL L (e.g., len).
We assume semantic definitions are given in (or can be translated to) formulas in some logic fragment. For pragmatic reasons, in our implementation semantic definitions are given as code that is then automatically transformed into first-order formulas. For instance, the semantics of reverse is given as a program in the sketch programming language [Solar-Lezama 2013], from which we automatically extract the following bounded semantics $\varphi^k_{reverse}(l, o)$ for a given bound $k > 0$:
\[
\begin{aligned}
\varphi^{0}_{reverse}(l, o) &:= \bot \\
\varphi^{k}_{reverse}(l, o) &:= [\varphi_{eq}(l, []) \Rightarrow \varphi_{eq}(o, [])] \wedge \exists l_h, l_t, o'\, [\varphi_{eq}(l, l_h :: l_t) \Rightarrow \varphi^{k-1}_{reverse}(l_t, o') \wedge \varphi_{snoc}(o', l_h, o)] \quad \text{if } k > 0
\end{aligned}
\tag{3}
\]
We discuss this limitation (i.e., that the semantics is bounded) and how we mitigate it in Section 5.
Let $V_\Phi$ be the set of all variables in Φ. We use $\varphi_\Phi$ to denote the formula that exactly characterizes the space of valid models over the variables $V_\Phi$ in Φ. For example, let $\varphi_{reverse}(l, o)$ be the formula that exactly characterizes the result of reversing a list $l$ and storing the result in $o$; e.g., $(l, o) = ([1, 2], [2, 1])$ is a valid model of $\varphi_{reverse}$. We use $\llbracket\varphi\rrbracket$ to denote the set of models of a formula $\varphi$. Then, in our example, $\varphi_\Phi(l_1, o_1, l_2, o_2) = \varphi_{reverse}(l_1, o_1) \wedge \varphi_{reverse}(l_2, o_2)$. Henceforth, we omit variables in formulas when no confusion should result, and merely write $\varphi_\Phi$.
Output: Best L-properties. The goal of our method is to synthesize a set of incomparable sound and most-precise L-properties that are consequences of query Φ. Ideally, the best L-property would be one that exactly describes $\varphi_\Phi$, but in general the language L might not be expressive enough to do so. We argue that this feature is actually a desirable one! The customizability of our approach via a DSL is what allows our work to focus on identifying small and readable properties (rather than complex first-order formulas), and to apply our method to different use cases (see §5).
Because in general there might not be an L-property that is equivalent to $\varphi_\Phi$, the goal becomes instead to find L-properties that tightly approximate $\varphi_\Phi$.
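As an executable analogy for the bounded semantics of Eq. (3), the following Python sketch unrolls a recursive reverse at most k times and yields no result once the bound is exhausted. The names `reverse_bounded` and `snoc` are illustrative assumptions; the sketch does not reproduce how the sketch language encodes semantics.

```python
from typing import List, Optional

def snoc(xs: List[int], x: int) -> List[int]:
    """Append x at the end of xs."""
    return xs + [x]

def reverse_bounded(l: List[int], k: int) -> Optional[List[int]]:
    """Reverse l with at most k unrollings; None plays the role of the bound-0 case."""
    if k == 0:
        return None                      # bound exhausted: no output is defined
    if not l:
        return []                        # eq(l, []) implies eq(o, [])
    head, tail = l[0], l[1:]
    rest = reverse_bounded(tail, k - 1)  # one fewer unrolling for the tail
    if rest is None:
        return None
    return snoc(rest, head)              # o = snoc(reverse(tail), head)

print(reverse_bounded([1, 2, 3], 4))     # enough unrollings for a list of length 3
print(reverse_bounded([1, 2, 3], 3))     # bound too small for this input
```

In this sketch a list of length n needs a bound of at least n + 1, which is why properties established under a bound only cover inputs up to a corresponding size.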
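Definition 2.1 (A best L-property). An L-property $\varphi$ is a best L-property for a query Φ if and only if (i) $\varphi$ is sound with respect to Φ: $\llbracket\varphi_\Phi\rrbracket \subseteq \llbracket\varphi\rrbracket$; and (ii) $\varphi$ is precise with respect to Φ and L: $\neg\exists \varphi' \in L.\ \llbracket\varphi_\Phi\rrbracket \subseteq \llbracket\varphi'\rrbracket \subset \llbracket\varphi\rrbracket$. We use $\mathcal{P}(\Phi)$ to denote the set of all best L-properties for Φ.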
When we refer to "a sound L-property," soundness is always relative to some $\varphi_\Phi$. Strictly speaking, we should say "a $\varphi_\Phi$-sound L-property," but $\varphi_\Phi$ should always be clear from context.
A best L-property is a strongest consequence of $\varphi_\Phi$ that is expressible in L. Because L is constrained, there may be multiple, incomparable L-properties that are all strongest consequences of $\varphi_\Phi$; thus, we speak of a best L-property and not the best L-property. In our running example, $len(o_1) = len(l_1)$ is a best L-property and so is $\neg eq(o_1, l_2) \vee eq(l_1, o_2)$. (Stated as an implication: $eq(o_1, l_2) \Rightarrow eq(l_1, o_2)$.) The former states that the sizes of the input and output of reverse are the same, while the latter states that applying reverse twice to a list yields the same list.
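The distinction between a sound property and a more precise one can be checked by brute force on small lists. The Python sketch below is only a bounded sanity check under assumed names (`same_length`, `at_most_length`, `reverse_twice`), not the SMT-based checks used by the framework; models are tuples (l1, o1, l2, o2).

```python
from itertools import product

def lists_up_to(max_len, values=(0, 1)):
    """All lists over `values` with length at most max_len."""
    for n in range(max_len + 1):
        for tup in product(values, repeat=n):
            yield list(tup)

def positive_examples(max_len=3):
    """Models of o1 = reverse(l1) and o2 = reverse(l2)."""
    for l1 in lists_up_to(max_len):
        for l2 in lists_up_to(max_len):
            yield l1, l1[::-1], l2, l2[::-1]

def same_length(l1, o1, l2, o2):
    return len(o1) == len(l1)

def at_most_length(l1, o1, l2, o2):
    return len(o1) <= len(l1)

def reverse_twice(l1, o1, l2, o2):
    return (o1 != l2) or (l1 == o2)      # eq(o1, l2) => eq(l1, o2)

def is_sound(prop, max_len=3):
    """Does prop accept every (bounded) positive example?"""
    return all(prop(*e) for e in positive_examples(max_len))

def strictly_more_precise(p, q, max_len=2):
    """p implies q on all bounded models, and q accepts a model that p rejects."""
    models = list(product(list(lists_up_to(max_len)), repeat=4))
    implies = all(q(*m) for m in models if p(*m))
    strict = any(q(*m) and not p(*m) for m in models)
    return implies and strict

print(is_sound(same_length), is_sound(at_most_length), is_sound(reverse_twice))
print(strictly_more_precise(same_length, at_most_length))
```

Both length properties are sound on the bounded domain, but the equality rejects strictly more models, which is the sense in which it is more precise.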
The goal of this paper is to find semantically minimal sets of incomparable best L-properties.
While there can be multiple best L-conjunctions, they are all logically equivalent and they are all equivalent to a strongest L-conjunction. However, note that a strongest L-conjunction is not necessarily a best L-conjunction; it could contain L-properties that are not best and could potentially have repeated or redundant L-properties.
Theorem 2.1. If $\varphi_\wedge$ is a best L-conjunction, then its interpretation coincides with the conjunction of all possible best properties: $\llbracket\varphi_\wedge\rrbracket = \llbracket\bigwedge_{\varphi \in \mathcal{P}(\Phi)} \varphi\rrbracket$.
We are now ready to state our problem definition:
Definition 2.3 (Problem definition). Given query Φ, the concrete semantics $\varphi_\Phi$ for the function symbols in Φ, and a domain-specific language L with its corresponding semantic definition, synthesize a best L-conjunction for Φ.
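For a finite DSL over a finite example domain, the statement of Theorem 2.1 can be observed directly by enumeration. The Python sketch below uses a hypothetical query o = abs(x) and a made-up DSL of atoms and two-atom disjunctions (neither is taken from the benchmarks); it computes all sound properties, keeps the best ones, and compares the two conjunctions.

```python
from itertools import combinations

DOMAIN = range(-3, 4)
MODELS = [(x, o) for x in DOMAIN for o in DOMAIN]
POSITIVE = {(x, o) for (x, o) in MODELS if o == abs(x)}   # models of the query

ATOMS = {                       # hypothetical atoms, for illustration only
    "o >= x": lambda x, o: o >= x,
    "o >= -x": lambda x, o: o >= -x,
    "o == x": lambda x, o: o == x,
    "o == -x": lambda x, o: o == -x,
    "o >= 0": lambda x, o: o >= 0,
    "x >= 0": lambda x, o: x >= 0,
    "x <= 0": lambda x, o: x <= 0,
}

def denotation(pred):
    """The set of models accepted by a predicate."""
    return frozenset(m for m in MODELS if pred(*m))

def dsl():
    """Yield (name, denotation) for every atom and every two-atom disjunction."""
    den = {name: denotation(p) for name, p in ATOMS.items()}
    yield from den.items()
    for (n1, d1), (n2, d2) in combinations(den.items(), 2):
        yield f"{n1} or {n2}", d1 | d2

PROPS = dict(dsl())
SOUND = {n: d for n, d in PROPS.items() if POSITIVE <= d}
# A sound property is best if no sound property has a strictly smaller denotation.
BEST = {n: d for n, d in SOUND.items()
        if not any(other < d for other in SOUND.values())}

def conjunction(dens):
    """Intersect a collection of denotations."""
    result = frozenset(MODELS)
    for d in dens:
        result &= d
    return result

print(sorted(BEST))
print(conjunction(BEST.values()) == conjunction(SOUND.values()))
```

In this toy setting the conjunction of the best properties also coincides with the positive examples themselves, because the DSL happens to be expressive enough; that is a feature of the example, not a general guarantee.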
As illustrated in Section 3, given query (1), the DSL in Eq. (2), and the semantic definitions of reverse, isEmpty, len, etc., our tool spyro synthesizes the set of L-properties shown below in Eq. (4), and establishes that the conjunction of these properties is a best L-conjunction. (For clarity, we write properties of the form $\neg a \vee b$ as $a \Rightarrow b$.)
Even though reverse is a simple function, its corresponding best L-conjunction (w.r.t. the DSL L) for query (1) is non-trivial. For example, our approach can discover properties involving single function calls (e.g., reverse behaves like the identity function on a list of length 0 or 1), but also hyperproperties, i.e., properties involving multiple calls to the same function. For example, the property $eq(o_1, l_2) \Rightarrow eq(l_1, o_2)$ states that applying the reverse function twice to an input returns the same input, while the property $eq(o_1, o_2) \Rightarrow eq(l_1, l_2)$ shows that reverse is injective! Moreover, because the user has control over the DSL L, they can change the language in which properties are to be expressed. In particular, if the formulas returned by spyro are too complicated for the user's taste, they can modify L and reinvoke spyro until they are satisfied with the results.
Depending on the DSL, a best L-conjunction may need to be an infinite formula.
Example 2.1 (Infinite L-conjunction). Consider again the running example, and assume we change the DSL to the one defined by the following grammar $L_{inf}$:
where "::" denotes the infix cons operator. There exists only one best $L_{inf}$-conjunction, which has an infinite number of conjuncts.
Our implementation focuses on DSLs for which this problem does not arise. Assumptions on DSLs are formally discussed in Section 3.
All the inputs to the framework are reusable. To synthesize a best L-conjunction for a different query Φ that still operates over lists, one only needs to supply the semantic definition of the functions in Φ, and (if needed) modify the variable names generated by the corresponding nonterminal of Eq. (2).
For example, for the function that takes a list and duplicates its entries, $o = stutter(l)$ (e.g., stutter([1, 2]) = [1, 1, 2, 2]), spyro synthesizes the following L-conjunction using the DSL $L_{list}$.
len()=len()+1 ∨ len()>1 ∨ isEmpty()
len()≤0 ∨ len()=1 ∨ len()>len()+1
len()≤0 ∨ len()>len()
Because our DSL does not contain multiplication by 2, spyro could not synthesize the property that states that the length of the output list is twice the length of the input list. If we modify the DSL $L_{list}$ to contain multiplication by 2 and the ability to describe when an element appears both in the input and output list, spyro successfully synthesizes the following new best L-properties:
The ability to modify the DSL empowers the user of spyro with ways to customize the type of properties they are interested in synthesizing. As we will show in Section 5, customizing the DSL also allows us to use spyro for different applications and case studies, e.g., synthesizing abstract transformers and algebraic properties of programs.
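The two kinds of facts that the extended DSL is meant to express can be spot-checked on small inputs. In the Python sketch below, `double_length` and `same_members` are our own illustrative readings of "length is doubled" and "elements appear in both lists"; they are assumptions, not the properties spyro actually reports.

```python
from itertools import product

def stutter(l):
    """Duplicate every entry: stutter([1, 2]) == [1, 1, 2, 2]."""
    return [x for x in l for _ in range(2)]

def small_lists(max_len=4, values=(0, 1, 2)):
    """All lists over `values` with length at most max_len."""
    for n in range(max_len + 1):
        for tup in product(values, repeat=n):
            yield list(tup)

def double_length(l, o):
    """Needs multiplication by 2, which the original DSL lacks."""
    return len(o) == 2 * len(l)

def same_members(l, o):
    """An element appears in the input exactly when it appears in the output."""
    return set(l) == set(o)

holds = all(double_length(l, stutter(l)) and same_members(l, stutter(l))
            for l in small_lists())
print(stutter([1, 2]), holds)
```

A bounded check of this kind can only refute a candidate property; establishing soundness for all inputs requires the solver-backed checks described in Section 3.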

AN ALGORITHM FOR SYNTHESIZING BEST L-CONJUNCTIONS
In this section, we present the main contribution of the paper: an algorithm for synthesizing a best L-conjunction. The algorithm synthesizes one most-precise L-property at a time. It keeps track of the L-properties it has synthesized and uses this information to synthesize a new most-precise L-property that is incomparable to all the ones synthesized so far.

Positive and Negative Examples
Given query Φ, the concrete semantics $\varphi_\Phi$ for the function symbols in Φ, and a domain-specific language L with its corresponding semantic definition, our algorithm synthesizes best L-properties and a best L-conjunction for Φ using an example-guided approach.
Definition 3.1 (Examples). Given a model $e$, we say that $e$ is a positive example if $e \in \llbracket\varphi_\Phi\rrbracket$ and a negative example otherwise.
Example 3.1. Given the query $o_1 = reverse(l_1)$, the model that assigns $l_1$ to the list [1, 2] and $o_1$ to the list [2, 1] is a positive example. For brevity, we use the notation ([1, 2], [2, 1]) to denote such an example. The following examples are negative ones: When considering the query from Eq. (1), which contains two calls on reverse, ]) is a positive example (where the values denote $l_1$, $o_1$, $l_2$, and $o_2$, respectively), while
Intuitively, a best L-property must accept all positive examples while also excluding as many negative examples as possible.
Positive examples can be treated as ground truth (i.e., a best L-property should always accept all positive examples), but negative examples are more subtle. First, there can be multiple, incomparable, best L-properties, each of which rejects a different set of negative examples. Second, there may be negative examples that no best L-property can reject; they are accepted by every best L-property.
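The second subtlety is easy to reproduce with a deliberately weak DSL. In the Python sketch below, the DSL can only compare lengths (a hypothetical restriction chosen for illustration), so a negative example whose output has the right length cannot be rejected by any sound property.

```python
from itertools import product

def is_positive(l, o):
    """(l, o) is a positive example of the query o = reverse(l)."""
    return o == l[::-1]

def small_lists(max_len=3, values=(1, 2)):
    for n in range(max_len + 1):
        for tup in product(values, repeat=n):
            yield list(tup)

# Hypothetical DSL that can only talk about lengths.
LENGTH_ONLY_DSL = {
    "len(o) == len(l)": lambda l, o: len(o) == len(l),
    "len(o) <= len(l)": lambda l, o: len(o) <= len(l),
    "len(o) >= len(l)": lambda l, o: len(o) >= len(l),
    "len(o) < len(l)": lambda l, o: len(o) < len(l),
}

def sound_properties():
    """Properties that accept every bounded positive example."""
    return {name: p for name, p in LENGTH_ONLY_DSL.items()
            if all(p(l, l[::-1]) for l in small_lists())}

def rejected_by_some_sound_property(l, o):
    return any(not p(l, o) for p in sound_properties().values())

print(rejected_by_some_sound_property([1, 2], [1]))      # wrong length
print(rejected_by_some_sound_property([1, 2], [1, 2]))   # right length, wrong order
```

The example ([1, 2], [1, 2]) is negative, yet every sound length-only property accepts it, so no best property in this DSL can reject it.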
In most cases, we use common datatypes (integers or lists) as the domain of our examples; however, more complicated definitions are sometimes required. If we are interested in queries involving binary search trees (BSTs), a tree datatype can describe the syntactic structure of the examples, but cannot capture BST invariants, e.g., for every node $n$, all values in the left (resp. right) subtree of $n$ must be ≤ (resp. ≥) $n$'s value. In our implementation, the set of valid BSTs is defined using the following sketch program, called a generator, that uses BST insertion operations to generate valid binary search trees:
    generate_BST() := if (??) then emptyBST() else insert(generate_BST(), ??)
The code is then transformed into a bounded formula similar to the one in Eq. (3), but where the holes (i.e., ??) at each recursive call are replaced with unknown variables. In this case, different values for the holes result in different BSTs. In the rest of the paper, we assume that examples are only drawn from their valid domains regardless of how this domain is expressed.
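The generator idea can be mimicked in Python by making the hole values explicit. In the sketch below, the tuple `choices` stands in for the integer holes (its length plays the role of the Boolean holes), trees are nested tuples, and duplicate keys are ignored; all of these are simplifying assumptions rather than the sketch encoding.

```python
from itertools import product

def insert(t, v):
    """Insert v into BST t, where t is None or a tuple (left, key, right)."""
    if t is None:
        return (None, v, None)
    left, key, right = t
    if v < key:
        return (insert(left, v), key, right)
    if v > key:
        return (left, key, insert(right, v))
    return t                               # assumption: duplicates are ignored

def generate_bst(choices):
    """Mirror generate_BST(): stop when the holes run out, else insert one value."""
    if not choices:
        return None                        # the Boolean hole chose emptyBST()
    return insert(generate_bst(choices[1:]), choices[0])

def is_bst(t, lo=float("-inf"), hi=float("inf")):
    """Check the ordering invariant (strict, since duplicates are ignored here)."""
    if t is None:
        return True
    left, key, right = t
    return lo < key < hi and is_bst(left, lo, key) and is_bst(right, key, hi)

def all_bsts(bound, values=(1, 2, 3)):
    """Every tree reachable with at most `bound` insertions of the given values."""
    seen = set()
    for depth in range(bound + 1):
        for choice in product(values, repeat=depth):
            seen.add(generate_bst(choice))
    return seen

trees = all_bsts(3)
print(len(trees), all(is_bst(t) for t in trees))
```

Because every tree is built through `insert`, the invariant holds by construction, which is the point of defining the example domain through a generator.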

Soundness and Precision
Now that we have established how examples relate to best L-properties, we can introduce the two key operations that make our algorithm work: CheckSoundness and CheckPrecision. These two operations are similar to the ones used by Kalita et al. [2022] to synthesize abstract transformers, and are used by the inductive-synthesis algorithm to determine whether an L-property is sound (i.e., a valid L-property) and precise (i.e., a best L-property). We modify the definition of CheckPrecision proposed by Kalita et al. [2022] to account for already synthesized best L-properties and thus facilitate the synthesis of distinct best L-properties.

Checking Soundness.
Given an L-property $\varphi$, CheckSoundness($\varphi$, $\varphi_\Phi$) checks whether $\varphi$ is an overapproximation of $\varphi_\Phi$. In other words, CheckSoundness checks if there exists a positive example $e^+ \in \llbracket\varphi_\Phi\rrbracket$ that is not accepted by $\varphi$; it returns that example if it exists, and ⊥ otherwise. The soundness check can be expressed as $\exists e^+.\ \neg\varphi(e^+) \wedge \varphi_\Phi(e^+)$.
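Over a bounded input domain, the soundness check reduces to a search for a counterexample. The Python sketch below replaces the SMT query with plain enumeration; `check_soundness`, `small_lists`, and the two sample properties are illustrative assumptions.

```python
from itertools import product

def small_lists(max_len=3, values=(1, 2)):
    """Bounded input domain: all lists over `values` up to max_len."""
    for n in range(max_len + 1):
        for tup in product(values, repeat=n):
            yield list(tup)

def reverse(l):
    return l[::-1]

def check_soundness(prop, semantics, inputs):
    """Return a positive example (l, o) that prop rejects, or None (the role of ⊥)."""
    for l in inputs:
        o = semantics(l)                 # (l, o) is a positive example by construction
        if not prop(l, o):
            return (l, o)
    return None

same_length = lambda l, o: len(l) == len(o)
identity = lambda l, o: l == o

print(check_soundness(same_length, reverse, small_lists()))
print(check_soundness(identity, reverse, small_lists()))
```

A returned example is exactly what the synthesis loop adds to its set of positive examples before asking for a new candidate.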
CheckPrecision can be thought of as a primitive that synthesizes a negative example and a formula that can reject such an example at the same time (or proves that the synthesis problem does not admit a solution). The formula $\varphi'$ is a witness that the negative example $e^-$ can be rejected by some L-property. In our algorithm, the set $\psi$ is used to ensure that the negative example produced by CheckPrecision is not already rejected by best L-properties we already synthesized.
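A brute-force reading of this primitive over a finite DSL and a finite example domain might look as follows. The query o = abs(x), the three-property DSL, and the argument order of `check_precision` are assumptions made for the sketch; the real primitive is a synthesis query, not an enumeration.

```python
DOMAIN = range(-2, 3)
MODELS = [(x, o) for x in DOMAIN for o in DOMAIN]

def is_positive(x, o):
    """Concrete semantics of the hypothetical query o = abs(x)."""
    return o == abs(x)

DSL = {                                   # tiny illustrative DSL
    "o >= 0": lambda x, o: o >= 0,
    "o >= x": lambda x, o: o >= x,
    "true": lambda x, o: True,
}

def check_precision(phi, psi, e_pos, e_neg):
    """Find a negative example e in psi that phi accepts, together with a DSL
    property that accepts e_pos and rejects e_neg plus e; else (None, None)."""
    for e in MODELS:
        if is_positive(*e) or not psi(*e) or not phi(*e):
            continue
        for name, cand in DSL.items():
            accepts_pos = all(cand(*p) for p in e_pos)
            rejects_neg = not any(cand(*n) for n in e_neg + [e])
            if accepts_pos and rejects_neg:
                return e, name
    return None, None

anywhere = lambda x, o: True
e_minus, witness = check_precision(DSL["true"], anywhere, [(-1, 1)], [])
print(e_minus, witness)
```

Passing a more restrictive `psi` narrows where the new negative example may come from, which is how already synthesized properties are taken into account.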
A key property that we exploit in our algorithm is that, under some assumptions on the language L, once a sound L-property is found, there must exist a best L-property that implies it. This property allows searching for sound L-properties in a monotonic manner. Once a sound L-property is found, the search space can be narrowed down to a smaller set that includes all L-properties that are more precise than the one already found. We will show in Theorem 3.4 that this narrowing is guaranteed to be finite under some assumptions about the language L.
We say that a relation ⪯ on a set $X$ is a well-quasi order if ⪯ is a reflexive and transitive relation such that any infinite sequence of elements $x_0, x_1, x_2, \ldots$ from $X$ contains an increasing pair $x_i \preceq x_j$ with $i < j$. If the consequence relation ⇒ for language L is a well-quasi order, we have that any descending sequence of sound L-properties cannot be infinite; i.e., for any L-property $\varphi$, there exist only finitely many L-properties that imply $\varphi$. Moreover, a well-quasi order has no infinite anti-chains (i.e., there are no infinite sequences of pairwise incomparable elements). Clearly, if L is finite, then ⇒ is a well-quasi order, but finiteness is not a necessary condition. For example, consider the absolute-value function $o = abs(x)$, and a grammar that defines properties of the form $-20 \le x \le 10 \Rightarrow o \le N$ (for any natural number $N$). The set of properties is infinite, but ⇒ is a well-quasi order on the set of sound properties of this grammar; i.e., for any concrete value of $N$, it is only possible to decrease the value of $N$ and strengthen the property a finite number of times.
Lemma 3.1 (Monotonicity of Sound L-properties). If ⇒ is a well-quasi order on the set of L-properties, for every sound L-property $\varphi$, there exists a best L-property $\varphi'$ such that $\varphi' \Rightarrow \varphi$.
Consequently, if $\varphi$ is a sound L-property that rejects a set of negative examples $E^-$, there must exist a best L-property that also rejects the examples in $E^-$. This property lets us infer when a set of negative examples must be rejected by a best L-property (i.e., after a sound property is found).
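The absolute-value family above gives a small, concrete instance of this monotone strengthening. The Python sketch below checks soundness of "-20 <= x <= 10 implies abs(x) <= n" on a bounded range and decreases n while the property stays sound; the function names and the sampled range are assumptions of the sketch.

```python
def is_sound(n, xs=range(-100, 101)):
    """Is  -20 <= x <= 10  =>  abs(x) <= n  true for every sampled x?"""
    return all(abs(x) <= n for x in xs if -20 <= x <= 10)

def strengthen_until_best(n):
    """Starting from a sound bound n, keep decreasing it while soundness holds.
    Returns the finite chain of sound bounds that was visited."""
    chain = [n]
    while n > 0 and is_sound(n - 1):
        n -= 1
        chain.append(n)
    return chain

print(strengthen_until_best(25))
```

Starting from any sound bound, the chain is finite and ends at the bound 20, which corresponds to the best property in this family.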

Synthesizing One Most Precise L-property
We are now ready to describe the method used to synthesize an individual best L-property (Algorithm 1). The procedure SynthesizeStrongestConjunct takes as input
Even when each member of a set of negative examples can be rejected (individually) by a sound L-property, there may not exist a single sound L-property that rejects all members of the set. For example, the negative examples 1a cannot both be rejected by a single best L-property.
Preserved Invariants.We describe SynthesizeStrongestConjunct and the invariants it maintains.
Invariant 1: At the beginning of each loop iteration, the L-property $\varphi$ accepts all the examples in the current set $E^+$ and rejects all the examples in the current set $E^-_{must} \cup E^-_{may}$.
In each iteration, SynthesizeStrongestConjunct checks if the current property is sound using CheckSoundness (line 3). If the property is sound, it is then checked for precision using CheckPrecision (line 14). The algorithm terminates once the property is sound and precise (line 19).
Moving the examples in $E^-_{may}$ into $E^-_{must}$ (line 12) when a sound L-property is found is one of the contributions of our algorithm. While the algorithm is sound even without line 12, this step prevents the algorithm from often oscillating between multiple best L-properties throughout its execution. We found that this optimization gives a 3.06% speedup to the algorithm (see §5.5).
We say that a sound L-property $\varphi$ is precise for $\varphi_\Phi$ with respect to $\psi$ if there does not exist a negative example $e^- \in \llbracket\psi\rrbracket$ and an L-property $\varphi'$ such that $\varphi_\Phi \Rightarrow \varphi' \Rightarrow \varphi$ and $\varphi'$ rejects $e^-$, whereas $\varphi$ accepts $e^-$. The following lemma characterizes the behavior of SynthesizeStrongestConjunct.
Lemma 3.2 (Soundness and Relative Precision of SynthesizeStrongestConjunct). If SynthesizeStrongestConjunct terminates, it returns a sound L-property $\varphi$ that accepts all the examples in $E^+$, rejects all the examples in $E^-_{must} \cup E^-_{may}$, and is precise for $\varphi_\Phi$ with respect to $\psi$.
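The loop described in this section can be prototyped over a finite DSL and a finite example domain. The Python sketch below is our reading of the prose, with enumeration standing in for the solver-backed primitives; the query o = abs(x), the DSL, and all function names are assumptions, and the initial property is assumed to be sound.

```python
from itertools import combinations

DOMAIN = range(-3, 4)
MODELS = [(x, o) for x in DOMAIN for o in DOMAIN]

def phi_query(x, o):                      # concrete semantics of o = abs(x)
    return o == abs(x)

ATOMS = {
    "o >= x": lambda x, o: o >= x, "o >= -x": lambda x, o: o >= -x,
    "o >= 0": lambda x, o: o >= 0, "o == x": lambda x, o: o == x,
    "o == -x": lambda x, o: o == -x,
}

def build_dsl():
    """Hypothetical DSL: true, the atoms, and all two-atom disjunctions."""
    dsl = {"true": lambda x, o: True}
    dsl.update(ATOMS)
    for (n1, p1), (n2, p2) in combinations(ATOMS.items(), 2):
        dsl[f"{n1} or {n2}"] = lambda x, o, p1=p1, p2=p2: p1(x, o) or p2(x, o)
    return dsl

DSL = build_dsl()

def synthesize(e_pos, e_neg):
    """Name of a property accepting e_pos and rejecting e_neg, or None."""
    for name, p in DSL.items():
        if all(p(*e) for e in e_pos) and not any(p(*e) for e in e_neg):
            return name
    return None

def check_soundness(name):
    """A positive example rejected by the property, or None."""
    return next((e for e in MODELS if phi_query(*e) and not DSL[name](*e)), None)

def check_precision(name, psi, e_pos, e_neg):
    """A negative example in psi accepted by the property, plus a witness."""
    for e in MODELS:
        if phi_query(*e) or not psi(*e) or not DSL[name](*e):
            continue
        cand = synthesize(e_pos, e_neg + [e])
        if cand is not None:
            return e, cand
    return None, None

def synthesize_strongest_conjunct(psi, init, e_pos, e_must):
    phi, phi_last, e_may = init, init, []
    while True:
        e = check_soundness(phi)
        if e is not None:                 # unsound: learn a positive example
            e_pos = e_pos + [e]
            cand = synthesize(e_pos, e_must + e_may)
            if cand is not None:
                phi = cand
            else:                         # the "maybe" example cannot be rejected
                phi, e_may = phi_last, []
        else:                             # sound: harden "maybe" into "must"
            e_must, e_may = e_must + e_may, []
            phi_last = phi
            e_neg, cand = check_precision(phi, psi, e_pos, e_must)
            if e_neg is None:
                return phi, e_pos, e_must
            e_may, phi = [e_neg], cand

result = synthesize_strongest_conjunct(lambda x, o: True, "true", [], [])
print(result[0])
```

Each iteration either adds a positive example or hardens a negative one, so on this finite domain the loop terminates; which best property is returned depends on the enumeration order of the DSL.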

Synthesizing a Most-Precise L-conjunction
In this section, we present SynthesizeStrongestConjunction (Algorithm 2), which uses SynthesizeStrongestConjunct to synthesize a best L-conjunction of L-properties.
On each iteration, SynthesizeStrongestConjunction maintains a conjunction of best L-properties $\varphi_\wedge$, and uses SynthesizeStrongestConjunct to synthesize a best L-property that rejects some negative examples that are still accepted by $\varphi_\wedge$ (i.e., negative examples in $\llbracket\varphi_\wedge \wedge \neg\varphi_\Phi\rrbracket$). It also maintains the set of positive examples $E^+$ that have been observed so far.
Each iteration performs three steps: First, it uses SynthesizeStrongestConjunct to try to find an L-property $\varphi$ that rejects new negative examples $E^-_{must}$ that no L-property synthesized so far could reject, i.e., by calling SynthesizeStrongestConjunct with $\psi = \varphi_\wedge \wedge \neg\varphi_\Phi$ (line 5).
Second, it checks whether $\varphi$ rejects some example that was not rejected by $\varphi_\wedge$ (line 7). If it does not, the algorithm terminates, and returns the L-properties in Π synthesized so far. They are all best L-properties and their conjunction is a best L-conjunction.
Finally, if we reach line 12, we know that $\varphi$ rejects negative examples in $E^-_{must}$ that $\varphi_\wedge$ did not reject. Furthermore, because of the guarantees of SynthesizeStrongestConjunct, $\varphi$ is precise with respect to $\varphi_\wedge$; i.e., no sound L-property $\varphi'$ exists that can reject more negative examples in $\llbracket\varphi_\wedge\rrbracket$ than $\varphi$ could reject. However, there may be a more precise L-property that rejects more negative examples outside of $\llbracket\varphi_\wedge\rrbracket$ that $\varphi$ does not reject, while still rejecting all the negative examples in $E^-_{must}$. The call to SynthesizeStrongestConjunct in line 12 addresses this issue; it computes a best L-property starting from $\varphi$ and makes sure that the L-property obtained rejects everything in $E^-_{must}$ while allowing negative examples to be computed anywhere, i.e., $\psi = \top$. (Compare this call with the one on line 5, which only allows negative examples to be drawn from $\llbracket\varphi_\wedge\rrbracket$.) Because precision with respect to ⊤ implies actual precision, we have that when $\psi = \top$, if SynthesizeStrongestConjunct terminates, it returns a best L-property $\varphi$ for $\varphi_\Phi$ by Lemma 3.2.
Invariant 5: Π is a set of incomparable best L-properties.
We are now ready to show that SynthesizeStrongestConjunction is sound.
Note: In some settings, we might know a priori that certain L-properties hold, and we would not want to waste time synthesizing them. In such a situation, the formula $\varphi_\wedge$ in Algorithm 2 can be initialized to hold those properties, in which case Algorithm 2 would synthesize only best L-properties that are not subsumed by $\varphi_\wedge$. For example, consider synthesizing a specification for two calls of the list-reversal function, as shown in Section 2. We can initialize $\varphi_\wedge$ with trivial properties, such as $eq(l_1, l_2) \Rightarrow eq(o_1, o_2)$, that are true of every function definition. Furthermore, after synthesizing a property like $eq(l_2, o_1) \Rightarrow eq(o_2, l_1)$, we can also include the symmetric property $eq(l_1, o_2) \Rightarrow eq(o_1, l_2)$ in the conjunction. This approach enables us to effectively filter out redundant and trivial specifications during the synthesis process.

Completeness
We observe that in SynthesizeStrongestConjunct, because positive examples in $E^+$ are never removed, any property stronger than a property that fails CheckSoundness at line 3 is never considered again. Consequently, the sequence of unsound L-properties in an execution of SynthesizeStrongestConjunct is non-strengthening. Thus, if a non-strengthening sequence of unsound L-properties can only be finite, SynthesizeStrongestConjunct can only find finitely many unsound L-properties.
Another key observation about SynthesizeStrongestConjunct is that at line 14, if CheckPrecision($\varphi$, ...) returns a sound property $\varphi'$ with a negative example $e^-$, CheckSoundness will return ⊥ in the next iteration, and the negative example $e^-$ will be added to $E^-_{must}$ (from $E^-_{may}$ in line 12). Therefore, any property weaker than $\varphi$ (i.e., the property held on the current iteration) will never be considered during this execution of SynthesizeStrongestConjunct. (Recall from the definition of CheckPrecision($\varphi$, ...) that $e^-$ satisfies $\varphi$.) Thus, if a non-weakening sequence of sound L-properties can only be finite, SynthesizeStrongestConjunct can only find finitely many sound L-properties.
Based on the above two observations, Theorem 3.4 provides a sufficient condition for our algorithm to terminate when DSL L generates an infinite set of formulas.
Theorem 3.4 (Relative Completeness). Suppose that ⇒ is a well-quasi order on the set of sound L-properties. Let ⇐ denote the inverse of ⇒, and suppose that ⇐ is a well-quasi order on the set of unsound L-properties. If Synthesize, CheckSoundness, and CheckPrecision are decidable on L, then SynthesizeStrongestConjunct and SynthesizeStrongestConjunction always terminate.
Above, our argument about the second observation involved line 12 of SynthesizeStrongestConjunct. Nevertheless, Theorem 3.4 remains valid even if we eliminate line 12 from SynthesizeStrongestConjunct; i.e., line 12 is an optimization.
During each iteration, SynthesizeStrongestConjunct either adds a new positive example to $E^+$ or adds a new negative example to the current negative examples. As a result, the number of iterations is also limited by the size of the example domain.
Corollary 3.5. Suppose that either L contains finitely many formulas, or the example domain is finite. If Synthesize, CheckSoundness, and CheckPrecision are decidable on L, then SynthesizeStrongestConjunct and SynthesizeStrongestConjunction always terminate.

IMPLEMENTATION
We implemented our framework in a tool called spyro. Following §2, spyro takes the following inputs: (i) A query Φ for which spyro is to find a best L-conjunction. (ii) The context-free grammar of the DSL L in which properties are to be expressed. (iii) A specification, as a logical formula, of the concrete semantics of the function symbols in Φ and L.
Synthesize and CheckPrecision may be undecidable synthesis problems in general, but we show that these primitives can be implemented in practice using program-synthesis tools that are capable of both finding solutions to synthesis problems and establishing that a problem is unrealizable (i.e., it has no solution).
We implemented two versions of spyro: spyro[smt] supports problems in which semantics are definable as SMT formulas, and spyro[sketch] supports arbitrary problems but relies on the bounded/underapproximated encoding of program semantics of the sketch language [Solar-Lezama 2013]. For the current implementations of spyro[smt] and spyro[sketch], it is necessary to give the inputs in slightly different forms. In particular, input (iii) is provided to spyro[smt] in SMT-Lib format, whereas it is provided to spyro[sketch] as a piece of code in the sketch programming language.
In spyro[smt], CheckSoundness is just an SMT query, and Synthesize and CheckPrecision can be expressed as SyGuS problems. For the latter two primitives, spyro runs two SyGuS solvers in parallel and returns the result of whichever terminates first: (i) CVC5 (v. commit b500e9d) [Barbosa et al. 2022], which is optimized for finding solutions to SyGuS synthesis queries, and (ii) a reimplementation of the constraint-based unrealizability-checking technique from [Hu et al. 2020] that is specialized for finding whether the output of Synthesize and CheckPrecision is ⊥.
In spyro[sketch], Synthesize, CheckSoundness, and CheckPrecision are all implemented by calling the sketch synthesizer (v. 1.7.6) [Solar-Lezama 2013]. We describe how each primitive is encoded in sketch in Appendix B and how sketch's encoding affects soundness in Section 5.
Timeouts. We use a timeout threshold of 300 seconds for each call to Synthesize, CheckSoundness, and CheckPrecision. If any such call times out, SynthesizeStrongestConjunction returns the current L-conjunction, together with an indication that it might not be a best L-conjunction. (However, each of the individual conjuncts in the returned L-conjunction is a best L-property.)
Additional Tooling. In our evaluation, we used Dafny [Leino and Wüstholz 2014] to verify that the properties obtained by spyro[sketch] were sound for inputs beyond the bounds considered by sketch. Furthermore, for the SyGuS benchmarks only, we invoked CVC5 to verify whether the properties obtained by spyro[sketch] exactly characterized the function (which is a sufficient condition for an answer to be a "best" answer).

EVALUATION
We evaluated the effectiveness of spyro through four case studies: specification mining (§5.1), synthesis of algebraic specifications for modular synthesis (§5.2), automating sensitivity analysis (§5.3), and enabling new abstract domains (§5.4). For each case study, we describe how we collected the benchmarks, and present a quantitative analysis of the running time and effectiveness of spyro and a qualitative analysis of the synthesized L-conjunctions. In §5.5, we describe additional experiments to identify what parameters affect spyro's algorithm.
We ran all experiments on an Apple M1 8-core CPU with 8GB RAM. All results in this section are for the median run, selected from the results of three runs ranked by their overall synthesis time.

Application 1: Specification Mining
We considered a total of 45 general specification-mining problems to evaluate spyro: 7 syntaxguided synthesis (SyGuS) problems from the SyGuS competition [Alur et al. 2019], where the semantics of operations is expressed using SMT formulas; 24 type-directed synthesis problems from Synqid [Polikarpova et al. 2016], where the semantics of operations is expressed using sketch; and 14 problems we designed to cover missing interesting types of properties (11 had their semantics expressed using sketch and 3 had semantics expressed using SMT formulas).Cumulatively, we have 10 benchmarks for which the semantics of operations is expressed using SMT formulas, and 35 benchmarks for which the semantics of operations is expressed using sketch.For the SyGuS and Synqid benchmarks, we "inverted" the roles from the original benchmarks: given the reference implementation, spyro synthesized a specification.Each input problem consists of a set of functions (1 to 14 functions per problem, and the size of each function ranges from 1 to 30 lines of code per function).The largest problem contains 14 functions (8 list functions and 6 queue functions) and the file contains a total of 140 lines of code.Most functions are recursive and can call each other-e.g., dequeue calls reverse, which calls snoc, which calls cons.
For each set of similar benchmarks, we designed a DSL that contained operations that could describe interesting properties for the given set of problems. The construction of each DSL depended on syntactic information from the code: the number, types, and names of input and output variables, constants, and function symbols used in the code. We included operations that are commonly used in each category, such as equality, size primitives, and emptiness checking, but avoided problem-specific information. For benchmarks involving data structures with structural invariants (e.g., stacks, queues, and binary search trees), we provided data-structure constructors that guaranteed that functions were only invoked with data-structure instances that satisfied the invariants. The exact grammars are described in Appendix C.1. A DSL designed for a specific problem domain was often reused by modifying what function symbols could appear in the DSL. Overall, we created 7 distinct grammars for 14 different SyGuS and arithmetic problems; 10 grammars for 24 Synquid problems; and 3 grammars for 7 Stack and Queue problems. Although all but one of the DSLs are finite, they are still large languages: our finite DSLs can express between 4 thousand and 14.8 trillion properties, which makes the problem of synthesizing specifications challenging.
5.1.1 Quantitative Analysis, Part 1: Performance. spyro[smt] synthesized best L-conjunctions for 6/10 benchmarks for which the semantics was expressed using SMT formulas. It solved each of the successful examples in less than 6 minutes, and timed out on the remaining 4 benchmarks (max4, arrSearch3, abs with the grammar from Eq. (22), and hyperproperties of diff); spyro[smt] typically times out when the synthesis algorithm requires many examples.
Although we did not consider this option in our initial set of benchmarks, for the 4 benchmarks on which spyro[smt] failed, we also encoded the semantics of the function symbols in Φ and L using sketch. spyro[sketch] could synthesize properties for all 4 benchmarks, and guaranteed that 1/4 were best L-conjunctions (with respect to sketch's bounded semantics), but for the other 3 benchmarks (max4, arrSearch3, and diff) spyro[sketch] timed out on a call to CheckPrecision. However, the 3 L-conjunctions obtained by the time CheckPrecision timed out were indeed best L-conjunctions: although most-preciseness was not shown by spyro[sketch] within the timeout threshold, we found, using an SMT solver, that the L-conjunctions in hand on the synthesis round on which the timeout occurred defined the exact semantics of the functions of interest, which implies they were best L-conjunctions. The 1 problem for which spyro[sketch] established most-preciseness terminated within 5 minutes. For the other 3 problems, if we disregard the last iteration (the one on which most-preciseness of the L-conjunction was to be established), spyro[sketch] found a best L-conjunction within 10 minutes.
spyro[sketch] could synthesize properties for 35/35 benchmarks for which the semantics was expressed using sketch, and guaranteed that 34/35 were best L-conjunctions. It took less than 10 minutes to solve each List and Tree benchmark, except for the branch problem, for which spyro[sketch] took about 30 minutes to find the best L-conjunction but failed to show most-preciseness. It took less than 15 minutes to solve each Stack, Queue, and Integer-Arithmetic benchmark. For nonlinSum, spyro[sketch] was able to synthesize in 900 seconds a best L-conjunction from a grammar that contains ≈14.8 trillion properties (see Eq. 23).
As a baseline, we compared the running time of spyro to an estimate of the running time of an algorithm that enumerates all sound properties in L. For each benchmark, we estimated the cost of enumerating all terms in the grammar and checking their soundness by multiplying the size |L| of the language generated by the grammar by the average running time of each call to CheckSoundness observed when running spyro on the same benchmark. As shown in Table 1, while spyro demonstrated a small estimated speedup for smaller problems like emptyQueue (i.e., 3.5×, with |L| = 64), spyro was 2-5 orders of magnitude faster than the baseline for problems with large languages ($10^4 \le |L| \le 6 \cdot 10^7$) and 8-10 orders of magnitude faster than the baseline for the two problems with very large languages ($10^{10} \le |L| \le 1.5 \cdot 10^{13}$).
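The baseline estimate described above is a single multiplication, followed by a ratio for the speedup. A minimal Python sketch is given below; the function names and the sample numbers are illustrative assumptions and are not taken from Table 1.

```python
def estimated_enumeration_time(language_size: int, avg_check_soundness_s: float) -> float:
    """Estimated seconds needed to run CheckSoundness once per property in L."""
    return language_size * avg_check_soundness_s


def estimated_speedup(language_size: int,
                      avg_check_soundness_s: float,
                      observed_time_s: float) -> float:
    """Estimated enumeration time divided by the observed synthesis time."""
    return estimated_enumeration_time(language_size, avg_check_soundness_s) / observed_time_s


if __name__ == "__main__":
    # Hypothetical inputs: |L| = 10**6, 0.5 s per soundness check, 200 s of synthesis.
    print(estimated_speedup(10**6, 0.5, 200.0))
```

Because the estimate charges one soundness check per property and nothing else, it is a lower bound on what a real enumerator that also ranks properties by precision would cost.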
Together, spyro[sketch] and spyro[smt] synthesized properties for 41/45 benchmarks (45/45 if we consider the 4 benchmarks rewritten using a sketch semantics), and guaranteed that 40 were best L-conjunctions (44 if we consider the 4 benchmarks rewritten using a sketch semantics and our further analysis using an SMT solver).
5.1.2 Quantitative Analysis, Part 2: Soundness. To assess whether the properties synthesized by spyro[sketch] were indeed sound beyond the input bound considered by sketch, we used an external verifier: Dafny [Leino and Wüstholz 2014], a general-purpose semi-automatic verifier. Dafny successfully verified that 23/35 L-conjunctions synthesized by spyro[sketch] on non-SyGuS benchmarks were sound without any manual input from us. We could increase this number to 33/35 by providing invariants or some logical axioms to Dafny, e.g., $\forall l.\ \mathrm{len}(l) \ge 0$. Dafny failed to verify the properties synthesized from enqueue and reverse, which require a more expressive L to describe the order of elements.
5.1.3 Qualitative Analysis. Fig. 3 shows the properties synthesized by spyro on one of our three runs. In the SyGuS benchmarks, "[sketch]" denotes cases in which spyro[sketch] terminated with a semantics defined using sketch, but spyro[smt] did not terminate with a semantics defined using SMT formulas. Due to space constraints, we omit max4 and arrSearch3. They are similar to max3 and arrSearch2, respectively, but result in many properties.

SyGuS Benchmarks. The L-conjunctions synthesized by spyro[smt] and spyro[sketch] are more precise than or equivalent to the original specifications given in the SyGuS problems themselves. In fact, spyro found L-conjunctions that define the exact semantics of the given queries. Inspired by this equivalence, we attempted to use a SyGuS solver (CVC5) on the SyGuS benchmarks to synthesize an exact formula: we used a grammar of conjunctive properties (including the "and" operator, unlike the grammars used by spyro), and the specification was the semantics of the function. For 6/8 cases, CVC5 timed out, thus showing that our approach (of synthesizing one L-property at a time) is beneficial even in the artificial situation in which an oracle supplies the semantics of the best L-conjunction. Moreover, directly synthesizing an L-conjunction, as CVC5 attempts to do, can yield a set of conjuncts of which some are not most-precise L-properties.

Synquid Benchmarks. To evaluate the synthesized properties, we provided the synthesized L-conjunctions to Synquid and asked it to re-synthesize the reference implementation from which we extracted the properties. In 12/16 cases, Synquid could re-synthesize the reference implementation. In 4/16 cases (elemIndex, ith, reverse, and stutter), the synthesized properties were not precise enough to re-synthesize the reference implementation. For example, as stated at the end of §2, for the stutter benchmark our DSL did not contain multiplication by 2 and spyro could not synthesize a property stating that the length of the output list is twice the length of the input list. After modifying the DSL to contain multiplication by 2 and the ability to describe when an element appears in both the input and output lists, spyro successfully synthesized 7 properties in 154.71 seconds (see Eq. 6). From the augmented set of properties, Synquid could synthesize the reference implementation of stutter. This experiment shows how the ability to modify the DSL empowers the user of spyro with ways to customize the type of properties they are interested in synthesizing.

Other Benchmarks. A core property of Stack is the principle of Last-In First-Out (LIFO). spyro was able to synthesize a simple formula that captures LIFO by looking at the relationship between push and pop: given the query $s_{o1} = \mathrm{push}(s_1, x_1)$ and $(s_{o2}, x_{o2}) = \mathrm{pop}(s_2)$, spyro synthesized the properties $\mathrm{eq}(s_{o1}, s_2) \Rightarrow x_1 = x_{o2}$ and $\mathrm{eq}(s_{o1}, s_2) \Rightarrow \mathrm{eq}(s_1, s_{o2})$.
A Queue is a data structure whose formal behavior is somewhat hard to describe. Unlike Stack, the behavior of a Queue is not expressible by a simple combination of input and output variables. spyro could synthesize formulas describing the behavior of each Queue operation when provided with a conversion function from a Queue consisting of two Lists into a List. For the query $(q_o, x_o) = \mathrm{dequeue}(q)$, spyro synthesized the property $\mathrm{eq}(\mathrm{toList}(q), \mathrm{cons}(x_o, \mathrm{toList}(q_o)))$.
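To make the Stack and Queue properties concrete, the following Python sketch checks them by bounded enumeration on a toy implementation. Everything here is an assumption made for illustration: the tuple-based stack, the two-list queue layout, and all function names are ours rather than the benchmark code, and bounded testing gives far weaker assurance than spyro's CheckSoundness.

```python
from itertools import product

# Assumption of this sketch: a stack is a tuple whose last element is the top.
def push(s, x):
    return s + (x,)

def pop(s):
    return s[:-1], s[-1]

# Assumption of this sketch: a queue is a pair (front, back) of tuples,
# where `back` holds the most recently enqueued elements in reverse order.
def dequeue(q):
    front, back = q
    if not front:  # refill the front from the reversed back list
        front, back = tuple(reversed(back)), ()
    return (front[1:], back), front[0]

def to_list(q):
    front, back = q
    return front + tuple(reversed(back))

def small_lists(max_len=3, values=(0, 1)):
    return [t for n in range(max_len + 1) for t in product(values, repeat=n)]

def check_stack_lifo():
    """eq(s_out1, s2) implies x1 == x_out2 and eq(s1, s_out2), on small stacks."""
    for s1, x1, s2 in product(small_lists(), (0, 1), small_lists()):
        if not s2:
            continue  # pop is undefined on the empty stack
        s_out1 = push(s1, x1)
        s_out2, x_out2 = pop(s2)
        if s_out1 == s2 and not (x1 == x_out2 and s1 == s_out2):
            return False
    return True

def check_dequeue():
    """toList(q) == cons(x, toList(q_out)) for every small non-empty queue."""
    for front, back in product(small_lists(), repeat=2):
        q = (front, back)
        if not to_list(q):
            continue  # dequeue is undefined on the empty queue
        q_out, x = dequeue(q)
        if to_list(q) != (x,) + to_list(q_out):
            return False
    return True
```

The queue check only becomes expressible once `to_list` is available, which mirrors the role the conversion function plays in the DSL.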

Application 2: Synthesizing Algebraic Specifications for Modular Synthesis
In many applications of program synthesis, one has to synthesize a function that uses an existing implementation of certain external functions such as data-structure operations; i.e., synthesis is to be carried out in a modular fashion. Even if one has to synthesize only a small function implementation, the synthesizer will need to reason about the large amount of code required to represent the external functions, which can hamper performance. Mariano et al. [2019] recently proposed a new approach to modular synthesis (i.e., synthesis in which functions are arranged in modules) where, instead of providing the synthesizer with an explicit implementation of the external functions, one provides an algebraic specification (i.e., one that does not reveal the internals of the module) of how the functions in a module operate. For example, to describe the semantics of the functions emptySet, add, remove, contains, and size in a HashSet module, one would provide the algebraic properties in Eq. 7, which describe appropriate data-structure invariants, such as handling of duplicate elements.
While their approach has shown promise in terms of scalability, to use this idea in practice one has to provide the algebraic specifications to the synthesizer manually, a tricky task because these specifications typically define how multiple functions interact with each other.
In our case study, we used spyro to synthesize algebraic specifications for benchmarks used in the evaluation of JLibSketch, an extension of the sketch tool that supports algebraic specifications [Mariano et al. 2019]. We considered the 3 modules (ArrayList, HashSet, and HashMap) that provided algebraic specifications, did not use string operations (our current implementation does not support strings), and did not require auxiliary functions that were not present in the implementation to describe the algebraic properties. For each module, JLibSketch contained both the algebraic specification of the module and its mock implementation, i.e., a simplified implementation that mimics the intended library's behavior (e.g., HashSet is implemented using an array). Given the mock implementation of the module, we asked spyro to synthesize most-precise algebraic specifications.
For this case study, designing a grammar that accepted all possible algebraic specifications but avoided search-space explosion proved to be challenging. Instead, we opted to create multiple grammars for each module to target different parts of the algebraic specifications, and called spyro separately for each grammar. For example, if the JLibSketch benchmark contained an algebraic specification $\mathrm{size}(\mathrm{add}(s, x)) = \mathrm{size}(s) + 1$, we considered the grammar to contain properties of the form $\mathit{guard} \Rightarrow \mathrm{size}(\mathrm{add}(s, x)) = \mathit{exp}$. All the DSLs designed for algebraic specifications synthesis were reused by modifying what function symbols could appear in the DSL. The detailed grammars are presented in Appendix C.2. spyro terminated with a best L-conjunction for all the benchmarks (and grammars) in less than 800 seconds per benchmark. spyro was slower than the enumerative baseline presented in Section 5.1.1 for very small languages (|L| < 5) but faster in all other cases. The speedups were not as prominent as for Application 1.
For all but one benchmark, the L-conjunctions synthesized by spyro were equivalent to the algebraic properties manually designed by the authors of JLibSketch. For the implementation of HashMap provided in JLibSketch, for one specific grammar, spyro synthesized an empty L-conjunction (i.e., the predicate true) instead of the algebraic specification provided by the authors of JLibSketch, i.e., $k_1 = k_2 \Rightarrow \mathrm{get}(\mathrm{put}(m, k_1, v), k_2) = v$. Upon further inspection, we discovered that the implementation of HashMap used in JLibSketch was incorrect and did not satisfy the specification the authors provided, due to an incorrect handling of hash collisions. After fixing the bug in the implementation of HashMap, we were able to synthesize the algebraic specification.

Table 1. Evaluation results of spyro. A few representative benchmarks are selected from each application. A (*) indicates a timeout when attempting to prove precision in the last iteration. In that case, we report as total time the time at which spyro timed out. The Enum. column reports the estimated time required to run CheckSoundness for all formulas in the DSL L. This estimate is obtained by multiplying the size of the grammar by the average running time of CheckSoundness.
Because algebraic properties often involve multiple functions, we were not able to separately verify their correctness on all inputs using the Dafny verifier, but the fact that we obtained the same properties that the authors of JLibSketch specified in their benchmarks is a strong signal that our properties are indeed sound.
Finding: spyro can help automate modular synthesis by synthesizing precise algebraic specifications that existing synthesis tools can use to speed up modular synthesis. Thanks to spyro's provable guarantees, we were able to uncover a bug in one of JLibSketch's module implementations.
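The following Python sketch illustrates how a collision-handling bug can falsify the get-after-put property, so that only the predicate true remains sound. Both classes are hypothetical and written for this illustration only: they are not the JLibSketch mock implementation, the specific bug shown is an assumption about what such a bug might look like, and put mutates the map in place for brevity.

```python
class BuggyMap:
    """Hypothetical array-backed map that mishandles hash collisions."""

    def __init__(self, capacity=4):
        self.slots = [None] * capacity

    def put(self, key, value):
        i = key % len(self.slots)
        if self.slots[i] is None or self.slots[i][0] == key:
            self.slots[i] = (key, value)
        # Bug (assumed for illustration): a new key that collides with a
        # different stored key is silently dropped.

    def get(self, key):
        entry = self.slots[key % len(self.slots)]
        return entry[1] if entry is not None and entry[0] == key else None


class ProbingMap(BuggyMap):
    """Same layout, with collisions resolved by linear probing."""

    def _find(self, key):
        n = len(self.slots)
        for step in range(n):
            i = (key + step) % n
            if self.slots[i] is None or self.slots[i][0] == key:
                return i
        raise RuntimeError("map is full")

    def put(self, key, value):
        self.slots[self._find(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._find(key)]
        return entry[1] if entry is not None else None


def get_after_put_holds(map_cls, prior_keys, key, value):
    """Check get(put(m, key, value), key) == value for one map state m."""
    m = map_cls()
    for k in prior_keys:
        m.put(k, 0)
    m.put(key, value)
    return m.get(key) == value
```

With keys 0 and 4 colliding in a 4-slot table, the property fails for `BuggyMap` and holds for `ProbingMap`; a single such counterexample is enough to make the property unsound, which is consistent with a synthesizer returning the empty conjunction.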

Application 3: Automating Sensitivity Analysis
Automatically reasoning about quantitative properties such as differential privacy [D'Antoni et al. 2013] in programs requires one to analyze how changes to a program input affect the program output; e.g., differential privacy typically requires that bounded changes to a function's input cause bounded changes to its output. A common and inexpensive approach to tackle this kind of problem is to use a compositional sensitivity analysis (either in the form of an abstract interpretation or of a type system [D'Antoni et al. 2013]) in which one tracks how sensitive each operation in a program is to changes in its input. For example, one can say that the function $f(x) = \mathrm{abs}(2x)$, when given two inputs $x_1$ and $x_2$ that differ by $d$, produces two outputs that differ by at most $2d$.
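The sensitivity claim for the doubling-and-absolute-value function can be checked by brute force on a small integer range. The sketch below is only a sanity check under assumed bounds (inputs in [-20, 20]); the helper name max_output_gap is ours.

```python
def f(x: int) -> int:
    return abs(2 * x)


def max_output_gap(d: int, lo: int = -20, hi: int = 20) -> int:
    """Largest |f(x1) - f(x2)| over integers x1, x2 in [lo, hi] with |x1 - x2| <= d."""
    return max(
        abs(f(x1) - f(x2))
        for x1 in range(lo, hi + 1)
        for x2 in range(lo, hi + 1)
        if abs(x1 - x2) <= d
    )
```

On this range the largest gap is exactly twice the input distance, so the bound is tight as well as sound.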
While it was easy to identify a precise sensitivity property for the previous function, it is generally tricky to do so for functions involving data structures, such as lists, which are of interest in differential privacy when the list represents a database of individuals [Wang et al. 2016]. In this case study, we considered 9 list-manipulating functions (append, cons, deleteFirst, delete, reverse, snoc, stutter, tail, and cons_delete) and used spyro[sketch] to synthesize precise sensitivity properties describing how changes to the input lists affect the outputs. The cons_delete benchmark uses the query deleteFirst(cons(, ), ) involving the composition of two functions.
For each function $f$, we used spyro to synthesize a property of the form $\mathit{guard} \wedge \mathit{dist}(l_1, l_2) \le d \Rightarrow \mathit{dist}(f(l_1, x_1), f(l_2, x_2)) \le \mathit{exp}$, where guard can be the predicate true or an equality/inequality between $x_1$ and $x_2$ (the grammars vary across benchmarks), dist is the function computing the distance between two lists (we ran experiments using both edit and Hamming distance), and the expression exp (the part to synthesize) can be any linear combination of $\mathrm{len}(l_1)$, $\mathrm{len}(l_2)$, $d$, and constants in the range -1 to 2. When considering all combinations of functions, guards, and distances, we obtained 18 benchmarks. All the DSLs designed for sensitivity analysis were reused by modifying what function symbols could appear in the DSL. The complete grammars are shown in Appendix C.3. spyro terminated with a best L-conjunction for all the benchmarks (and grammars) in less than 250 seconds per benchmark. spyro outperformed the enumerative baseline presented in Section 5.1.1 for every problem (geometric-mean speedups of 3.14× for Hamming-distance sensitivity problems and 11.93× for edit-distance sensitivity problems).
We observed that even for simple functions, sensitivity properties are fairly complicated and hard to reason about manually. For example, spyro synthesizes the following sensitivity L-property for the function deleteFirst ($d_e$ denotes the edit distance):

$d_e(l_1, l_2) \le d \Rightarrow d_e(\mathrm{deleteFirst}(l_1, x_1), \mathrm{deleteFirst}(l_2, x_2)) \le d + 2$

However, if we add a condition that the element removed from the two lists is the same, spyro can synthesize the following L-property that further bounds the edit distance on the output:

$x_1 = x_2 \wedge d_e(l_1, l_2) \le d \Rightarrow d_e(\mathrm{deleteFirst}(l_1, x_1), \mathrm{deleteFirst}(l_2, x_2)) \le d + 1$

When inspecting this property, we were initially confused because we had thought that the edit distance should not increase at all if identical elements are removed. However, that is false, as illustrated by the following tricky counterexample: $l_1 = [1; 2; 3]$, $l_2 = [3; 2; 3]$, and $x_1 = x_2 = 3$. Besides the L-property shown above with bound $d + 1$, spyro also synthesized (incomparable) best L-properties with bounds $\mathrm{len}(l_1) - \mathrm{len}(l_2) + 2$ and $\mathrm{len}(l_2) - \mathrm{len}(l_1) + 2$ for the same query. All combined, these L-properties imply that the edit distance should not increase when $d = 0$.
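The counterexample above can be replayed directly. The sketch below assumes that the edit distance is the standard Levenshtein distance (insertions, deletions, and substitutions, each at cost 1) and that deleteFirst removes the first occurrence of the given element; both function bodies are our own illustrative implementations, not the benchmark code.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit-cost edits)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x by y, or match
            ))
        prev = curr
    return prev[-1]


def delete_first(lst, x):
    """Return a copy of lst without the first occurrence of x (if any)."""
    if x in lst:
        i = lst.index(x)
        return lst[:i] + lst[i + 1:]
    return list(lst)


l1, l2, x = [1, 2, 3], [3, 2, 3], 3
before = edit_distance(l1, l2)
after = edit_distance(delete_first(l1, x), delete_first(l2, x))
```

Here the two inputs are at distance 1, while the outputs [1, 2] and [2, 3] are at distance 2: deleting the same element removes it from different positions, so the distance grows by one.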
Because of the complexity added by the programs that compute the edit and Hamming distances, by the use of unbounded data structures, and by the fact that sensitivity properties are hyperproperties, we were not able to separately verify the soundness of the synthesized L-conjunctions on all inputs using the Dafny verifier. However, we believe that the synthesized properties are indeed sound given that they hold for lists up to length 7 (i.e., the bound imposed by sketch).
Finding: spyro can synthesize precise sensitivity properties for functions involving lists; the synthesized properties would be challenging for a human to handcraft.

Application 4: Enabling New Abstract Domains
One of the most powerful relational abstract domains is the domain of convex polyhedra [Bagnara et al. 2008; Cousot and Halbwachs 1978]. While programs typically operate over int-valued program variables for which arithmetic is performed modulo a power of 2, such as $2^{16}$ or $2^{32}$, existing implementations of polyhedra are based on conjunctions of linear inequalities with rational coefficients over rational-valued variables. This disconnect prevents polyhedra from precisely modeling how values wrap around in int/bit-vector arithmetic when arithmetic operations overflow.
Heretofore, it has not been known how to create an analog of polyhedra that is appropriate for bit-vector arithmetic. Yao et al. [2021] recently defined two domains (Version 1 and Version 2 below), but have only devised algorithms to support Version 2.

Version 1 (bit-vector-polyhedra domain): conjunctions of linear bit-vector inequalities
Version 2 (integral-polyhedra domain): conjunctions of linear integer inequalities

The case study described in this section shows that spyro provides a way to enable precise polyhedra operations for Version 1. The main reason why operations for bit-vector polyhedra have not been proposed previously is that it is challenging to work with relations over bit-vector-valued variables. For example, let $x$ and $y$ be 4-bit bit-vectors. Fig. 4(a) depicts the satisfying assignments of the inequality $x + y + 4 \le 7$ interpreted over 4-bit unsigned modular arithmetic. As seen in the plot, the set of points that satisfy a single bit-vector inequality can be a non-contiguous region.
We instantiated spyro to take as input a formula $\varphi \in L_{BV}$ and return a conjunction $\psi$ of bit-vector inequalities, i.e., the symbolic abstraction [Reps and Thakur 2016, §5] of $\varphi$ in the conjunctive fragment $L_{BV(\wedge)}$. Because spyro computes best L-properties, in this setting it computes the most-precise symbolic abstraction; i.e., the formula $\psi$ computed by spyro is one representation of the most-precise abstraction of $\varphi$ that is expressible as a conjunction of bit-vector inequalities.

Fig. 4. Each subfigure illustrates a bit-vector formula (in green) and the most precise bit-vector polyhedron computed by spyro (i.e., the inequalities above the formulas). Each colored cell in the plots represents a solution in 4-bit unsigned modular arithmetic of the conjunction of the inequalities found by spyro: green cells represent solutions to the original formula, whereas red cells are points that are solutions to the inequalities, but do not satisfy the original formula. In (a) and (b), the conjunctive formula represents the original formula exactly (there are only green cells). In (c), the twelve occurrences of red cells are points that do not satisfy the original formula, but are needed for a conjunctive formula to over-approximate the original formula.
As known from the literature ([Reps et al. 2004; Thakur et al. 2012; Thakur and Reps 2012] and [Reps and Thakur 2016, §5]), operations needed for abstract interpretation, such as (i) the creation and/or application of abstract transformers, and (ii) taking the join of two abstract-domain elements, can be performed via an algorithm for symbolic abstraction. For instance, if $\psi_a$ and $\psi_b$ are two formulas in $L_{BV(\wedge)}$, we can perform the join $\psi_a \sqcup \psi_b$ by $\hat{\alpha}(\psi_a \vee \psi_b)$. (Note that $\psi_a \vee \psi_b$ is not a formula of $L_{BV}$.) For this reason, we say that spyro enables this new abstract domain.
In our experiments, we limited inequalities to two variables $x$ and $y$ on each side, and used 4-bit unsigned arithmetic. Our benchmarks were taken from an earlier study conducted by one of the authors, which on each example used brute force to consider all 16,762,320 non-tautologies of the 16,777,216 4-bit inequalities of the form $a x + b y + c \le d x + e y + f$. That study found that some example formulas had hundreds of thousands of inequalities as consequences. We selected 9 interesting-looking formulas to use as benchmarks, including linear/nonlinear operations, equalities and inequalities, Boolean combinations, and one pair of formulas on which to perform the join operation. (See §5.4.2 and Appendix C.4.)

5.4.1 Quantitative Analysis. spyro[sketch] computed a sound best L-conjunction for 9/9 formulas, and guaranteed that 8/9 were best L-conjunctions. For the query  = /2, spyro[sketch] timed out on a call to CheckPrecision, but the obtained L-conjunction was indeed a best L-conjunction because it defined the exact semantics of the query. spyro computed an L-conjunction for all 9 benchmarks in less than 400 seconds per benchmark, which is 2-5 orders of magnitude faster than the enumerative baseline presented in Section 5.1.1. Each output L-conjunction contained between 1 and 6 L-properties. For this domain, sketch (and hence spyro[sketch]) is sound and precise because we are working with bit-vector arithmetic of fixed bit-width.
spyro[sketch] could not terminate for most of our benchmarks when considering 8-bit arithmetic. Because our examples contain several multiplications, this limitation is not surprising: multiplication is one of the known weaknesses of sketch and its underlying SAT solver. Recently, there have been promising advances in SAT solving for multiplication circuits [Kaufmann et al. 2022] that, if integrated with sketch, we believe would help spyro scale to larger bit-vectors.

5.4.2 Qualitative Analysis.
Examples of results obtained by spyro are shown in Fig. 4. The result that the most-precise abstraction of these formulas could be expressed using only a small number of inequalities was surprising to the authors. In Fig. 4(c), the formula we are abstracting is  = df  ≤  * ). Fig. 4(c) shows that, in addition to the green points that satisfy the non-linear inequality  ≤  * , the bit-vector-polyhedral abstraction found for the formula includes twelve "extra" points, indicated by the red cells. Because spyro finds a most-precise sound bit-vector-polyhedral abstraction of the formula, every sound bit-vector-polyhedral abstraction of it must also include those twelve points. In an earlier study conducted by one of the authors, they used brute force to consider all 16,762,320 non-tautologies of the 16,777,216 4-bit inequalities of the form $a x + b y + c \le d x + e y + f$. That study found that the following numbers of inequalities over-approximated the original formula: 564 for Fig. 4(a), 109,008 for Fig. 4(b), and 456 for Fig. 4(c). Thus, spyro showed that a most-precise abstraction could be 2-4 orders of magnitude smaller than one obtained by brute force.
Finding: spyro can synthesize the most-precise sound bit-vector-polyhedral abstraction of a given bit-vector formula $\varphi$ over 4-bit arithmetic. Furthermore, spyro surprised the authors by showing that the most-precise sound bit-vector-polyhedral abstraction for the presented examples could be precisely expressed with only a handful of bit-vector inequalities.

Further Analysis of spyro's Performance
In the previous sections, we have shown that spyro can synthesize best L-conjunctions for a variety of case studies. In this section, we analyze what parameters affect spyro's running time.
Q1: How do Different Primitives of the Algorithm Contribute to the Running Time? On average, spyro spends 13.69% of the time performing Synthesize, 26.78% performing CheckSoundness, and 42.33% performing CheckPrecision (details in Table 2 in App. C).
It usually takes longer for CheckSoundness and CheckPrecision to show the nonexistence of an example (i.e., to return ⊥) than to find an example. CheckSoundness is one of the simplest queries, but occupies a large portion of the running time because it is expected to return ⊥ many times, whereas CheckPrecision needs to return ⊥ only once for each call to SynthesizeStrongestConjunct. The last call to CheckPrecision (i.e., the one that returns ⊥) often takes a significant amount of time to complete (on average 19.61% of the time spent on each run of SynthesizeStrongestConjunct).
Finding: spyro spends most of the time checking soundness and precision.
Q2: What Parts of the Input Affect the Running Time? The number of L-properties in the language L has a large impact on the time taken by Synthesize (Fig. 5a) and CheckPrecision.
The complexity of the code defining the semantics of various operators has a large impact on how long CheckSoundness takes. The insert and delete operations of BST and the edit distance have relatively complicated implementations, and CheckSoundness takes longer for these problems.
The size and complexity of the example space also affect the running time. The biggest factor contributing to the size of the example space is the number of input and output variables used. The number of possible examples, i.e., variable assignments, increases exponentially with the number of variables. The size of the example space affects not only the total number of queries but also the time that each query takes. Specifically, the number of positive and negative examples affects the time taken by Synthesize or CheckPrecision (notice that CheckSoundness does not take the examples as input), as shown in Figure 5b.
Finding: The running time of spyro is affected by the sizes of (i) the property search space, (ii) the programs that describe the semantics of the operators, and (iii) the example search space.
Q3: How Effective is Line 12 in Algorithm 1? We compared the running times of spyro with and without line 12 (Figure 5c). When line 12 is present, spyro is 3.06% faster (geometric mean) than when line 12 is absent, but both versions can solve the same problems. The optimization is only effective when the language has many incomparable properties that do not imply each other and cause Synthesize to often return ⊥, thus triggering line 10 in Algorithm 1; e.g., in all SyGuS benchmarks the language L is such that the optimization is not used.
Finding: Freezing negative examples is slightly effective.

RELATED WORK
Abstract-interpretation techniques. Many static program-analysis techniques are pitched as tools for checking safety properties, but behind the scenes they construct an artifact that abstracts the behavior of a program (in the sense of abstract interpretation [Cousot and Cousot 1977]). One such kind of artifact is a procedure summary [Cousot and Cousot 1978; Cousot and Halbwachs 1978; Gopan and Reps 2007; Sharir and Pnueli 1981], which abstracts a procedure's transition relation with an abstract value from an abstract domain, the elements of which denote transition relations. Our problem is an instance of the strongest-consequence problem [Reps and Thakur 2016]. Existing techniques for solving this problem rely on properties of the language L that are typical of abstract interpretation. Some techniques work from "below," identifying a chain of successively weaker implicants, until one is a consequence of Φ [Reps et al. 2004]. Other techniques work from "above," identifying a chain of successively stronger implicates, until no further strengthening is possible [Thakur et al. 2012; Thakur and Reps 2012]. Ozeri et al. [2017] explored a different approach, which works from above by repeatedly applying a semantic-reduction operation [Cousot and Cousot 1977]. (A semantic-reduction operation finds a less-complicated description of a given set of states if one exists.) Our work differs from methods that use abstract interpretation in two respects. First, our algorithm is the first to use both positive and negative examples to achieve precision. Second, while our work supports a variety of DSLs specified via a grammar, existing methods require that certain operations can be performed on concrete states and elements of the language L (e.g., joins [Thakur and Reps 2012]), thus limiting the language that can serve as the DSL.
Type inference. Liquid type inference [Hashimoto and Unno 2015; Rondon et al. 2008; Vazou et al. 2014] can infer a weakest precondition from a given postcondition or a strongest postcondition from a given precondition. To make the problem tractable, properties must be specified in a user-given restricted set of predicates that are closed under Boolean operations. Although our work shares some similarities (e.g., looking for properties over a restricted DSL), we tackled a fundamentally different problem because no pre- or post-condition is given as an input, and our algorithm instead looks for best L-properties. Furthermore, our work is not restricted to functional languages.

Invariant inference. Several data-driven, CEGIS-style algorithms can synthesize program invariants. These techniques look for any invariant that is satisfactory for a client verification problem, whereas we do not assume there is a client for whom the properties are synthesized. Without a client, "true" is a sound (and also weakest) but useless specification. The absence of a client requires synthesized specifications to be precise, therefore requiring our new CheckPrecision primitive.
A closely related system is Elrond [Zhou et al. 2021], which synthesizes weakest library specifications that make verification possible in a client program. Elrond allows one to specify a set of target predicates of interest, and finds quantified Boolean formulas with equalities over the variables and the predicates. Because soundness (with respect to the library function) is only checked on a set of inputs, Elrond tries to synthesize weakest specifications using an iterative weakening approach. Their algorithm takes advantage of the structure of supported formulas (they can contain disjunctions), but has some limitations (they can only contain equalities).
The key differences between Elrond and our work are: (i) Our work supports a user-supplied DSL, which enables more generality, but prevents the use of techniques that rely on access to arbitrary Boolean operations. (ii) Our work uses a parametric DSL that can contain complex user-given functions, whereas Elrond only allows parametric atoms (i.e., user-defined Boolean functions), equality over variables, and Boolean combinations of them. Our tool spyro can synthesize the L-property "2 = 0 ∨ − + 2 −  2 = 0," whereas Elrond does not consider arithmetic predicates. (iii) The specifications generated by spyro are most precise with respect to the DSL L, allowing for their reuse in multiple problem instances that use the same function. Such reuse is possible because spyro ensures soundness of the synthesized properties. In contrast, Elrond operates in a closed-box setting and uses a random sampler for soundness, intentionally weakening the synthesized specifications to enhance the likelihood of soundness. (iv) Our work can synthesize arithmetic properties efficiently (e.g., the ones considered in Sections 5.3 and 5.4) as well as complex algebraic properties (e.g., the ones in Section 5.2). In theory, Elrond can describe all the necessary components to express algebraic properties such as the property (, ) = pop(push(, )), but one would need to provide a function that combines pop and push. This approach is feasible if one is interested in only a few possible ways of combining functions, but becomes infeasible once more combinations are possible (e.g., for an arithmetic formula). In short, the two approaches have different goals.
The presence of a user-supplied DSL and absence of a client distinguish our work from other prior work, e.g., abductive inference [Dillig et al. 2012], ICE-learning [Garg et al. 2014], LoopInvGen [Padhi et al. 2016], Hanoi [Miltner et al. 2020], and Data-Driven CHC Solving [Zhu et al. 2018].

Dynamic techniques. Daikon [Ernst et al. 2001, 2007] is a system for identifying likely invariants by inspecting program traces. Invariants generally involve at most two program quantities and are checked at procedure entry and exit points (i.e., invariants form precondition/postcondition pairs). In Daikon, the default is to check 75 different forms of invariants, instantiated for the program variables of interest. The language of invariants can be extended by the user.
spyro differs from Daikon (and follow-up work, e.g., [Beckman et al. 2010]) in two ways: (i) The language L is not limited to a set of predicates, and spyro scales to languages containing millions of properties; (ii) the properties that spyro synthesizes are sound and provably best L-properties.Furthermore, while Daikon's dynamic approach can scale to large programs, we could not find a way to encode our case studies as instances that Daikon could receive as input.
A tool similar to Daikon is QuickSpec [Smallbone et al. 2017], which generates equational properties of Haskell programs from random tests. spyro differs from QuickSpec in two ways: (i) The language L is not limited to equational properties; (ii) the properties that spyro synthesizes are sound and provably best L-properties.

Astorga et al. [2021] synthesize contracts that are sound with respect to positive examples generated by a test generator. We see two main differences between that work and ours: (i) They do not use negative examples, whereas we do. Negative examples are the key to synthesizing best L-properties. (ii) Their work does not allow a parametrized DSL, and their notion of "tight" is with respect to a syntactic restriction on the logic in which the contract is to be specified.

Synthesis of best L-transformers. The paper that inspired our work synthesizes most-precise abstract transformers in a user-given DSL [Kalita et al. 2022]. We realized that their basic insight (use both positive and negative examples; treat positive examples as hard constraints and negative examples as "maybe" constraints) had broader applicability than just creating abstract transformers.
Our work differs in two key ways. First, because our goal is to obtain a formula rather than a piece of code (their setting), we could take advantage of the structure of formulas (in particular, conjunctions) to decompose the problem into (i) an "inner search" to find a best L-property that is an individual conjunct (Alg. 1), and (ii) an "outer search" to accumulate best L-properties to form a best L-conjunction (Alg. 2). Second, Alg. 1 exploits monotonicity, i.e., once a sound L-property is found, there must exist a best L-property that implies it (Lemma 3.1). This observation allows us to use a simplified set of primitives: our algorithm uses a Synthesize primitive, whereas theirs requires a MaxSynthesize primitive, i.e., one that synthesizes a program that accepts all the positive examples and rejects as many negative examples as possible. Our ideas could be back-ported to provide improvements in their setting as well: if the abstract domain supports meet (⊓), they could run their algorithm multiple times to create a kind of "conjunctive" transformer, which would run multiple, incomparable best L-transformers, and then take the meet of the results.

CONCLUSION
This paper presents a formal framework for the problem of synthesizing a best L-conjunction (i.e., a conjunctive specification of a program with respect to a user-defined logic L) and an algorithm for automatically synthesizing a best L-conjunction. The innovations in the algorithm are threefold: (i) it identifies individual conjuncts that are themselves strongest consequences; (ii) it balances negative examples that must be rejected by the L-property being synthesized and ones that may be rejected; and (iii) it guarantees progress via monotonic constraint hardening.
Our work opens up many avenues for further study. One is to harness other kinds of synthesis engines to implement Synthesize, CheckSoundness, and CheckPrecision. Recent work on Semantics-Guided Synthesis (SemGuS) [Kim et al. 2021] provides an expressive synthesis framework for expressing complex synthesis problems like the ones discussed in this paper. SemGuS solvers, such as MESSY [Kim 2022], are able to produce two-sided answers to a problem: either synthesizing a solution, or proving that the problem is unrealizable (i.e., has no solution), which is exactly what is needed for CheckPrecision. On the theoretical side, while we have used first-order logic, it would be interesting to try other logics, such as separation logic [Reynolds 2002] or effectively propositional logic [Itzhaky 2014; Padon 2018]. On the practical side, our work could find applications in invariant generation [Padon et al. 2022] and code deobfuscation [Blazytko et al. 2017].

B THE ENCODING OF OUR PRIMITIVES IN SKETCH
We discuss below how spyro uses the features of sketch to implement the DSL L specified by the given grammar, and the operations Synthesize, CheckSoundness, and CheckPrecision.
The grammar of the DSL L is compiled to a sketch generator. A generator is a function with holes that allows one to build complex programs via recursion. Intuitively, holes are used to allow the synthesizer to make choices about what to output. In our case, holes are used to select which productions are expanded at each point in a derivation tree, and recursion is used to move down the derivation tree. sketch allows one to provide a recursion bound to make synthesis tractable, and spyro's grammar syntax allows one to provide such bounds for each of the nonterminals in the grammar. Fig. 6 shows how the generator genAP for the productions of nonterminal  in Equation 2 can be specified. sketch uses assertions (which are often written within so-called harnesses) to impose constraints on the programs produced by a generator. In our setting, assertions can be used to produce properties that accept all positive examples and reject all must negative examples.
Synthesize is implemented using a call to sketch that finds an L-property produced by the generator that satisfies all the assertions.
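As a rough illustration of the generator idea, the following Python sketch enumerates a small grammar up to a bound and returns the first property that accepts every positive example and rejects every must negative example. It is not sketch code and not spyro's implementation: the grammar (disjunctions of comparisons between variables), the variable names, and the functions gen_atoms, gen_disjunctions, holds, and synthesize are all assumptions made for illustration, and plain enumeration stands in for the solver-based search that sketch performs.

```python
from itertools import combinations, product

# Hypothetical variables of a query o = f(x, y); not taken from the paper.
VARS = ("x", "y", "o")
OPS = {"<=": lambda a, b: a <= b, "==": lambda a, b: a == b}

def gen_atoms():
    """Enumerate atoms `v op w`; each choice of v, op, w plays the role of a hole."""
    for v, w in product(VARS, repeat=2):
        if v != w:
            for op in OPS:
                yield (v, op, w)

def gen_disjunctions(bound):
    """Enumerate disjunctions of at most `bound` atoms (a stand-in for a recursion bound)."""
    atoms = list(gen_atoms())
    for k in range(1, bound + 1):
        yield from combinations(atoms, k)

def holds(prop, example):
    """A disjunction holds on an example if at least one of its atoms holds."""
    return any(OPS[op](example[v], example[w]) for v, op, w in prop)

def synthesize(pos, neg_must, bound=2):
    """Return a property accepting all of `pos` and rejecting all of `neg_must`, or None."""
    for prop in gen_disjunctions(bound):
        accepts_pos = all(holds(prop, e) for e in pos)
        rejects_neg = not any(holds(prop, e) for e in neg_must)
        if accepts_pos and rejects_neg:
            return prop
    return None

# Examples for a max-like function: two positive examples and one must negative example.
pos = [{"x": 1, "y": 2, "o": 2}, {"x": 3, "y": 0, "o": 3}]
neg = [{"x": 1, "y": 2, "o": 0}]
print(synthesize(pos, neg))
```

Returning None when no candidate fits mirrors, within the bounded grammar only, the case in which the synthesis problem has no solution.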
We also use generators to write data-structure constructors, which are functions that can generate data structures that satisfy desired invariants-e.g., list_constructor() can generate any valid list up to a certain length.We bound the size of the inputs in our experiments.
Given a logical specification of a concrete function, we use sketch as a satisfiability solver.For a given candidate L-property, CheckSoundness uses sketch to attempt to synthesize a soundness counterexample.
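The soundness check can be pictured as a search for an input on which the candidate property fails. The Python sketch below does this by exhaustive enumeration over a small unsigned range. The function max2, the 3-bit bound, and the name check_soundness are assumptions for illustration only; spyro delegates this search to sketch rather than enumerating inputs.

```python
from itertools import product

def max2(x, y):
    """Hypothetical reference implementation of a two-argument maximum."""
    return x if x >= y else y

def check_soundness(prop, func, bits=3):
    """Return a tuple (x, y, o) with o = func(x, y) that violates `prop`, or None.

    Only inputs in [0, 2**bits) are examined, so a None result means
    "sound within the bound", not sound in general.
    """
    for x, y in product(range(2 ** bits), repeat=2):
        o = func(x, y)
        if not prop(x, y, o):
            return (x, y, o)
    return None

# A sound candidate yields no counterexample; an unsound one yields a witness.
print(check_soundness(lambda x, y, o: o >= x and o >= y, max2))
print(check_soundness(lambda x, y, o: o == x, max2))
```

A returned tuple is a valid behavior of the function that the candidate wrongly rejects, so it can be added to the set of positive examples.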
CheckPrecision asks sketch to find a property that satisfies all the assertions, together with a new negative example that the newly synthesized property can reject and the previous property accepts (see Figure 7).

Term minimization: To avoid generating unnecessarily complex formulas, we use the minimization feature of sketch to look for terms of smallest size (both for Synthesize and CheckPrecision). Because we are only interested in the final formula, it is enough to apply minimization only in the second call to SynthesizeStrongestConjunct in Alg. 2 (line 12).

Fig. 7. sketch code for CheckPrecision with positive and negative examples. lnegex is the negative example e⁻ generated by the generator list_constructor that would witness imprecision, phi is a function describing the current property φ, psi is a function describing the predicate ψ, and phiprime is a function calling the generator genD to find an L-property φ′ that together with lnegex would form a witness for imprecision.
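The precision check can be approximated by brute force when the candidate properties and the example domain are small and finite. The Python sketch below searches for a pair consisting of a candidate property and a negative example that together witness imprecision. The explicit candidate list, the explicit domain, and the name check_precision are assumptions for illustration; in spyro the search over properties and examples is a single sketch query.

```python
def check_precision(phi, psi, pos, neg, candidates, domain):
    """Search for a witness (phi_prime, e) showing that `phi` is not most precise.

    The witness must satisfy: phi_prime accepts every example in `pos`,
    rejects every example in `neg`, and there is an example e with psi(e)
    such that phi accepts e but phi_prime rejects e.
    Returns the pair (phi_prime, e), or None if no witness exists.
    """
    for phi_prime in candidates:
        if not all(phi_prime(e) for e in pos):
            continue
        if any(phi_prime(e) for e in neg):
            continue
        for e in domain:
            if psi(e) and phi(e) and not phi_prime(e):
                return phi_prime, e
    return None

# Hypothetical setup: examples are pairs (x, o) for a query o = abs(x).
domain = [(x, o) for x in (-1, 0, 1) for o in (-1, 0, 1)]
c0 = lambda e: e[1] >= 0            # candidate property: o >= 0
always = lambda e: True             # trivial property / unconstrained example set

witness = check_precision(always, always, [(-1, 1), (1, 1)], [], [c0], domain)
print(witness[1])
```

Here the trivial property is shown to be imprecise because the candidate o >= 0 still accepts the positive examples while rejecting a new example; once the current property is already o >= 0 and no other candidate is available, the search returns None.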

C EVALUATION DETAILS
In this section, we first describe the benchmarks in detail and then present detailed evaluation metrics.
C.1 Specification-Mining Benchmarks

C.1.1 SyGuS. We "inverted" the roles from the original benchmarks. We selected a few representative reference implementations of functions from the SyGuS repository, and applied spyro to synthesize specifications for them. In this case, we have a "ground-truth" property against which we can compare, namely, the specification that a SyGuS solver would start from to synthesize the function. We bounded the size of integers to 5 bits.

Max. max2, max3, and max4 are problems where the goal is to find a function that computes the maximum of 2, 3, or 4 integers, respectively. For each problem, we provided the reference implementation given in the SyGuS repository, and synthesized properties expressed in the grammar The nonterminal  derives all input and output variables appearing in the given query. The grammar for max2 and max3 allows up to three disjuncts in , the grammar for max4 allows four disjuncts, and the grammar for max5 allows five disjuncts.
Diff. diff is a SyGuS problem where the goal is to find a commutative function that returns the difference of two input variables-i.e.=− ∨ =− whenever =diff(, ).
First, the diff1 benchmark synthesizes consequences of the query =diff(, ) using the following grammar: The nonterminal  derives all input and output variables appearing in the given query, and also the constant 0, which appears in the implementation of the function.
Second, the diff2 benchmark synthesizes consequences of the query  1 =diff( 1 ,  1 ) ∧  2 =diff( 2 ,  2 ), using the following grammar: Using a query in which a function symbol appears twice allows one to express properties such as commutativity.
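To make the two-copy idea concrete, the Python sketch below checks a commutativity-style property over a query that mentions the function twice. Everything here is an assumption for illustration: the variable names x1, y1, o1, x2, y2, o2, the reference implementation of diff, the 3-bit input bound, and the particular property checked, which is not claimed to be the one spyro synthesized.

```python
from itertools import product

def diff(x, y):
    """Assumed reference implementation: a commutative difference of two inputs."""
    return x - y if x >= y else y - x

def commutativity_holds(bits=3):
    """Check (x1 = y2 and y1 = x2) implies o1 = o2 for o1 = diff(x1, y1), o2 = diff(x2, y2)."""
    rng = range(2 ** bits)
    for x1, y1, x2, y2 in product(rng, repeat=4):
        o1, o2 = diff(x1, y1), diff(x2, y2)
        if x1 == y2 and y1 == x2 and o1 != o2:
            return False
    return True

print(commutativity_holds())
```

A single-copy query cannot state this property, because it relates the outputs of two different calls to the same function.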
Array. arrSearch2 and arrSearch3 are SyGuS problems where the goal is to synthesize an expression that finds the index of an element in a sorted tuple of size 2 and 3, respectively. For example,  = arrSearch3( 1 ,  2 ,  3 , ) implies that  is the index of the value  in the sorted tuple ( 1 ,  2 ,  3 ).
For each problem, we synthesize properties expressed in the following grammar: The nonterminal  derives all input variables, while the nonterminal  derives the output and all index constants appearing in the implementation. The grammar for arrSearch2 (resp. arrSearch3) allows up to three (resp. four) disjuncts.
C.1.2 Synquid. We collected all the problems from the Synquid paper [Polikarpova et al. 2016] that involve Lists, Binary Trees, and Binary Search Trees (BSTs), and that do not use other datatypes or higher-order functions. In these cases, we have a "ground-truth" property provided by the polymorphic refinement type used to synthesize the function. The reference code that we used for each problem is a version of the code synthesized by Synquid, rewritten in a C-style language.
We modified elemIndex to return -1 instead of raising an exception when there is no matched element in the list, and added a new problem deleteFirst that deletes only the first occurrence of an element.
List. We considered 13 different methods for Lists constructed by Nil and Cons (we bounded the length of lists generated by list constructors to 10): append, deleteFirst, delete, drop, elem, elemIndex, ith, min, replicate, reverse, snoc, stutter, and take. Given the reference implementation from the Synquid benchmark, we synthesized properties in the grammar: The nonterminal  derives all input and output variables denoting list elements, the nonterminal  derives all integer input and output variables and constants, and the nonterminal  derives all list input and output variables in the implementation. Size comparisons are used only when the output variable is a list, index, or size. The number of occurrences of the nonterminal  on the right-hand side of a comparison is equal to the number of input variables of type list, index, or size.
In addition to the above 13 problems, reverse_twice synthesizes properties for the query  1 =reverse( 1 ) ∧  2 =reverse( 2 ), using the following grammar:

Binary Tree. The first set of Binary Tree problems consists of the method elem and the constructors emptyTree and branch. We provided the reference implementations from the Synquid benchmark, and synthesized properties expressed in the following grammar: The nonterminal  derives all input and output variables for tree elements, the nonterminal  derives all integer input and output variables and constants, and the nonterminal  derives all tree input and output variables that appear in the implementation.
In the second set of Binary Tree problems, we synthesize properties that capture relationships between the constructor branch and the three methods rootval, left, and right, which return the respective elements of branch. Properties are expressed in the following grammar: The nonterminal  derives all input and output variables for tree elements, and the nonterminal  derives all input and output variables of type Tree.

Binary Search Tree. We considered four different methods for BSTs: emptyBST, insert, delete, and elem. We provided the reference implementations from the Synquid benchmarks, and synthesized properties expressed in the grammar in Eq. (14). We provided a data-structure constructor that can generate all valid BSTs up to size 4 using the following operations ( stands for an integer):

C.1.3 Other. We designed 14 problems to cover missing categories: (i) Stack and Queue are data structures implemented using other data structures (i.e., Lists); (ii) Integer Arithmetic covers more integer-arithmetic problems, in particular cases where the input program uses loops.
Stack. We considered three Stack methods (emptyStack, push, and pop) and synthesized properties in the following grammar: The nonterminal  derives all input and output variables for indices or size, and the nonterminal  derives all input and output variables of Stack type.
The data-structure constructor generates all stacks up to size 10 using the following operations: Finally, the problem push_pop synthesizes properties for the query  1 =push( 1 ,  1 ) ∧ ( 2 ,  2 )=pop( 2 ) using the following grammar:

Queue. We considered three Queue methods: emptyQueue, enqueue, and dequeue. We defined a functional-style Queue using two Lists, and provided implementations of emptyQueue, enqueue, and dequeue using List methods. We also provided a function toList that converts a Queue to a single List in which elements of Queue are in the intended order, and synthesized properties expressed in the following grammar: The nonterminal  derives all input and output variables for Queue elements, and the nonterminal  derives all input and output variables of type Queue. We provided a data-structure constructor that can generate all valid queues up to size 4 using the following operations:

Integer Arithmetic. We considered three additional integer-manipulating functions: (i) abs() computes the absolute value of the integer  (the semantics is expressible in SMT-Lib); (ii) linSum() outputs the value of  using a loop that counts up to ; (iii) nonlinSum() computes the sum of all integers from 0 to the value of  using a loop. First, we synthesize properties expressed in the following grammar where the nonterminal  derives all input and output variables, constants that appear in the code, and their negations: Then we synthesize properties in the following grammar, which can express more general properties involving linear combinations of the integer terms of interest: The terms from  1 to   are the input and output variables and constants that appear in the implementation. In the case of nonlinSum, the quadratic term  •  was also included. Finally, for the case of abs(), we also considered the following grammar, which was introduced in Section 3.2.3 to illustrate a case in which the given language forms a well-quasi-order but is not finite.
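For orientation, the three integer-manipulating functions can be written in Python roughly as follows. These are assumed reference implementations based only on the one-line descriptions above: the names abs_ref, lin_sum, and nonlin_sum, the variable names, and the behavior on negative inputs are assumptions, and the benchmark code used with spyro may differ.

```python
def abs_ref(x):
    """Absolute value of the integer x."""
    return -x if x < 0 else x

def lin_sum(x):
    """Return the value of x by counting up to it in a loop (0 for negative x, by assumption)."""
    o = 0
    while o < x:
        o += 1
    return o

def nonlin_sum(x):
    """Sum of all integers from 0 to x, computed with a loop (0 for negative x, by assumption)."""
    o, i = 0, 0
    while i <= x:
        o += i
        i += 1
    return o

# A property with a quadratic term relating input and output of nonlin_sum:
# for non-negative x, 2 * o = x * x + x.
print(all(2 * nonlin_sum(x) == x * x + x for x in range(16)))
```

The last check illustrates why a linear-combination grammar needs the quadratic term for nonlinSum: the relationship between input and output is not expressible with linear terms alone.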
C.2 Algebraic-Specification Benchmarks

For each module, we considered queries that were relevant to synthesizing the algebraic properties described in the JLibSketch benchmarks. We considered fairly small grammars because the modules' implementations contained many lines of code and were otherwise hard to perform synthesis for. For example, the implementation of HashMap contained 150 lines of code and involved arrays.
C.2.1 ArrayList. For the query  = get(add(, ), ), we considered the following grammar: In this case, spyro synthesized the algebraic properties

For the query  = len(emptyArray), we considered the following grammar: In this case, spyro synthesized the algebraic property  = 0 (28)

For the query  = len(add(, )), we considered the following grammar: In this case, spyro synthesized the algebraic property  = len( ) + 1 (30)

C.2.2 HashMap. We implemented the KeyNotFound exception of get using an additional output flag  , i.e., get(, ) refers to the output value when the error flag is not set to true, and  is true when any of the called functions has thrown a KeyNotFound exception. As mentioned in Section 5.2, the initial mock implementation for HashMap provided in JLibSketch was incorrect. In the rest of this section, we assume the corrected mock implementation, which we fixed before synthesizing properties.
For the query  = get(emptyMap, ), we considered the following grammar: In this case, spyro synthesized the algebraic property

For the query  = get(put(,  1 , ),  2 ), we considered the following grammar: In this case, spyro synthesized the algebraic properties

For the query  = put(put(,  1 ,  1 ),  2 ,  2 ), we considered the following grammar: In this case, spyro synthesized the algebraic property

C.2.3 HashSet. For the query  = size(emptySet), we considered the following grammar: In this case, spyro synthesized the algebraic property

For the query  = size(add(, )), we considered the following grammar: In this case, spyro synthesized the algebraic properties contains(, ) ⇒  = size() ¬contains(, ) ⇒  = size() + 1 (40)

In this case, spyro synthesized the following sensitivity properties for cons: the following properties for deleteFirst, the following properties for delete, and the following properties for snoc:

For queries of the form  1 =  ( 1 ) ∧  2 =  ( 2 ), with  ∈ {reverse, stutter, tail}, we considered the following grammar: In this case, spyro synthesized the following sensitivity properties for reverse, the following properties for stutter, and the following properties for tail:

Properties for  1 = deleteFirst(cons( 1 , ), ) ∧  2 = deleteFirst(cons( 2 , ), ) are also expressed in the grammar Eq. (61). In this case, spyro synthesized the following sensitivity properties:

Fig. 8(d) shows the L BV (∧) formula   computed by spyro for   ⊔   , as well as the satisfying solutions of   . The join computation was performed by applying spyro to   . In this case, spyro showed that the result could be expressed with a single inequality, namely, 1 ≤ 7 + 9 + 1.

C.5 Detailed Evaluation for Each Application
Table 2 shows how many times each query is performed for each benchmark in our first application in Section 5.1 and the total running time spent performing each type of query. Table 3 shows the same metrics for the other three applications. "Last iter." denotes the time taken to perform the final call to CheckPrecision from the final call on SynthesizeStrongestConjunct at line 5 of Alg. 2. That call on SynthesizeStrongestConjunct is the most difficult one because it implicitly establishes that the synthesized L-conjunction is indeed a best one.
In Table 2, for the benchmarks above the thick line, the semantics of the involved operations are expressible in SMT-Lib, whereas for the other benchmarks the semantics are expressed in sketch.

Table 2. Evaluation results of spyro for specification-mining. The Enum. column reports the estimated time required to run CheckSoundness for all formulas in the DSL L. This estimate is obtained by multiplying the size of the grammar by the average running time of CheckSoundness. A (*) indicates a timeout when attempting to prove precision in the last iteration. In that case, we report as total time the time at which spyro timed out.
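The Enum. estimate described in the caption is a single multiplication, which the following Python sketch spells out. The function name estimate_enum_seconds and the sample numbers are hypothetical and are not measurements from the tables.

```python
def estimate_enum_seconds(grammar_size, soundness_times):
    """Estimated time to run CheckSoundness on every formula in the DSL.

    grammar_size:    number of formulas derivable from the grammar
    soundness_times: observed running times (in seconds) of CheckSoundness calls
    """
    average = sum(soundness_times) / len(soundness_times)
    return grammar_size * average

# Hypothetical numbers: 1000 formulas, two observed CheckSoundness calls.
print(estimate_enum_seconds(1000, [0.5, 1.5]))
```

The estimate assumes that the observed calls are representative of the time needed for every formula in the grammar.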
3.2.2 Checking Precision. Given an L-property φ, a set of positive examples E⁺ accepted by φ, a set of negative examples E⁻ rejected by φ, and a Boolean formula ψ denoting the set from which examples can be drawn, CheckPrecision(φ, ψ, E⁺, E⁻) checks whether there exist an L-property φ′ and a negative example e⁻ such that: (i) φ′ accepts all the positive examples in E⁺ and rejects all the negative examples in E⁻; (ii) ψ(e⁻) holds and φ′ rejects e⁻, whereas φ accepts e⁻. Formally,

∃φ′ ∈ L, e⁻. ψ(e⁻) ∧ φ(e⁻) ∧ ¬φ′(e⁻) ∧ (∀e ∈ E⁺. φ′(e)) ∧ (∀e ∈ E⁻. ¬φ′(e))

Fig. 5. Evaluation of the running time of spyro for different input sizes and optimizations.

Fig. 8. Each green and red cell represents a solution in 4-bit unsigned modular arithmetic of the indicated formula. In (d), the occurrences of red cells are points that do not satisfy (c), but are needed for a conjunctive formula to over-approximate (c).

Table 3. Evaluation results of spyro for sensitivity analysis and abstract domains. The Enum. column reports the estimated time required to run CheckSoundness for all formulas in the DSL L. This estimate is obtained by multiplying the size of the grammar by the average running time of CheckSoundness. A (*) indicates a timeout when attempting to prove precision in the last iteration. In that case, we report as total time the time at which spyro timed out.