Equivalence by Canonicalization for Synthesis-Backed Refactoring

We present an enumerative program synthesis framework called component-based refactoring that can refactor "direct" style code that does not use library components into equivalent "combinator" style code that does. This framework introduces equivalence by canonicalization, a sound but incomplete technique for checking the equivalence of direct code and combinator code that does not rely on input-output examples or logical specifications. Moreover, our approach can repurpose existing compiler optimizations, leveraging decades of research from the programming languages community. We instantiated our new synthesis framework in two contexts: (i) higher-order functional combinators such as map and filter in the statically-typed functional programming language Elm and (ii) high-performance numerical computing combinators provided by the NumPy library for Python. We implemented both instantiations in a tool called Cobbler and evaluated it on thousands of real programs to test the performance of the component-based refactoring framework in terms of execution time and output quality. Our work offers evidence that synthesis-backed refactoring can apply across a range of domains without specification beyond the input program.


INTRODUCTION
Functional programming languages often provide programmers with higher-order reusable components to achieve functionality like mapping a function over a list. For example, a programmer can write in a combinator style by stringing together these components using the "pipeline" operator (|>) common in languages like Elm, OCaml, and F#, here expressed in Elm:

OVERVIEW
To illustrate the core insight of our approach, we consider a non-recursive example. Consider the following component library:

Our algorithm uses top-down enumerative synthesis to generate, among other candidates, the following sketch, which we will call s:

Fig. 1. A visual overview of our component-based refactoring algorithm. Given a library of components to refactor an input program with, the algorithm enumerates candidate library programs with holes (sketches). The algorithm applies a fixed syntactic transformation (called a canonicalization function) to both the provided input program and the enumerated candidate sketches until it finds a sketch whose canonicalized form unifies with the canonicalized input program. The algorithm concludes by using the result of this unification to fill the holes in the matching candidate sketch.
withDefault ?1 (map ?2 ?3)

The key question is then: How can we check whether this candidate sketch can be used to refactor the input function, main? As a first step, we might consider inlining the library components into the candidate expression, which yields:

The original motivation for deforestation was to remove intermediate allocations in the combinator-style code that functional programmers often want to write. It turns out that what makes this transformation so useful for optimizing functional code is exactly what makes it useful for this synthesis problem as well: it applies to code that functional programmers often want to write! If we now partially evaluate any case expression whose scrutinee is a constructor literal on the deforested example, we arrive at the following sketch:

Higher-order unification of this sketch with the original main function now yields the substitution {?1 ↦ 0, ?2 ↦ λz. f (f z), ?3 ↦ mx}, which, when applied to the enumerated sketch, yields:

Output: withDefault 0 (map (λz. f (f z)) mx)

Figure 1 illustrates how we use this idea for our synthesis framework: we enumerate many candidate programs and apply a fixed syntactic transformation to each one until one unifies with the transformed input program. In the example above, the transformation is deforestation and unification is higher-order unification. While this technique is necessarily incomplete, our evaluation in Section 8 shows that this approach can be made to work well in practice.

Key insight: Naïve inlining (even modulo the equational theory of the λ-calculus) is not sufficient to unify a candidate sketch and the reference program; success requires an additional syntactic transformation (in this case, deforestation).
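The synthesis algorithm itself never executes programs, but as an illustrative sanity check, a small Python model of this example (with None standing in for Elm's Nothing and a hypothetical f) confirms that the direct-style input and the synthesized output agree:

```python
# Python encoding of the worked Maybe example: None plays the role of
# Nothing, and any other value plays the role of Just x.
def with_default(d, mx):
    return d if mx is None else mx

def maybe_map(g, mx):
    return None if mx is None else g(mx)

f = lambda x: x + 1  # f is left abstract in the paper; this choice is ours

def main_direct(mx):
    # Direct style: explicit case analysis on the Maybe value.
    if mx is None:
        return 0
    return f(f(mx))

def main_combinators(mx):
    # Synthesized combinator style: withDefault 0 (map (\z -> f (f z)) mx).
    return with_default(0, maybe_map(lambda z: f(f(z)), mx))

assert all(main_direct(mx) == main_combinators(mx) for mx in [None, 0, 5, -3])
```

This kind of testing is orthogonal to the paper's method; equivalence by canonicalization establishes the equivalence symbolically, without running either program.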
In Section 4, we define precisely what properties this syntactic transformation must satisfy; intuitively, it cannot map semantically distinct programs to the same program. In Section 6, we describe two particular such transformations: one that works for refactoring with higher-order functional combinators in the Elm programming language (as in the non-recursive example from this section and the recursive example from Section 1), and one that works for refactoring with numerical computing combinators in the NumPy library [50] for the Python programming language.

PROBLEM STATEMENT
We now formalize our problem statement via a series of definitions.
Definition 3.1. The sketch closure of a set S equipped with a notion of composition (such as function application) is notated cl?(S) and is the closure of S under composition among elements of the set and the infinite set of named holes {?n : n ∈ ℕ}, where we call n the name of the hole ?n. We call the elements of cl?(S) sketches, even if they do not contain a hole.

Definition 3.2. For our purposes, a language L has (1) an inductively defined set of programs Prog_L (whose equality is syntactic and decidable); (2) a set of values Val_L (whose equality may be undecidable); and (3) a semantics ⟦·⟧_L : Prog_L → Val_L (that may not be computable). We further suppose the existence of a finite library Lib_L ⊆ Prog_L of components, whose (infinite) sketch closure cl?(Lib_L) consists of what we will call component sketches. When clear from context or irrelevant, we omit the subscript L on these terms.

Definition 3.3. A hole substitution for a language L is a finite mapping from hole names in ℕ to closed terms of Prog_L. We notate the set of all hole substitutions for L by Σ_L (again occasionally omitting the subscript). The application of a hole substitution σ ∈ Σ_L to a sketch s ∈ cl?(Prog_L) is notated σ(s) and is defined by replacing each hole ?n present in s whose name n is in the domain of σ with σ(n).
With these definitions, our goal is as follows:

Definition 3.4 (Problem statement). Given a program p ∈ Prog, find a component sketch s ∈ cl?(Lib) and a hole substitution σ ∈ Σ such that ⟦σ(s)⟧ = ⟦p⟧, without any further specification.
Remark 3.5. Sketches separate the parts of the solution that use library components from those that use fragments of the input program or other non-library expressions, which enables a variety of cost metrics, such as maximizing or minimizing the number of components in s, minimizing the number of holes in s, or something else entirely. Indeed, there is always at least one solution (s = ?0 with σ = [0 ↦ p]) and possibly infinitely many (if the identity function id is in Lib, then the n-fold composition of identity functions s = id^n (?0) with σ = [0 ↦ p] is a solution for all n ∈ ℕ). Thus, in practice, an algorithm must balance between "too few" and "too many" components.

Note also that this problem cannot be solved exactly in general because (i) value equality may be undecidable and (ii) the semantics ⟦·⟧ may be uncomputable. Thus, the best we can hope for is an algorithm that can, in many practical cases, tell us whether a candidate filled component sketch σ(s) is semantically equivalent to the reference program p, but in some cases simply return ⊥ to indicate uncertainty.

EQUIVALENCE BY CANONICALIZATION
Before we define such an algorithm, we first establish a few standard preliminary definitions. With these definitions, we can say that our problem definition requires establishing that σ(s) and p are in the same equivalence class of the quotient Prog/Ker ⟦·⟧. These definitions also enable us to leverage the ideas behind Noether's first isomorphism theorem for sets (illustrated in Figure 2):

Key insight.
The key insight is then to choose φ such that Ker φ is a refinement of Ker ⟦·⟧. Noether's first isomorphism theorem then enables us to soundly check the semantic equivalence of p1 and p2 by checking the syntactic equivalence of φ(p1) and φ(p2). This motivates the following definition:

Example 4.4. Let L be the language of integer arithmetic. Then φ1, φ2, and φ3 below are basic canonicalization functions for L (even though φ1 and φ3 are not semantics-preserving) because they are computable and φ(p1) = φ(p2) (syntactic equality) implies ⟦p1⟧ = ⟦p2⟧ (semantic equality):

We can now define Algorithm 1, an algorithmic version of Noether's first isomorphism theorem, to check whether two programs are semantically equivalent, then prove its soundness:

Theorem 4.5. Algorithm 1 always terminates, and if it returns ⊤ on (φ, p1, p2), then ⟦p1⟧ = ⟦p2⟧.
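Example 4.4 and Algorithm 1 can be modeled concretely; phi1 and phi2 below are our own stand-ins for the paper's functions (note that phi2, like φ1 and φ3 in the example, is not semantics-preserving, since its output is not itself an arithmetic program, yet it still satisfies the basic canonicalization property):

```python
# A tiny model of basic canonicalization over integer arithmetic strings.
def phi1(expr: str) -> str:
    """Constant-fold a (trusted) arithmetic program to a literal."""
    return str(eval(expr))  # illustration only: assumes pure arithmetic input

def phi2(expr: str) -> str:
    """Not semantics-preserving (output is not a program of L), but
    syntactic equality of outputs still implies semantic equality."""
    return "result=" + phi1(expr)

def algorithm1(phi, p1, p2) -> bool:
    """Algorithm 1: return True (⊤) only if p1 and p2 are semantically equal."""
    return phi(p1) == phi(p2)

assert algorithm1(phi1, "1 + 2 * 3", "7")
assert algorithm1(phi2, "(1 + 1) * 2", "2 + 2")
assert not algorithm1(phi1, "3", "4")
```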
For concision, we provide all proofs in an appendix in the supplementary materials.

Key takeaway:
We can check semantic equivalence of programs using a combination of (i) syntactic equivalence and (ii) particular kinds of computable syntactic transformations. Algorithm 1 captures the insight of our approach, but the following example shows that it suffers from two key limitations, which we overcome in the next section: (i) it cannot handle sketches, and (ii) syntactic equality can be too restrictive.
Example 4.6. Suppose φ is a basic canonicalization function for the simply-typed λ-calculus and s and p are terms such that φ(s) contains a hole ?1 while φ(p) does not (roughly as in Section 2). Conceptually, φ establishes the semantic equivalence of s and p even though φ(s) has a hole and φ(s) ≠ φ(p), because if ?1 is set to a suitable λ-abstraction, the two terms are equal modulo the equational theory of the λ-calculus. We need (i) to allow holes to be filled and (ii) more flexibility than syntactic equality.

Explicitly Handling Holes
To handle cases like those in Example 4.6, we first introduce holes to our language as follows.
We now define the most relaxed possible generalization of syntactic equality in this new language. Although it is undecidable because it relies on computing semantics and value equality, we will use it to define a correctness criterion for a broader notion of canonicalization functions momentarily.

Definition 4.8. Let L be a hole-free language. The semantic unification relation

Remark 4.12. A basic canonicalization function is a canonicalization function with respect to the relation that discards its third argument and checks syntactic equality of its first two arguments.
The following lemma is one convenient way to make a canonicalization function (however, as in Example 4.4, canonicalization functions need not be semantics-preserving). We provide other sufficient conditions for canonicalization functions in an appendix in the supplementary materials.

We now introduce the full equivalence by canonicalization algorithm in Algorithm 2. The crucial subtlety of this algorithm is that we can soundly apply the substitution generated by unification on the outputs of φ to the inputs of φ; this is precisely why we need to "keep track" of hole substitutions as described in Remark 4.10. The following soundness theorem captures exactly this property.

Theorem 4.15 (Termination and soundness of equivalence by canonicalization). Let φ be a canonicalization function for a hole-free language L with respect to ≍ and let U be an inference algorithm for ≍. Then EbC(φ, U, ·, ·) is a semi-inference algorithm for Ker ⟦·⟧ on the language with holes L?.
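Algorithm 2 can be sketched with first-order syntactic unification standing in for the paper's richer unification relations (our own toy: terms are nested tuples, holes are ("?", n), and the canonicalization function here is just the identity):

```python
# Toy equivalence by canonicalization: canonicalize both sides, then unify.
def unify(sketch, term, subst):
    """First-order unification of a sketch (with holes) against a term."""
    if isinstance(sketch, tuple) and sketch and sketch[0] == "?":
        bound = subst.setdefault(sketch[1], term)  # bind or re-check a hole
        return bound == term
    if isinstance(sketch, tuple) and isinstance(term, tuple):
        return len(sketch) == len(term) and all(
            unify(a, b, subst) for a, b in zip(sketch, term))
    return sketch == term  # leaf symbols must match exactly

def ebc(phi, sketch, program):
    """Return a hole substitution if phi(sketch) unifies with phi(program),
    or None (⊥, uncertainty) otherwise."""
    subst = {}
    return subst if unify(phi(sketch), phi(program), subst) else None

phi = lambda t: t  # placeholder canonicalization for illustration
sketch = ("withDefault", ("?", 1), ("map", ("?", 2), ("?", 3)))
program = ("withDefault", "0", ("map", "g", "mx"))
assert ebc(phi, sketch, program) == {1: "0", 2: "g", 3: "mx"}
```

A real instantiation replaces phi with deforestation or catamorphism rewriting and unify with higher-order unification or e-unification, as described in Section 6.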
Key takeaway: To support sketches and equality modulo theories, we can loosen the restriction of syntactic equivalence when performing equivalence by canonicalization.

How Should We Choose a Canonicalization Function?
Algorithm 2 is parameterized by a canonicalization function φ, but not all choices of φ are equally useful. Intuitively, we want to choose a function φ that maps many input programs to the same output program while still satisfying the canonicalization function requirements. We can formalize this idea by again turning to the notion of a refinement. Refinements give a necessary condition for basic canonicalization functions (Ker φ must be a refinement of Ker ⟦·⟧), but they also tell us "how good" a choice of φ is: we can say that φ1 is a better choice than φ2 if Ker φ2 is a strict refinement of Ker φ1; that is, if φ1 maps more expressions to the same result than φ2 does. To extend this idea to full canonicalization functions, we first introduce the following auxiliary definition.
Whereas before we wanted to keep track of hole substitutions, now we explicitly do not want to do so when assessing whether one canonicalization function is better than another; so long as the hole substitutions are valid, we do not care whether the use of one canonicalization function returns the same hole substitutions as the use of another, but rather whether the use of one canonicalization function succeeds more often than the use of another. The following definition leverages the preceding auxiliary definition to formalize this idea, which we depict visually in Figure 3.

Definition 4.17 (Goodness of canonicalization functions). A canonicalization function φ1 with respect to ≍1 is no worse than a canonicalization function φ2 with respect to ≍2 (written φ1 ⪰ φ2).

Example 4.18. Let L be the language of integer arithmetic. Then φ1, φ2, and φ3 from Example 4.4 (extended to L?) are canonicalization functions for L with respect to the syntactic unification relation ≈.

Theorem 4.19 (Better canonicalization implies better equivalence checking). Let φ1 and φ2 be canonicalization functions for a hole-free language L with respect to ≍1 and ≍2, and let U1 and U2 be inference algorithms for ≍1 and ≍2. Suppose φ1 ⪰ φ2. Then, if EbC(φ2, U2, s, p) succeeds, EbC(φ1, U1, s, p) = σ1 for some hole substitution σ1 ∈ Σ.
Remark 4.20. Theorem 4.19 provides a theoretical grounding for what makes a good canonicalization function. In practice, the goal is to choose φ to map component sketches and programs that the user is likely to write to the same output program; accordingly, equivalence by canonicalization benefits greatly from accurate characterization of programs in the target programming domain.
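The "better-than" intuition can be seen on Example 4.4's language of integer arithmetic: constant folding identifies strictly more semantically equal programs than the identity canonicalization does (our own toy comparison, not the paper's φ's):

```python
# Two candidate canonicalization functions over arithmetic strings.
phi_id = lambda e: e               # identity: only literally equal programs match
phi_fold = lambda e: str(eval(e))  # constant folding (trusted arithmetic only)

# Pairs of semantically equal programs; a better phi matches more of them.
pairs = [("1 + 2", "3"), ("2 * 2", "4"), ("5", "5")]
id_hits = sum(phi_id(a) == phi_id(b) for a, b in pairs)
fold_hits = sum(phi_fold(a) == phi_fold(b) for a, b in pairs)

assert id_hits == 1 and fold_hits == 3  # phi_fold succeeds strictly more often
```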

Key takeaway:
We should choose canonicalization functions that map many input programs to the same output program.

SYNTHESIS FRAMEWORK
We now describe our top-level synthesis framework. At a high level, the framework is an enumerative synthesizer whose search space is candidate sketches and whose satisfaction criterion is given by equivalence by canonicalization (Algorithm 2).

Definition 5.1. An enumerator for a set S is an algorithm Enum(·) that, given a reference program, lazily generates a (possibly infinite) sequence of elements of S.

Remark 5.2. There are many kinds of enumerators in the synthesis literature [10, 67, 78, 88]; the choice of enumerator is orthogonal to the techniques and soundness results we present here.
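Definition 5.1 can be made concrete with a small generator; this is our own illustration over a hypothetical two-component library, not the enumerator Cobbler uses:

```python
# A lazy enumerator of unary component sketches of increasing size.
def enum_sketches(components, max_depth):
    def go(depth):
        if depth == 0:
            yield ("?", 0)  # a bare hole terminates each spine
            return
        for c in components:
            for arg in go(depth - 1):
                yield (c, arg)
    for d in range(1, max_depth + 1):  # smallest sketches first
        yield from go(d)

sketches = list(enum_sketches(["map", "filter"], 2))
assert ("map", ("?", 0)) in sketches
assert ("map", ("filter", ("?", 0))) in sketches
```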
Using this definition of an enumerator and our earlier definition of equivalence by canonicalization in Algorithm 2, we now introduce our component-based refactoring algorithm in Algorithm 3, which we illustrated earlier in Figure 1. The following theorem establishes its soundness, proving that, upon termination, it solves the problem posed in Definition 3.4.
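Algorithm 3's top-level loop can be sketched as follows (our own miniature; check stands in for equivalence by canonicalization, and the enumerator and two-component library are hypothetical):

```python
# Component-based refactoring in miniature: enumerate sketches, accept the
# first one whose equivalence check against the input program succeeds.
def component_based_refactor(enumerate_sketches, check, program):
    for sketch in enumerate_sketches(program):
        subst = check(sketch, program)
        if subst is not None:
            return sketch, subst
    return None  # enumerator exhausted without a match

def enum(_program):
    yield ("inc", ("?", 1))     # sketch: inc ?1
    yield ("double", ("?", 1))  # sketch: double ?1

def check(sketch, program):
    # Trivial stand-in for EbC: match heads, bind the hole to the argument.
    head, hole = sketch
    return {hole[1]: program[1]} if head == program[0] else None

assert component_based_refactor(enum, check, ("double", "x")) == \
    (("double", ("?", 1)), {1: "x"})
```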

Theorem 5.3 (Soundness of component-based refactoring). Let φ be a canonicalization function for a hole-free language L with respect to ≍, let U be an inference algorithm for ≍, let Enum be an enumerator for cl?(Lib_L), and let p be a program. Then, if Algorithm 3 returns (s, σ) (which it does when EbC(φ, U, s, p) = σ for some enumerated sketch s), we have ⟦σ(s)⟧ = ⟦p⟧.

Remark 5.4 (Completeness and termination). The completeness of Algorithm 3 depends on the choices of Lib_L, the canonicalization function φ, the relation ≍, and the enumerator Enum. For sufficiently complex languages and libraries, it is impossible to achieve completeness (which would amount to solving the halting problem), but in general two changes bring the algorithm "closer" to completeness: (i) increasing the coverage of Enum and (ii) choosing φ high in the better-than partial order (Definition 4.17). Of course, (i) introduces a tension between completeness and termination: if the enumerator provides an infinite sequence of programs, then Algorithm 3 may not terminate. Otherwise, however, it will terminate, as equivalence by canonicalization always terminates.

Key takeaway:
To solve the component-based refactoring problem posed in Definition 3.4, we can slot the equivalence by canonicalization algorithm into an enumerative synthesizer by using it as a notion of satisfaction.
Remark 5.5 (Optimizations). An easy optimization is to inline the equivalence by canonicalization procedure and only compute φ(p) once. A more exciting possibility is to leverage term indexing to separate the algorithm into an "offline" and an "online" stage. Briefly, the offline stage would run the enumerator up to a fixed bound and store each canonicalized sketch as a key and the original sketch as a value; then, the online stage would canonicalize a given input program and look it up in the term index. Term indexes for languages beyond first-order logic are a long-running and active area of research [31, 43, 44, 57, 68, 93, 110]; consequently, we defer this extension to future work.
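The offline/online split can be sketched with an ordinary dictionary standing in for a real term index (our own illustration; it handles only exact canonical matches, ignoring the unification aspect that motivates real term indexing):

```python
# Offline: canonicalize each enumerated sketch once, index by the result.
def build_index(sketches, phi):
    return {phi(s): s for s in sketches}

# Online: canonicalize the input program and look it up.
def lookup(index, phi, program):
    return index.get(phi(program))

# Hypothetical canonicalization: sort commutative operands into a fixed order.
phi = lambda t: tuple(sorted(t))
index = build_index([("add", "x", "y")], phi)
assert lookup(index, phi, ("add", "y", "x")) == ("add", "x", "y")
```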

FRAMEWORK INSTANTIATIONS
To demonstrate the applicability of the component-based refactoring (CBR) synthesis framework to different domains, we instantiated it in two different contexts:

(§6.1) CBR-Elm: Refactoring code in the statically-typed functional programming language Elm that uses pattern matches and explicit recursion into code that uses higher-order combinators like List.map and List.filter.

(§6.2) CBR-Python: Refactoring code in Python that uses for loops into code that uses numerical computing functions provided by the NumPy library like np.sum and np.convolve.
In this section, we discuss the high-level choices we made to instantiate the synthesis framework we described in Section 5; we defer low-level implementation details to Section 7. We present these instantiations via detailed worked examples, as the exact formal definitions of the syntactic transformations we rely on (such as catamorphism fusion) are not a contribution of our work.
As we discussed in Remark 3.5, algorithms solving the problem statement in Definition 3.4 must balance between choosing "too few" and "too many" library components; we opt to return the smallest solution with at least one component (ruling out the trivial solution of always returning a hole) by performing top-down enumeration starting from single components of the library rather than from a top-level hole. As we will see in the user study in Section 8.3 and the performance evaluation in Section 8.4, this heuristic was sufficient to produce desirable programs.

Higher-Order Functional Combinators in Elm
For CBR-Elm, we synthesize programs using 17 functions, mostly from the Elm standard library (listed in an appendix in the supplementary materials), relating to the Bool, Maybe, Result, and List types (we include find and findMap functions for List, which happen to be absent from the standard library). We note, however, that our approach extends to arbitrary user-provided libraries with no additional work beyond writing the functions in standard Elm code.
To make CBR-Elm's task harder, we purposely do not include the Maybe, Result, and List catamorphisms in its library (known as maybe, either, and foldr in Haskell), even though doing so would simply require defining these functions in standard Elm code (which we do in Section 9.1). Many real-world functions that use structural recursion can easily be rewritten to use a single catamorphism without any synthesis whatsoever (and, as we will see, this is actually the first step of our canonicalization function). Thus, to force CBR-Elm to find nontrivial compositions of components, we do not provide it the opportunity to return the trivial result of a single catamorphism.
Our CBR-Elm canonicalization function operates with respect to higher-order unification and has two main stages: (1) rewrite the program using catamorphisms, and (2) perform catamorphism fusion wherever possible. We present a worked example here with the notion of a catamorphism specialized to if and foldr, and refer the reader to Meijer et al. [76] for additional background.
Consider the following input program:

To refactor it, CBR-Elm first canonicalizes the input program. Step (1) is to rewrite the right-hand side using catamorphisms:

Step (2) is to perform catamorphism fusion, which in this case is not possible.
Step (1) of canonicalization is to rewrite the sketch using catamorphisms:

Step (2) is to perform catamorphism fusion to rewrite the program into the form foldr g [] ?3. Meijer et al. [76]'s catamorphism fusion theorem says that it is sufficient for g to satisfy a certain equation for all y; this condition is non-constructive in that it does not provide a definition for g. Partial evaluation [41, 56] does not help solve for g here, but we can recursively perform catamorphism fusion on the if expression (also known as deforestation [115]):

This expression higher-order unifies with the canonicalized input program, resulting in the substitution {?1 ↦ λx. f (f x), ?2 ↦ p, ?3 ↦ xs} and the overall output listed above. By Lemma 4.13, this procedure is a canonicalization function for Elm with respect to higher-order unification because it is computable and semantics-preserving, with one exception: Elm is a strict language, and catamorphism fusion is technically valid only under a lazy evaluation scheme because it can change whether or not a program diverges; even in Haskell, the presence of the seq operator makes catamorphism fusion an unsound transformation in general. We inherit the general limitation that syntactic transformations often become unsound in the presence of diverging code.
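To give a concrete (if simplified) feel for catamorphism fusion, here is a Python model with a foldr helper of our own; the fused version performs a single traversal with no intermediate list, which is the deforestation effect described above:

```python
# foldr g b xs, the list catamorphism, modeled over Python lists.
from functools import reduce

def foldr(g, b, xs):
    return reduce(lambda acc, x: g(x, acc), reversed(xs), b)

f = lambda x: x + 1          # hypothetical element function
p = lambda x: x % 2 == 0     # hypothetical predicate

def filter_then_map(xs):
    # Two catamorphisms composed: filter p, then map f.
    kept = foldr(lambda x, acc: [x] + acc if p(x) else acc, [], xs)
    return foldr(lambda x, acc: [f(x)] + acc, [], kept)

def fused(xs):
    # Catamorphism fusion: one foldr, no intermediate list.
    return foldr(lambda x, acc: [f(x)] + acc if p(x) else acc, [], xs)

assert filter_then_map(list(range(10))) == fused(list(range(10)))
```

In Elm, the strictness caveat above applies; this Python model sidesteps it because both versions here traverse finite lists eagerly.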

Numerical Computing Combinators in NumPy
For CBR-Python, we synthesize programs operating over 1D arrays using 21 NumPy functions (listed in an appendix in the supplementary materials), including some that return numbers, such as np.sum; fixed-size arrays, such as np.add and np.convolve; and variable-size arrays, such as filtering (e.g., x[x > 0]). We also include two additional functions we call cosmetic (list and np.vectorize) that provide no performance benefits but can expose opportunities to apply other functions.
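For concreteness, the three output classes named above can be illustrated with a tiny array of our own choosing:

```python
import numpy as np

x = np.array([3.0, -1.0, 4.0, -1.5])

assert np.sum(x) == 4.5                 # returns a number
assert np.add(x, x).shape == x.shape    # returns a fixed-size array
assert list(x[x > 0]) == [3.0, 4.0]     # filtering: variable-size array
```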
Unlike in CBR-Elm, our CBR-Python canonicalization function is specific to these library functions, and adding additional functions would require additional work; it would be interesting future work to compute the necessary rewrites automatically via array-aware program slicing.
Our CBR-Python canonicalization function operates with respect to a small set of relations backed by an e-graph, including symmetric arithmetic rewrites such as x ↔ x · 1 and asymmetric rewrites that capture NumPy broadcasting functionality (in which a scalar can be treated as an array). Here, partial evaluation uses the rules len(np.multiply(a, b)) → len(a) and np.multiply(a, b)[i] → a[i] * b[i]. The second expression unifies with the canonicalized input program, yielding the substitution {?1 ↦ x, ?2 ↦ y, ?3 ↦ s, ?4 ↦ i} and the overall output above.

As a second example, consider the following input program that computes a rolling sum of x:

Again, CBR-Python's canonicalization function happens not to modify this input program. However, consider the candidate sketch np.convolve(?1, np.full(?2, ?3), mode="valid").
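The rolling-sum refactoring can be checked concretely; the loop below is our own reconstruction of a typical direct-style input, and the combinator line instantiates the sketch above with ?1 ↦ x, ?2 ↦ n, and ?3 ↦ 1:

```python
import numpy as np

x = np.arange(12.0)
n = 4  # window size

# Direct style: an explicit for loop accumulating each window.
out = []
for i in range(len(x) - n + 1):
    out.append(sum(x[i:i + n]))

# Combinator style: np.convolve(?1, np.full(?2, ?3), mode="valid")
# filled with x, n, and 1; "valid" keeps only fully-overlapping windows.
rolled = np.convolve(x, np.full(n, 1.0), mode="valid")

assert np.allclose(out, rolled)
```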
Unfortunately, it is impossible to define a semantics-preserving program transformation in Python due to the ability of Python code to arbitrarily inspect the call stack. Such features make it impossible to prove that any program transformation is a canonicalization function. However, for the set of programs on which the above procedure is semantics-preserving, Lemma 4.13 proves that it is a canonicalization function; in practice, we believe this space of programs to be quite large, and none of the synthesized programs in our evaluation in Section 8.4 fell outside this space.

IMPLEMENTATION
We implemented our component-based refactoring instantiations in a tool called Cobbler in approximately 5,000 lines of OCaml code (not including the frontend or the experimental setup we describe in Section 8). We support only a subset of Elm and Python, omitting features such as partially-applied data constructors and nested patterns in Elm, and dictionaries, method calls, and statements other than assignments, calls to .append(), ifs, and fors in Python. To work with e-graphs, we use the OCaml library Ego [114], which is based on the Rust library egg [117].

EVALUATION
Although we are most interested in the synthesis framework itself, rather than our two particular instantiations, we used the instantiations as opportunities to explore whether our framework can work well in practice. Thus, for our empirical evaluation, we investigated two main research questions:

RQ1. How fast does Cobbler run on real-world programs?
RQ2. To what extent does Cobbler improve real-world programs?
We investigated these questions via three experiments detailed in the following sections using a corpus of programs from The Stack dataset [60]. In Experiment 1, we answered RQ1 by measuring Cobbler's synthesis time. In Experiment 2, we answered RQ2 for Elm by running a user study (n = 159) to explore when Cobbler's outputs are preferable to its inputs. In Experiment 3, we answered RQ2 for Python by measuring how long Cobbler's outputs take to run on increasing data sizes (accounting for synthesis time) compared to its inputs.
Our evaluation does not compare against a baseline. An appropriate baseline would need to take a program and, without further specification, soundly refactor it using a given set of library components. The Haskell linter hlint [80] technically meets these requirements, as it provides hard-coded linting rules that replace pattern matches over Maybe expressions with (single) combinators. However, hlint does not support refactoring over other types or using more than one component, yet, as we will see, only 16% of Cobbler's Elm transformations required just one component. Thus, we felt a comparison against hlint would neither be informative for understanding Cobbler nor fair to hlint, given its purpose as a linter rather than a synthesizer. As we discuss in Section 10.2.1, other than hard-coded approaches, Cobbler is the first tool in the category we outline above.
Additional questions. In addition to our main research questions, we asked three additional research questions to gain a better fundamental understanding of our framework and instantiations:

RQ3. How much does semantic unification help Cobbler?
RQ4. How does Cobbler scale with the number of components it uses for a solution?
RQ5. How long does each of the Cobbler sub-components take to run?
We answer RQ3 with an ablation study, RQ4 by running Cobbler on synthetic programs that require increasingly many components, and RQ5 by timing Cobbler at a granular level.

Input Programs
We drew real-world input programs for our experiments from The Stack [60], a dataset of permissively-licensed open-source code from GitHub repositories that includes programs written in over 300 languages. From all 90,637 Elm files, we drew 3,371 Elm functions that immediately pattern-match on a variable of type Maybe, Result, or List; these constituted our input programs for CBR-Elm. From 1,000,000 of the Jupyter notebooks, we drew 572 cells that included a set of variable definitions followed by a for loop followed by a final line consisting of a variable, as in Section 6.2; these constituted our input programs for CBR-Python. These numbers exclude programs using features orthogonal to the synthesis task but which would require additional engineering effort. For each input program, we recorded (i) whether synthesis succeeded or failed, (ii) the synthesized program, if successful, and (iii) how long synthesis took (median of 10 runs). We did not use a hard time cutoff; instead, based on performance on the training set, we chose a synthesis depth cutoff of 3 for Elm and 4 for Python. We ran this experiment (and all following) on a 2020 MacBook Pro with a 2.3 GHz Quad-Core Intel Core i7 processor and 32 GB of RAM running macOS Big Sur with OCaml v4.14 and Python v3.11.
With this setup, Cobbler applied to 2,007/2,960 Elm programs and 102/355 Python programs in the test set. Figure 4 summarizes how many components each refactored program contains; because Cobbler enumerates sketches in increasing size, this metric serves as a rough proxy for the complexity of the task and indicates that Cobbler does more than naïve inlining.

8.2.2 Results. Cobbler's median synthesis time was less than 0.5s across both successful and unsuccessful runs for both Elm and Python, and all successful runs took less than 1s.

Theoretically, as an exhaustive enumerative synthesizer, Cobbler's asymptotic running time is at least exponential in the number of components used (which we validate in Section 8.5.2). In practice, the successful Elm and Python runs both took median synthesis times of 0.02s, and the unsuccessful Elm and Python runs took median synthesis times of 0.21s and 0.39s, respectively. Figure 5 summarizes the synthesis time for all input programs broken down by language and whether or not they succeeded. Figure 6 summarizes the synthesis times for successful runs broken down by language and the number of components used.

Remark. Unsuccessful synthesis times can be arbitrarily inflated or deflated by adjusting the maximum enumeration depth, as an unsuccessful run means that Cobbler must consider the entire search space (except when Cobbler performs a lightweight, syntax-based early cutoff by checking that its library of functions is insufficient for the task). On a successful run, however, the maximum enumeration depth does not affect the synthesis time (other than possibly causing the successful run to become unsuccessful), as Cobbler will stop at the first solution it finds.

Experiment 2: Answering RQ2 (on Cobbler Improving Programs) for Elm

8.3.1 Setup. For Elm, we operationalize "improving programs" as increasing their subjective desirability to real statically-typed functional programmers, which we assessed via a user study approved by our institution's IRB. However, we do not claim that combinator-style programs are always preferable to direct-style programs. Consequently, a result that programmers always prefer the output of Cobbler to its input for Elm would be surprising.
To observe when Cobbler improves Elm programs, we randomly sampled input-output program pairs from the test set of Experiment 1 whose outputs used more than one combinator. Within each of three categories (Maybe, Result, and List programs) we randomly sampled 200 such pairs. (In the case of List, the test set only included 62 such pairs, so we included all 62.) We then created a survey with randomized order that displayed a random sample of these pairs (equally distributed among the three types) and, for each pair, asked participants which program they preferred. We did not refer to either program as "input," "output," "direct-style," "combinator-style," or any other descriptive label; we merely presented their source code in a random order.
We distributed this survey via email, Slack, and X (formerly Twitter) along with a request to share the survey more broadly.We did not compensate participants, and we requested that only self-identified statically-typed functional programmers fill out the survey.In total, we had 159 participants who overall answered 3,206 comparison questions.

8.3.2 Results. Participants preferred combinators (Cobbler's output) in approximately one quarter of the Maybe and Result code (26% and 24%, respectively) and approximately half of the List code (46%). Although not necessarily indicative of causality, Figure 7 shows that participants more often preferred combinators for programs operating over a recursive datatype (List) than those operating over simpler non-recursive datatypes (Maybe and Result). Moreover, Figure 8 shows that participants tended to prefer combinators when the textual size of the program was much shorter than in the direct style, which was most often the case with List programs. We again stress that these trends do not necessarily indicate causality.

8.4 Experiment 3: Answering RQ2 (on Cobbler Improving Programs) for Python

8.4.1 Setup. For Python, we operationalize "improving programs" as increasing their performance. Most input programs from Experiment 1 load input data from the authors' file systems (which we cannot access), load other external input data (such as online resources, also often inaccessible), do not define their input data, or operate on small data represented in the text of the program (for which the equivalent NumPy code is actually slower than the naïve Python). Thus, to observe when Cobbler could provide helpful speedup, we took the successfully-refactored Python programs, manually modified both the input and output programs to operate on data of sizes 10^0, 10^1, ..., 10^8, and recorded how long each of the sixteen variants took to terminate (median of 10 runs).

Results
When using performant NumPy functions, Cobbler's outputs are more often faster than its inputs (even including synthesis time) at a data size of at least 10^6; at a data size of 10^8, the median speedup is 1.95×, and 42/56 programs exhibit speedups. In Section 6.2, we discussed that not all components are intended to improve performance; some are merely cosmetic. We break down our performance results by whether the refactored program uses at least one non-cosmetic function, and we plot speedup over input data size in Figure 9. We used only 100/102 of the transformed Python programs for this experiment because two input programs exponentiated integers to powers that scaled with the size of the input data; these two input programs did not terminate on all data sizes within one hour or crashed due to overflow.

Additional Experiments: Answering RQ3, RQ4, and RQ5
8.5.1 Answering RQ3 (on How Much Semantic Unification Helps Cobbler). We ran Cobbler on the same programs as in Experiment 1, but instead of using higher-order unification for Elm and e-unification for Python, we used syntactic unification (modulo holes). We found that semantic unification approximately doubles the number of programs Cobbler can refactor for Elm, but has little effect for Python. We plot this result in Figure 10.
8.5.2 Answering RQ4 (on Cobbler's Scalability). We ran Cobbler on programs we constructed to require increasingly many components and measured how long Cobbler took to find the solutions. We found that Cobbler scales at least exponentially in the number of components it uses for a solution, as expected of an exhaustive enumerative synthesizer. We plot this result in Figure 11.
8.5.3 Answering RQ5 (on Cobbler's Timing Breakdown). We ran Cobbler on the same programs as in Experiment 1, but with more granular timing instrumentation. Specifically, we measured the time Cobbler spent (median of 10 runs) during (i) canonicalization, (ii) unification, and (iii) enumeration (not including canonicalization or unification). We found that Cobbler spends most of its time performing unification. We plot this result in Figure 12.
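The baseline in the RQ3 ablation, syntactic unification modulo holes, can be sketched in a few lines. This is our illustration (term representation and hole naming are our assumptions), not Cobbler's implementation: walk a sketch and a concrete program in lockstep, bind each hole to the subterm it faces, and fail on any other mismatch.

```python
# Toy syntactic unification modulo holes. Terms are nested tuples
# (operator, child, ...); leaves are strings; holes are strings "?...".
def unify_modulo_holes(sketch, term, subst=None):
    subst = dict(subst or {})
    if isinstance(sketch, str) and sketch.startswith("?"):
        if sketch in subst and subst[sketch] != term:
            return None              # same hole bound two different ways
        subst[sketch] = term
        return subst
    if isinstance(sketch, tuple) and isinstance(term, tuple):
        if len(sketch) != len(term) or sketch[0] != term[0]:
            return None              # different operators: syntactic failure
        for s_child, t_child in zip(sketch[1:], term[1:]):
            subst = unify_modulo_holes(s_child, t_child, subst)
            if subst is None:
                return None
        return subst
    return subst if sketch == term else None
```

For example, `unify_modulo_holes(("map", "?f", "?xs"), ("map", ("lam", "inc"), "mylist"))` binds `?f` and `?xs`, whereas a semantic unifier could additionally equate terms that are merely equivalent rather than syntactically identical.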

Threats to Validity
As we discuss in Section 7, Cobbler does not support the entirety of the Elm and Python languages, which may introduce sample bias in the programs used for all three experiments. Additionally, using only The Stack (which contains programs only from GitHub) may introduce sample bias. For Experiment 2, although the input code is real-world code drawn from The Stack, participants indicated their code style preference removed from a larger context or codebase, which may not be representative of real-world conditions. We also ran Cobbler on functions that programmers decided to commit to GitHub in a direct style, which may not be representative of the kinds of functions programmers would want to run Cobbler on to convert to combinator style. Moreover, survey respondents may not be representative of the statically-typed functional programming community as a whole. For Experiment 3, we generated synthetic variants of the real programs from The Stack with varying input data sizes, which may not be representative of real-world conditions.

DISCUSSION AND LIMITATIONS

When Does Cobbler Fail?
Cobbler refactors many real-world programs, but why does it sometimes fail? To answer this question, we took programs Cobbler failed to refactor and attempted to refactor them manually. For CBR-Elm, we took all 3 failed Maybe programs and randomly sampled 10 failed Result programs and 10 failed List programs. For CBR-Python, we randomly sampled 20 failed Python programs.
For CBR-Elm, we successfully refactored all 23/23 failed Elm programs and found the following. (Elm 1) 3/3 failed Maybe programs failed due to a limitation of our unification algorithm: it cannot unify terms of the form λx. ?h with terms of the form λx. e. (Elm 2) 10/10 failed Result programs and 5/10 failed List programs failed due to an insufficient component library; specifically, not having catamorphisms (as we discuss in Section 6.1) or the trivial pattern-matching combinator List.uncons. We verified that these failures were due to an insufficient library by temporarily adding these functions to the component library and observing that Cobbler could indeed refactor these 15 programs successfully. (Elm 3) 2/10 failed List programs failed due to Cobbler only supporting catamorphisms and not other recursion schemes; specifically, these programs relied on foldl. (Elm 4) 3/10 failed List programs failed due to requiring substantial non-syntactic reasoning (two for using an early-cutoff search idiom that is not possible to capture directly with a catamorphism, and one for requiring an entirely different approach to use a catamorphism).
For CBR-Python, we successfully refactored 19/20 failed Python programs; we were unable to refactor the last one because it required jagged arrays, which NumPy does not support.
For 11/20 failed Python programs, we found that a small semantics-preserving manual modification to the input program enabled Cobbler to refactor the program (we checked this by performing the modification, then re-running Cobbler). We do not suggest that programmers should perform any of these transformations themselves; rather, we showcase them as demonstrations of particular failure modes.
(Py 1) 4/20 failed programs failed due to our insufficient support for np.vectorize (e.g., not introducing lambdas to vectorize over method accesses). Our manual modification was to introduce the relevant lambdas ourselves (e.g., defining f = lambda x: x.m()). (Py 2) 3/20 failed programs failed due to working with dictionaries (which we consider out of scope in Section 7). Our manual modification was to introduce, as a pre-processing step, a single helper variable that encapsulated all the dictionary operations into a single line. (Py 3) 2/20 failed programs failed due to our insufficient support for multi-argument range calls.
Our manual modification was to modify such calls to use the single-argument version. (Py 4) 1/20 failed programs failed due to using a helper variable in the body of a loop. Our manual modification was to inline this variable. (Py 5) 2/20 failed programs failed because Cobbler does not synthesize complex expressions as arguments to some functions. For example, Cobbler does not synthesize the filter operation (x + x)[(x + x) > 0]. Our manual modification, which was a bigger hint than the modifications above, was to introduce intermediate names for some expressions.
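As one concrete illustration, the manual modification for failure mode Py 4 (inlining a single-use helper variable in a loop body) can be mechanized as a small AST rewrite. This is our toy sketch, not part of Cobbler, and it assumes the helper is assigned once and used exactly once in the following statement.

```python
# Toy semantics-preserving pre-processing rewrite: inline `tmp = <expr>`
# when it is immediately followed by the single statement that uses `tmp`,
# so the loop body matches a combinator pattern.
import ast

class InlineSingleUse(ast.NodeTransformer):
    # Assumes the helper variable is used exactly once; a real tool
    # would verify this before rewriting.
    def visit_For(self, node):
        self.generic_visit(node)
        body = node.body
        if (len(body) == 2 and isinstance(body[0], ast.Assign)
                and len(body[0].targets) == 1
                and isinstance(body[0].targets[0], ast.Name)):
            name = body[0].targets[0].id
            replacement = body[0].value
            class Sub(ast.NodeTransformer):
                def visit_Name(self, n):
                    return replacement if n.id == name else n
            node.body = [Sub().visit(body[1])]
            ast.fix_missing_locations(node)
        return node

before = """
total = 0
for x in data:
    y = x * x
    total += y
"""
after = ast.unparse(InlineSingleUse().visit(ast.parse(before)))
# the loop body becomes `total += x * x`, with the helper `y` gone
```

After this rewrite, the loop is a plain accumulation that a pattern-directed synthesizer can more easily match.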
There were no small modifications to the remaining 8/20 failed Python programs that would enable Cobbler to refactor them successfully: (Py 6) 4/20 failed programs failed due to an insufficiently strong canonicalization function for compositions of np.where with other functions. (Py 7) 2/20 failed programs failed due to working with multi-dimensional arrays or lists of arrays (which we consider out of scope in Section 6.2).
(Py 8) 1/20 failed programs failed due to requiring substantial non-syntactic reasoning; specifically, searching for the last element of a list satisfying a predicate, which requires subtle indexing with np.argwhere.
Takeaways. Failure mode Elm 2 shows that the library of combinators has a large impact on what code can be refactored, and Elm 3 suggests that implementing additional recursion schemes (which should be straightforward given their similar fusion laws) could be useful in practice.
Failure modes Py 1 through Py 5 show that equivalence by canonicalization fails if the canonicalization function is not good enough in the sense of Definition 4.17. Given a canonicalization function φ with respect to ≈ and two programs p1, p2 with ⟦p1⟧ = ⟦p2⟧ yet ¬≈(φ(p1), φ(p2)), it is always possible to extend φ and ≈ to a better canonicalization function φ′ ≻ φ with respect to ≈′ such that ≈′(φ′(p1), φ′(p2)). Extending these canonicalization functions thus becomes an engineering tradeoff: which extensions are worth the human effort of implementing them? Indeed, the modifications in Py 1 through Py 4 could easily be incorporated into CBR-Python's canonicalization function, but property-based testing for equality (PBT), which unsoundly checks the equality of two programs by checking that they have the same output on many different inputs, needs no domain-specific rules. Overall, equivalence by canonicalization trades completeness for soundness (which is most evident in Elm 4 and Py 8), and PBT makes the opposite tradeoff.
Lastly, one limitation this discussion does not capture is that equivalence by canonicalization requires source code to be available, whereas PBT, for example, does not.
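For contrast, the PBT-style check discussed above can be sketched in a few lines. This is our toy code (the function names, trial count, and input generator are illustrative assumptions): it needs no domain-specific canonicalization rules, but it must execute the programs and can wrongly accept inequivalent ones.

```python
# Unsound equivalence check by random testing: run both programs on many
# random inputs and report inequivalence only if a counterexample appears.
import random

def pbt_equal(f, g, gen_input, trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen_input(rng)
        if f(x) != g(x):
            return False          # found a witness of inequivalence
    return True                   # "probably equivalent" (unsound)

# Example pair: a direct-style sum vs. the built-in combinator.
def direct_sum(xs):
    total = 0
    for v in xs:
        total += v
    return total

def rand_list(rng):
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
```

Unlike equivalence by canonicalization, this check returning `True` proves nothing; it only fails to find a disagreement.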

Practical Tips for Component-Based Refactoring
Although authors of canonicalization functions can directly use off-the-shelf transformations like catamorphism fusion, knowing which transformations to employ requires domain expertise. Unlike with PBT, for example, we do not expect end users to be able to define new canonicalization functions to check semantic equivalence without such domain expertise. To make implementing new instantiations of component-based refactoring easier, we provide a short sequence of steps for bootstrapping a canonicalization function and unification relation: (1) Identify a baseline notion of equivalence for the domain that is as simple as possible to serve as the unification relation. Our timing breakdown analysis in Section 8.5.3 shows that Cobbler spends most of its time performing unification. Intuitively, canonicalization functions are typically syntax-directed whereas unification typically requires non-directional reasoning, so complex unification relations can be a bottleneck. However, our ablation analysis in Section 8.5.1 shows that it is possible for this relation to be too simple; conversely, in the case of CBR-Python, even though our performance analysis in Section 8.2 shows that Cobbler runs fast in practice, our e-graph unification takes most of the synthesis time, yet does not substantially increase the number of programs Cobbler can refactor. (2) Craft the input and output programs for a simple two-component synthesis problem.
We found it helpful to start with the input and output programs for the non-recursive Elm example in Section 2 and the dot product Python example in Section 6.2. (3) Inline the components in the output program of the previous step. It can be helpful to consider inlining both components (as in CBR-Elm) or just one of them (as in CBR-Python). (4) Manually rewrite the resulting program of the previous step to be more idiomatic for the domain. For example, for Elm, we rewrote the case expression in the scrutinee of another case expression as a single case expression, and, for Python, we rewrote np.multiply(x, y)[i] as x[i] * y[i]. (5) Identify relevant (existing) program transformations that perform this rewrite. Both CBR-Elm and CBR-Python perform a type of fusion (combining multiple data passes into one), suggesting that fusion transformations may be fruitful to investigate first. If no existing transformations exactly apply, the rewrite will need to be generalized to a new transformation.
The goal of Step (4) is to rewrite the output program into a form that unifies with the input program under the relation defined in Step (1). If this unification is achieved, then the transformation in Step (5) constitutes a canonicalization function (assuming it satisfies the required properties); this canonicalization function can then be further expanded by repeating this process for more complicated examples. If this unification is not achieved, it may be because the relation defined in Step (1) is too simple or the example in Step (2) is too complicated for a single transformation.

RELATED WORK

Program Equivalence Checking
Equivalence by canonicalization is a form of sound, automated program equivalence checking. Although proof assistants [87, 89, 108] provide facilities for semi-automatically proving program equivalence, we restrict our review of related work to automatic program equivalence checkers.
10.1.1 Sound Automated Techniques. Program equivalence checking is a type of relational verification, that is, verification of properties that relate multiple programs. Many existing relational verification tools [25, 29, 30, 69, 99, 105, 111], techniques [11, 12, 27, 32, 48, 81, 100, 112, 113], and formalisms [7, 13, 15, 62, 65, 118] depend on logical specification and verification. Techniques based on equality saturation and e-graphs [107, 117] enable syntactic reasoning, but it can be challenging to represent arbitrary canonicalization functions using first-order syntactic rewrite rules. Normalization by evaluation [16] is an approach to normalization in which programs are evaluated to a semantic domain and then reified back to the syntactic domain. A sound (but not necessarily complete) semantic equivalence check can be performed by checking the syntactic equality of the resulting terms, as in equivalence by canonicalization. However, this process relies on evaluation, and thus may not terminate for a Turing-complete language. Equivalence by canonicalization can be viewed as a variant of normalization by evaluation in which (i) evaluation is swapped out for a syntactic transformation that satisfies some (but not all) properties of evaluation while still guaranteeing termination, and (ii) reification is unnecessary and thus omitted. Interestingly, Shashidhar et al. [100] and Verdoolaege et al.
[113] also compute a limited semantics-preserving normal form of associative operators in the context of program dependence graphs by flattening binary associative operator applications into multi-ary applications, indicating that it may be fruitful to consider canonicalization functions over the language of program dependence graphs.
10.1.2 Unsound Automated Techniques. Many verifiers in the previous section can be used in bounded relational verification, in which loops and recursion are unrolled up to a fixed depth and the relational logical property (such as program equivalence) is verified on the resulting program, as in bounded model checking [19]. Alternatively, random testing techniques such as property-based testing [28] and coverage-guided fuzzing [121] do not require or infer logical specifications about the programs at hand, but do require running the programs many times on a variety of inputs.

Program Synthesis
10.2.1 Component-Based Synthesis. Component-based refactoring is a form of component-based synthesis, where the goal is to synthesize compositions of components. This contrasts with recursive functional synthesis, where the goal is to synthesize functions using direct recursion [3, 33, 36, 37, 59, 66, 71, 79, 82, 88, 94, 120]. Component-based synthesizers sometimes rely on logical specifications [55] or types [21, 39, 46, 53, 61, 73]; many use input-output examples [45], including some for functional combinators [40, 51, 102, 106] and numerical computing combinators [14, 83, 101].
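To make the component-based synthesis setting concrete, here is a toy enumerator in the spirit of the framework. It is entirely illustrative: the component library, the depth bound, and especially the probe-based "canonicalization" (evaluation on one fixed input, which is unsound) are our stand-ins; a real instantiation uses a syntactic canonicalization function and unification.

```python
# Toy component-based synthesis: enumerate compositions of library
# components up to a depth bound and accept a candidate when its
# "canonical form" matches that of the reference program.
from itertools import product

LIBRARY = {
    "double_each": lambda xs: [2 * x for x in xs],
    "keep_pos":    lambda xs: [x for x in xs if x > 0],
    "rev":         lambda xs: list(reversed(xs)),
}
PROBE = [3, -1, 4, -1, 5]

def canonicalize(f):
    # Stand-in canonical form: behavior on one probe input. A real
    # canonicalization function is a syntactic transformation instead.
    return tuple(f(PROBE))

def synthesize(reference, max_depth=3):
    target = canonicalize(reference)
    for depth in range(1, max_depth + 1):
        for names in product(LIBRARY, repeat=depth):
            def candidate(xs, names=names):
                for n in reversed(names):     # compose right-to-left
                    xs = LIBRARY[n](xs)
                return xs
            if canonicalize(candidate) == target:
                return names                  # composition of components
    return None

# Direct-style reference: filter positives, then double each.
def reference(xs):
    out = []
    for x in xs:
        if x > 0:
            out.append(2 * x)
    return out
```

The exhaustive `product` loop also illustrates why such enumeration scales exponentially in the number of components, as measured for RQ4.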
Component-based synthesizers sometimes target refactoring, including verified lifting [1, 18, 26, 58, 64] (in which logical summaries of provided programs are inferred and fed to a logic-based synthesizer to find an equivalent formulation in a domain-specific language) and NGST2 [74] (in which a neural-guided synthesizer translates imperative code to functional code annotated with custom logical specifications using a bounded verification check). NGST2's neural architecture is orthogonal to our idea of equivalence by canonicalization; an interesting avenue of future work could be to slot their neural-guided search in as an enumerator in our framework.
Although JLibSketch [75] does not target refactoring, it uses algebraic specifications [49] to synthesize code; these specifications get compiled to logical specifications in JSketch [54] and ultimately Sketch [104]. Like our canonicalization functions, algebraic specifications are syntactic rewrites. However, unlike our canonicalization functions, these rewrites must be of the form pattern ⇒ result, where pattern and result are opaque compositions of library functions. It is therefore not clear to us that these specifications could encode transformations like the non-constructive catamorphism fusion we described in Section 6.1. Moreover, our framework dispenses with logical specifications entirely: canonicalization functions can reuse off-the-shelf transformations developed by the programming languages community without any encoding into logical systems.
Lastly, Smith and Albarghouthi [103] introduce a generalization of the popular optimization in enumerative synthesis of considering only terms in normal form (such as β-normal, η-long terms) by enumerating normal forms of arbitrary term rewriting systems. They rely on syntactic transformations not as a form of equivalence checking of candidate sketches, as we do, but as an optimization for enumerating candidates in the first place.
10.2.2 Using Unification for Synthesis. Although anti-unification is common in program synthesis [2, 5, 6, 24, 34, 53, 77, 95, 96], unification is rarer, but it has been used in e-graph-based synthesis approaches [84, 85] and to ensure input-output example satisfaction [21].
10.2.3 Library Learning. In a sense, the converse of our problem statement is the library learning problem: given a corpus of code, find a library of components with which to rewrite it [22, 24, 34, 35].

CONCLUSION
In this paper, we introduced a sound, automated semantic equivalence check called equivalence by canonicalization. This technique (i) requires only the source code as a specification and (ii) can leverage decades of work from the programming languages community on syntactic transformations that were first developed for optimizing compilers. We use this technique in our component-based refactoring synthesis framework, which translates direct-style programs (like those that use pattern matching and recursion in Elm or for loops and lists in Python) into combinator-style programs (like those that use higher-order functional combinators in Elm or numerical computing combinators provided by NumPy). We found that this technique allowed us to synthesize combinator-style variants of thousands of real programs. Moreover, synthesis based on equivalence by canonicalization is fast; the median synthesis time was less than half a second. We applied our technique to synthesize Elm programs that programmers often preferred and Python programs that ran 1.95× faster, even accounting for synthesis time. The applicability of this synthesis technique for two differing purposes suggests that we can use synthesis to accomplish refactoring across a variety of domains without requiring specification beyond the input program.
map : (a → a) → Maybe a → Maybe a
map f mx =
  case mx of
    Nothing → Nothing
    Just x → Just (f x)

withDefault : a → Maybe a → a
withDefault d mx =
  case mx of
    Nothing → d
    Just y → y

Suppose we wish to use this component library to refactor the following function:

Input
main : (Int → Int) → Maybe Int → Int
main f mx =
  case mx of
    Nothing → 0
    Just x → f (f x)

Proc. ACM Program. Lang., Vol. 8, No. PLDI, Article 223. Publication date: June 2024.

Definition 4.1 (Preliminaries). The quotient of a set S by an equivalence relation ≈ is the set of equivalence classes S/≈ = {[s]≈ : s ∈ S}. An equivalence relation ≈ is a refinement of an equivalence relation ≈′ …

Definition 4.16. Let X and Σ be sets. For a relation U ⊆ X × X × Σ, the forgotten version of U is …

Fig. 3. The canonicalization function ϕ1 is better than the canonicalization function ϕ2 (ϕ1 ≻ ϕ2) because the equivalence classes of the kernel of ϕ1 are coarser than those of ϕ2.

Algorithm 3 (Component-based refactoring).
Parameter: a canonicalization function for the hole-free language L with respect to ≈
Parameter: an inference algorithm for ≈
Parameter: an enumerator Enum for Lib L
Input: a program p ∈ Prog L
Output: a component sketch s ∈ cl?(Lib L) and a hole substitution σ
If Algorithm 3 terminates, it returns a component sketch s ∈ cl?(Lib L) and a hole substitution σ such that ⟦σ(s)⟧L = ⟦p⟧L.
1: for s ∈ Enum(p) do
2: …

Fig. 4. The number of components Cobbler's test-set solutions use. 1,678/2,007 Elm programs and 46/102 Python programs in the test set require more than a single component.

Fig. 6. Cobbler's synthesis time on successful runs in the test set, by language and number of components used. For all such runs, Cobbler provides a synthesis result in <1s. Whiskers are ±1.5 × IQR.

Fig. 7. Preference by program type: how often participants preferred the direct-style program vs. the equivalent combinator-style program.
Fig. 8.

Fig. 9. The speedup (accounting for synthesis time) of the synthetic Python program variants as their input data size increases. The data is broken down by whether the refactored program uses NumPy functions intended to increase performance or only cosmetic functions (list, np.vectorize), that is, by whether or not Cobbler refactors the program to use at least one non-cosmetic function. To account for synthesis time, we define speedup as original program execution time ÷ (refactored program execution time + Cobbler synthesis time). By a data size of 10^6, the median speedup for refactored programs that use performant functions is >1×; by 10^8, this median is 1.95× and 42/56 programs are sped up >1×. Whiskers are ±1.5 × IQR.
Fig. 10.
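Written out as code, the speedup definition reads as follows (the numbers in the example are hypothetical, not from the evaluation):

```python
# Speedup charges Cobbler's synthesis time against the refactored program,
# so a refactoring can "lose" on small inputs even when its raw execution
# time is lower.
def speedup(original_s, refactored_s, synthesis_s):
    return original_s / (refactored_s + synthesis_s)
```

For instance, `speedup(10.0, 4.0, 1.0)` is 2.0 (worth it), while `speedup(0.010, 0.001, 0.5)` is far below 1 because synthesis time dominates on a tiny input.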

Fig. 11. Cobbler's synthesis time on increasingly large synthetic problems. Synthesis time is the median over 10 runs (min-max error bars are smaller than the points).

Algorithm 1 (Basic equivalence by canonicalization).
Parameter: a basic canonicalization function φ for L
Input: two programs p1, p2 ∈ Prog L
Output: either ⊤ or ⊥
1: if φ(p1) = φ(p2) then return ⊤ else return ⊥
Fig. 2. Basic equivalence by canonicalization is an algorithmic version of Noether's first isomorphism theorem. Prog/Ker φ is the set of equivalence classes of programs with the same canonicalized form.
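Algorithm 1 can be made concrete with a deliberately simple canonicalization function. In this sketch (ours, not CBR-Python's), φ is parsing plus AST dumping, which canonicalizes away whitespace, comments, and redundant parentheses; the resulting check is sound (equal ASTs imply equal behavior) but very incomplete.

```python
# Basic equivalence by canonicalization: declare two program texts
# equivalent iff their canonical forms are syntactically equal.
import ast

def phi(source: str) -> str:
    # Canonical form: the dump of the parsed AST. Formatting, comments,
    # and redundant parentheses disappear; semantics-level differences
    # (e.g., commutativity) do not.
    return ast.dump(ast.parse(source))

def ebc_basic(p1: str, p2: str) -> bool:
    return phi(p1) == phi(p2)
```

For example, `ebc_basic("x = (1+2)", "x = 1 + 2  # same")` holds, but `ebc_basic("x = 1 + 2", "x = 2 + 1")` does not, illustrating incompleteness.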
Our problem statement requires establishing the semantic equivalence of a reference program p and a candidate filled component sketch σ(s), in that it requires ⟦p⟧ = ⟦σ(s)⟧. However, checking the equivalence of p and σ(s) may be undecidable, as, in general, (i) equality of values is undecidable … (i.e., σ(p1) ≡L σ(p2)). A relation U ⊆ Prog L? × Prog L? × Σ L is a partial semantic unification relation for L if, for all p1, p2 ∈ Prog L? and σ ∈ Σ L such that U(p1, p2, σ), σ semantically unifies p1 and p2. To take advantage of this definition, we must first define a suitable generalization of kernels. This definition generalizes the notion of a function kernel in the following sense. Let X and Y be sets with f : X → Y. Take Σ = {⊤} and let U(y1, y2, ⊤) hold if and only if y1 = y2. Then Ker_U f ≃ Ker f via the bijection (x1, x2, ⊤) ↦ (x1, x2). Alternatively, if we define U′(y1, y2) to hold if ∃σ ∈ Σ. U(y1, y2, σ) and U′ is an equivalence relation, then we can define f′ : X → Y/U′ by f′(x) = [f(x)]U′ and view Ker_U f as a version of Ker f′ that "keeps track" of the witness σ ∈ Σ that causes U(f(x1), f(x2), σ) to hold for x1, x2 ∈ X.
Algorithm 2 (Equivalence by canonicalization): EbC(φ, A, p1, p2). Let U be a partial semantic unification relation for L and φ : Prog L? → Prog L? be computable and semantics-preserving. Then φ is a canonicalization function for L with respect to U.
Definition 4.14. Let X, Y, and Σ be sets. A semi-inference algorithm for a relation U ⊆ X × Y × Σ is a computable function A : X × Y → Σ ∪ {⊥} with the property that if A(x, y) = σ ∈ Σ, then U(x, y, σ); A is an inference algorithm if A(x, y) = ⊥ implies there is no σ ∈ Σ such that U(x, y, σ).
Fig. 5. Cobbler's synthesis time on the test set, broken down by language and whether or not synthesis succeeded. Median synthesis time is <0.5s in all conditions. Cobbler is considered unsuccessful if it cannot find a solution within the provided maximum component depth (there is no hard time cutoff). Whiskers are ±1.5 × IQR.
… nested patterns in Elm and dictionaries in Python (see Section 7) and Python programs using the pandas library [109]. Aside from this constraint, we drew all such Elm programs and Jupyter cells. We split our input programs into training and test sets. We made all implementation decisions for Cobbler based on the training set (including which library components to use) and ran Cobbler on the test set only for the final evaluation. To split the training and test sets, we split the input files into training and test sets, and all programs extracted from a given file went into the associated set. Using this process, we reserved 411 of the 3,371 Elm programs and 217 of the 572 Python programs for the training set, leaving a test set of 2,960 Elm programs and 355 Python programs. We ran Cobbler on each input program and collected the following information: …