Guided Equality Saturation

Rewriting is a principled term transformation technique with uses across theorem proving and compilation. In theorem proving, each rewrite is a proof step; in compilation, rewrites optimize a program term. While developing rewrite sequences manually is possible, this process does not scale to larger rewrite sequences. Automated rewriting techniques, like greedy simplification or equality saturation, work well without requiring human input. Yet, they do not scale to large search spaces, limiting the complexity of tasks where automated rewriting is effective, and meaning that just a small increase in term size or rewrite length may result in failure. This paper proposes a semi-automatic rewriting technique as a means to scale rewriting by allowing human insight at key decision points. Specifically, we propose guided equality saturation that embraces human guidance when fully automated equality saturation does not scale. The rewriting is split into two simpler automatic equality saturation steps: from the original term to a human-provided intermediate guide, and from the guide to the target. Complex rewriting tasks may require multiple guides, resulting in a sequence of equality saturation steps. A guide can be a complete term, or a sketch containing undefined elements that are instantiated by the equality saturation search. Such sketches may be far more concise than complete terms. We demonstrate the generality and effectiveness of guided equality saturation using two case studies. First, we integrate guided equality saturation in the Lean 4 proof assistant. Proofs are written in the style of textbook proof sketches, as a series of calculations omitting details and skipping steps. These proofs conclude in less than a second instead of minutes when compared to unguided equality saturation, and can find complex proofs that previously had to be done manually.
Second, in the compiler of the RISE array language, where unguided equality saturation fails to perform optimizations within an hour and using


INTRODUCTION
Term rewriting [Baader and Nipkow 1998] provides compositional reasoning and optimizations through a formal theory of syntactic changes and is used from theorem proving [Bachmair and Ganzinger 1994; Hsiang et al. 1992] to compilation [Dershowitz 1993; Hagedorn et al. 2020; Visser et al. 1998]. Automatic theorem provers such as Z3 provide optimized algorithms for equational reasoning [de Moura and Bjørner 2007], and the Glasgow Haskell Compiler [Jones et al. 2001] optimizes a widely-used functional programming language through term rewriting. In both cases, the definition and exploration of the rewriting search space is automatic and scales (for specific domains) to complex problems. Yet, term rewriting offers benefits even for domains or problem sizes where decision procedures do not exist or do not scale well, but where humans succeed in identifying desirable rewrite sequences.
For example, proving some theorems requires specific insights that are hard to come up with automatically. Manually specifying rewrites allows injecting such insights, and conveying a particular chain of reasoning steps. As an illustration, consider the following proof that inversion in groups is an involution: for a group $G$ and an element $g \in G$, we claim that $(g^{-1})^{-1} = g$. The proof is conceptually straightforward and can be found in any standard textbook:

$$(g^{-1})^{-1} = (g^{-1})^{-1} \cdot 1 = (g^{-1})^{-1} \cdot (g^{-1} \cdot g) = ((g^{-1})^{-1} \cdot g^{-1}) \cdot g = 1 \cdot g = g$$

This proof shows how the different group axioms imply this result in a couple of reasoning steps, using the multiplicative identity of one, the defining property of inverse and associativity of multiplication at different points. Interestingly, this proof requires a creative insight in its second step: how did we know it was a good idea, "out of nowhere", to multiply the starting term by $g^{-1} \cdot g$? The axiom of the multiplicative inverse is universally quantified, $\forall x \in G,\ x^{-1} \cdot x = 1$, so there is a potentially infinite number of ways to instantiate it. The insight is crucial for the proof, and automated term rewriting systems struggle to produce this proof. In fact, automatically finding these insights on arbitrary rewrite systems is impossible: word problems on such equational theories are known to be undecidable in general [Evans 1978].
Controlling rewrites manually also has value in optimizing compilers. Strategy languages [Visser et al. 1998] empower programmers to manually control when to apply individual rewrite rules and compose them step-by-step into complex compiler optimizations. Elevate [Hagedorn et al. 2020] shows that manually defined rewriting strategies perform complex optimizations of functional array programs in less than a second of compilation time, producing high performance code. However, these complex optimizations emerge from thousands of rewrite steps, making the rewriting strategies challenging to write manually. Automated rewrite techniques are appealing as they promise to release humans from the burden of specifying when and where rewrite rules are applied, e.g. to prove a theorem or optimize a program. This is particularly important for more complex applications and domains where manual rewriting can be quite difficult and quickly becomes tedious for anything but the simplest of cases.
There are many applications for which existing automatic rewrite techniques work very well. These are, for example, well-behaved applications where the rewrite system has a convergence property, i.e., confluence and termination. A concrete example is the group axioms extended with certain additional identities, like the involutive property of group inverses above, which can be obtained by a method called Knuth-Bendix completion [Knuth and Bendix 1983]. Similarly, many practical but simple compiler optimizations can be achieved with simple greedy rewriting.
Unfortunately, greedy rewriting gets stuck in local optima. Equality saturation [Tate et al. 2009; Willsey et al. 2021] and the related Congruence Closure [Nelson and Oppen 1980; Nieuwenhuis and Oliveras 2007] avoid this problem by using a graph data structure to efficiently represent and rewrite many equivalent terms. Aggressively exploring many rewrites in this way is expensive, and in some cases the graph grows exponentially, quickly exceeding practical memory limits.

Proc. ACM Program. Lang., Vol. 8, No. POPL, Article 58. Publication date: January 2024.
We observe that when reasoning about rewrites informally, humans tend to skip simple steps but discuss intermediate points. A proof like the one above for groups might be written like that for didactic reasons in an introductory course, but usually, only the key insight would be shown. One might write, e.g. $(g^{-1})^{-1} = (g^{-1})^{-1} \cdot (g^{-1} \cdot g) = g$. This is a proof sketch, skipping some steps and omitting details like the concrete rewrites, but the details left out are much simpler to reconstruct than the key insight. Similarly, when discussing program transformations, we think in intermediate steps and sketch out how the transformed terms roughly ought to look, without writing out entire programs. So we may seek to tile a loop, without specifying exactly how its body changes.
Based on this insight, this paper proposes guided equality saturation, a novel semi-automatic rewriting technique that factors complex rewrite problems into a sequence of simpler rewrite problems, each sufficiently simple to be solved by equality saturation. The expert controls rewriting by specifying a sequence of guides that provide human insight. If problems are too complex, additional guides make them feasible. Thus, guided equality saturation gradually scales to increasingly complex problems, as we will see in our two case studies.
Our theorem proving case study demonstrates that guided equality saturation allows more proofs to be found significantly faster, compared to unguided techniques that sometimes fail to find a proof at all. We show that human-added guides closely resemble the reasoning steps found in textbooks.
Our program optimization case study shows that guided equality saturation allows more optimizations to be found with significantly lower computational cost, compared to unguided equality saturation that fails to find optimizations by exceeding runtime or memory limits. We demonstrate how human experts can write guides as incomplete term sketches, greatly simplifying the specification of guides. The sketch guide is instantiated to a complete term by the equality saturation.
To summarize, the main contributions of this paper are:
• Guided equality saturation, a semi-automated technique offering a novel, practical trade-off between manual rewriting and automated rewriting. Human-provided guides break a single infeasible rewrite problem into a sequence of feasible rewrite problems, each solvable with equality saturation. Guides can either be specified as concrete intermediate terms, or in many cases as imprecise sketches to simplify their specification.
• A theorem proving case study integrating guided equality saturation in the Lean 4 theorem prover [de Moura and Ullrich 2021] and demonstrating proofs from group theory and ring theory as examples. The insights injected as guides either make proofs up to two orders of magnitude faster, or make proofs feasible where fully automated techniques fail.
• A program optimization case study using guided equality saturation to optimize programs in the functional array language RISE [Steuwer et al. 2022]. Seven advanced optimizations of a matrix multiplication application are implemented, including loop blocking, vectorization, and multithreading. Unguided equality saturation fails to perform the five most complex optimizations, even given an hour and 60 GB of RAM. Using at most three sketch guides, each 10 times smaller than the complete program, all optimizations are applied in seconds using less than 1 GB of RAM.
We start in Section 2 with some background on equality saturation, the automated technique we build on, before introducing guided equality saturation in Section 3. We present our case studies in Sections 4 and 5, before discussing related work in Section 6. We discuss how to use guides in Section 7, and conclude in Section 8.

STRENGTHS AND WEAKNESSES OF EQUALITY SATURATION
This section provides some background on equality saturation, the automated technique we build on. We demonstrate its strengths compared with greedy rewriting, a popular automatic technique used, for example, in GHC [Jones et al. 2001] and LLVM [Lattner and Adve 2004]. We then show equality saturation's weaknesses by discussing examples that are too challenging for it.

Greedy Rewriting and the Phase Ordering Problem
Greedy rewriting applies rewrites greedily, aiming to minimize a cost function in every local rewrite step. Let us look at a practical example related to optimizing functional array code. We want to fuse operators to avoid writing intermediate results to memory, such as for this program:

$$\mathit{map}\ (\mathit{map}\ f) \circ (\mathit{transpose} \circ \mathit{map}\ (\mathit{map}\ g)) \qquad \text{(A)}$$
$$\mathit{map}\ (\mathit{map}\ (f \circ g)) \circ \mathit{transpose} \qquad \text{(B)}$$

The initial program (A) applies function $g$ to each element of a two-dimensional matrix using two nested maps, transposes the result, then applies function $f$ to each element. The optimized program (B) avoids an intermediate matrix, transposing the input before applying $g$ and $f$ to each element. (B) can be derived from (A) by applying the following rewrite rules in the correct order:

$$\mathit{transpose} \circ \mathit{map}\ (\mathit{map}\ a) \longleftrightarrow \mathit{map}\ (\mathit{map}\ a) \circ \mathit{transpose} \qquad (1)$$
$$a \circ (b \circ c) \longleftrightarrow (a \circ b) \circ c \qquad (2)$$
$$\mathit{map}\ a \circ \mathit{map}\ b \longrightarrow \mathit{map}\ (a \circ b) \qquad (3)$$

To perform greedy rewriting, we have to choose a cost function. As minimizing term size results in maximizing the number of fused operators, it is a reasonable choice. If we greedily apply only rewrite rules that decrease the term size, we will only apply rule (3), as this is the only rule that reduces term size (from 5 to 4 terms). However, rule (3) cannot be directly applied to term (A): it is a local optimum. The only way to reduce term size further is to first apply the other rewrite rules, which may or may not pay off depending on future rewrites. We observe a similar phenomenon for the theorem proving example in the introduction, where we first have to increase the term size before we can reduce it. But in greedy rewriting, we do not apply rules that do not make local progress towards our cost function.
This makes it easy to get stuck in a local optimum, and, therefore, greedy rewriting does not work well for applications where the global optimum is significantly better than local optima. The challenge is that often the global benefit of applying a rewrite rule depends on future rewrites. In the compiler community, the problem of automatically deciding when to apply each rewrite rule is known as the phase ordering problem [Touati and Barthou 2006].
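The local optimum can be reproduced with a few lines of Python. This is only an illustrative sketch under stated assumptions: terms are encoded as nested tuples, the three rules are hand-coded as a transpose/map commutation rule, composition associativity and map fusion (our reading of rules (1) to (3)), and the names `size`, `rewrites`, `greedy` and `explore` are hypothetical helpers. The exhaustive `explore` function works on a plain set of terms, not on an e-graph.

```python
# Terms are strings (leaves) or tuples: ("comp", a, b) is a ∘ b, ("map", a) is map a.

def size(t):
    """Term size: number of symbols in the term."""
    return 1 if isinstance(t, str) else 1 + sum(size(c) for c in t[1:])

def is_map(t):
    return not isinstance(t, str) and t[0] == "map"

def root_rewrites(t):
    """Terms obtained by applying one rule at the root of t."""
    out = []
    if isinstance(t, str) or t[0] != "comp":
        return out
    _, a, b = t
    # (1) transpose ∘ map (map x) <-> map (map x) ∘ transpose
    if a == "transpose" and is_map(b) and is_map(b[1]):
        out.append(("comp", b, "transpose"))
    if b == "transpose" and is_map(a) and is_map(a[1]):
        out.append(("comp", "transpose", a))
    # (2) a ∘ (b ∘ c) <-> (a ∘ b) ∘ c
    if not isinstance(b, str) and b[0] == "comp":
        out.append(("comp", ("comp", a, b[1]), b[2]))
    if not isinstance(a, str) and a[0] == "comp":
        out.append(("comp", a[1], ("comp", a[2], b)))
    # (3) map a ∘ map b -> map (a ∘ b)
    if is_map(a) and is_map(b):
        out.append(("map", ("comp", a[1], b[1])))
    return out

def rewrites(t):
    """Terms obtained by one rule application anywhere in t."""
    yield from root_rewrites(t)
    if not isinstance(t, str):
        for i in range(1, len(t)):
            for c in rewrites(t[i]):
                yield t[:i] + (c,) + t[i + 1:]

def greedy(t):
    """Repeatedly take a rewrite that strictly lowers term size, if any."""
    while True:
        better = [u for u in rewrites(t) if size(u) < size(t)]
        if not better:
            return t
        t = min(better, key=size)

def explore(t, iterations):
    """Apply all rewrites breadth-first, then return a smallest term seen."""
    seen = {t}
    for _ in range(iterations):
        seen |= {u for s in seen for u in rewrites(s)}
    return min(seen, key=size)

MMF = ("map", ("map", "f"))
MMG = ("map", ("map", "g"))
A = ("comp", MMF, ("comp", "transpose", MMG))
B = ("comp", ("map", ("map", ("comp", "f", "g"))), "transpose")

print(greedy(A) == A)      # greedy rewriting makes no progress on (A)
print(explore(A, 4) == B)  # exploring size-preserving rewrites reaches (B)
```

Under these assumptions, every single rewrite of (A) keeps the term size at 9, so `greedy` returns (A) unchanged, while four rounds of exhaustive exploration reach the size-7 program (B).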

Mitigating the Phase Ordering Problem with Equality Saturation
Different communities have developed techniques to mitigate the phase ordering problem, and related problems of finding sequences of rewrites, by automatically exploring many possible ways to apply rewrite rules. The theorem proving community has developed Congruence Closure [Nelson and Oppen 1980; Nieuwenhuis and Oliveras 2007], after which the program optimization community developed the closely-related equality saturation technique [Tate et al. 2009; Willsey et al. 2021] that we focus on in this paper.

Equality Saturation.
Starting from an input term, equality saturation grows an equality graph (e-graph) by applying all possible rewrites iteratively until reaching a fixed point (saturation), achieving a performance goal, or timing out. An e-graph efficiently represents a large set of equivalent terms and is grown by repeatedly applying rewrite rules in a purely additive way. Instead of replacing the matched left-hand side of a rewrite rule with its right-hand side, the equality between both sides is recorded in the e-graph. After growing the e-graph, the best program found is extracted from it using a cost model, e.g. one that selects the fastest program.

Exploring Past the Local Optimum with Equality Saturation
The e-graph (equivalence graph) is used to efficiently represent and rewrite a set of equivalent terms, intuitively:
• An e-graph is a set of equivalence classes (e-classes)
• An e-class is a set of equivalent nodes (e-nodes)
• An e-node $F(e_1, .., e_n)$ is an $n$-ary function symbol ($F$) from the term language, associated with child e-classes ($e_i$)

Figure 1a shows the e-graph for the program (A) from Section 2.1. E-classes are shown as boxes with a dashed outline and e-nodes are shown as boxes with a solid outline. The e-class children of an e-node are shown as directed edges forming the e-graph.
To start the exploration phase, the initial e-graph is iteratively grown by applying rewrite rules non-destructively (Figures 1b to 1d). On each equality saturation iteration, all possible rewrites are applied in a breadth-first manner. This contrasts with standard term rewriting where a single possible rewrite is selected in a depth-first manner, requiring careful ordering of rewrite rule applications. In Figure 1, we only apply a handful of rewrite rules per iteration for the sake of simplicity. Rewrite rule applications are considered even if they do not locally lower cost, which avoids getting stuck in local optima. When applying a rewrite rule, the equality between its matched left-hand side and its instantiated right-hand side is recorded in the e-graph. In contrast, standard term rewriting destructively replaces the matched left-hand side with the instantiated right-hand side, producing a new term from the initial one.
Crucially for efficiency, an e-graph is far more compact than a naive set of terms, as equivalent sub-terms are shared. E-graphs can represent exponentially many terms in polynomial space, and even infinitely many terms in the presence of cycles [Willsey et al. 2021]. To maximize sharing, a congruence invariant is maintained: intuitively, identical e-nodes should not be in different e-classes (Figure 2). Later we will see that even extensive sharing does not necessarily prevent e-graph sizes from growing exponentially or exceeding practical memory limits.

Extracting a Global Optimum with Equality Saturation.
The exploration phase terminates, and rewrite rules stop being applied, when a fixed point is reached (saturation), or when another stopping criterion is met (e.g. a timeout, or a certain goal has been achieved). If saturation is reached, it means that all possible rewrites have been explored. After the exploration of rewrites has terminated, we extract a term from the e-graph according to a cost function, e.g., the one with the smallest term size. The extracted term is a global optimum if saturation was reached.
The extraction procedure can be a relatively simple bottom-up e-graph traversal, similar to Knuth [1977]'s generalization of Dijkstra's algorithm, if the cost function is local. Non-local cost functions require more complex extraction procedures [Wang et al. 2020; Wu et al. 2019].
A local cost function $c$ can be defined as a function of a term language symbol $F$ and the costs of its children, i.e. $c$ has signature $c(F(k_1 : K, .., k_n : K)) : K$ with costs of type $K$. For example, term size is a local cost function: $\mathit{termSize}(F(k_1, .., k_n)) = 1 + \sum_i k_i$. The term sizes computed during a bottom-up extraction procedure are shown in Figure 3. The figure reveals that the grown e-graph contains a smaller term of term size 7 in the same e-class as the original term (A) of term size 9 (top left in Figure 3). The extracted term is indeed program (B).
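Bottom-up extraction with the term-size cost can be sketched in Python as a fixed-point computation. The e-graph below is hand-built for illustration and is an assumption: it only contains the e-nodes of a program (A) that maps g, transposes, then maps f, together with its fused form (B), placed in one root e-class. The dictionary encoding and the name `extract_smallest` are hypothetical.

```python
def extract_smallest(egraph, root):
    """Return (term size, term) of a smallest term represented by e-class root.

    egraph maps an e-class id to a list of e-nodes (op, child e-class ids).
    Costs are propagated until no e-class can be improved any further.
    """
    best = {}
    changed = True
    while changed:
        changed = False
        for cls, nodes in egraph.items():
            for op, children in nodes:
                if all(c in best for c in children):
                    cost = 1 + sum(best[c][0] for c in children)
                    if cls not in best or cost < best[cls][0]:
                        term = (op, *[best[c][1] for c in children])
                        best[cls] = (cost, term)
                        changed = True
    return best[root]


EGRAPH = {
    "f": [("f", ())], "g": [("g", ())], "t": [("transpose", ())],
    "mf": [("map", ("f",))], "mmf": [("map", ("mf",))],
    "mg": [("map", ("g",))], "mmg": [("map", ("mg",))],
    "tmmg": [("comp", ("t", "mmg"))],            # transpose ∘ map (map g)
    "fg": [("comp", ("f", "g"))],                # f ∘ g
    "mfg": [("map", ("fg",))], "mmfg": [("map", ("mfg",))],
    # Root e-class: program (A) and program (B) are recorded as equal.
    "root": [("comp", ("mmf", "tmmg")), ("comp", ("mmfg", "t"))],
}

cost, term = extract_smallest(EGRAPH, "root")
print(cost, term)
```

Because term size is local and monotonic, each e-class only needs to remember its cheapest representative, which is what keeps this traversal simple.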

Successful Applications of Equality Saturation.
The map/transpose example (A) demonstrates how equality saturation can succeed where greedy rewriting is not sufficient. This is also true for many applications in other domains. In theorem proving, while a proof for $a^{-1} \cdot (a \cdot b) = b$ can easily be found via greedy rewriting from the group axioms, a proof of $1^{-1} = 1$ is out of the reach of greedy rewriting but can be found with equality saturation. We can think of this as analogous to overcoming local optima, by taking as cost function the "simplicity" of the term, as defined by the rewrites that will simplify the term (i.e., find a normal form if the rewrite system is Noetherian).
A renewed interest in equality saturation has been sparked by the recent egg library [Willsey et al. 2021] with applications in optimizing linear algebra [Wang et al. 2020], shrinking 3D CAD (Computer-Aided Design) models [Nandi et al. 2020], optimizing deep learning programs [Smith et al. 2021;Yang et al. 2021], vectorizing digital signal processing code [VanHattum et al. 2021], inferring rewrite rules [Nandi et al. 2021], and more.
The closely related congruence closure technique has been effectively used to produce proofs in proof assistants [Nieuwenhuis and Oliveras 2005; Selsam and de Moura 2016] and automated theorem provers [de Moura and Bjørner 2007].

The Limits of Equality Saturation
While equality saturation has found many successful uses, it has limits. We have seen already in the introduction a theorem proving example that does not work with equality saturation, as crucial human insight is required for the proof. In this section, we explore an additional example highlighting a significant, practical scaling limitation of equality saturation. In this example, the e-graph grows extremely rapidly, adding many e-classes and e-nodes in each iteration without enough sharing opportunity to fit the e-graph within a practical memory limit.

Reaching the Limits of Equality Saturation.
Loop tiling is a traditional compiler optimization that improves memory access patterns and with it cache performance. Tiling is typically performed on multiple nested loops. To demonstrate the scaling behavior of equality saturation, we first attempt to perform the tiling of a single loop, then two nested loops and finally three nested loops. As before, we perform the optimizations on functional array code, but now the first parameter of map represents the array size. We use the $\circ$-associativity and map-fusion rewrite rules (see (2) and (3) in Section 2.1) and two more rules, (4) and (5).

To perform one-dimensional tiling, a single loop (represented by the functional map) is split according to the rules into two nested loops. This is a trivial rewrite only applying rule (5). A more complex sequence of rewrites has to be performed to tile two nested loops into four. This is a bit more challenging, but still within easy reach of equality saturation. Finally, when tiling three nested loops into six, equality saturation struggles and fails to perform the rewrite.

Figure 4 shows the memory footprint required by equality saturation for performing 1D (4a), 2D (4b), and 3D (4c) tiling. The e-graph for three-dimensional tiling grows very quickly, requiring large amounts of memory before the rewrite completes. This points to a general characteristic of equality saturation: either a successful rewrite sequence is found relatively quickly, or computational costs explode.
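What one-dimensional tiling means semantically can be checked on plain Python lists. This is a hedged sketch: the helpers `split`, `join` and `tiled_map` are hypothetical names, and the split-then-join formulation is an assumed reading of how a single map is turned into two nested maps; the paper's exact rules are not reproduced here.

```python
def split(m, xs):
    """Cut xs into consecutive tiles of m elements (len(xs) divisible by m)."""
    return [xs[i:i + m] for i in range(0, len(xs), m)]

def join(xss):
    """Flatten a list of tiles back into a single list."""
    return [x for xs in xss for x in xs]

def tiled_map(m, f, xs):
    """One loop over tiles and one loop inside each tile, instead of one flat loop."""
    return join([[f(x) for x in tile] for tile in split(m, xs)])

data = list(range(12))
square = lambda x: x * x
# The tiled version computes the same result as the flat map.
print(tiled_map(4, square, data) == [square(x) for x in data])
```

The equality checked at the end is the semantic content of the tiling step: the loop structure changes, the computed array does not.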

Ways to Reduce E-Graph Growth.
Various ad-hoc ways to limit the growth of e-graphs have been discussed, such as limiting the number of rules applied [Wang et al. 2020; Willsey et al. 2021], which risks not finding rewrites that require an omitted rule. An application-specific alternative is to use an external solver to speculatively add equivalences [Nandi et al. 2020], but this requires the identification of sub-tasks that can benefit from being delegated. It is also possible to trade off between the exploitation of greedy rewriting and the exploration of equality saturation [Kourta et al. 2022], but this requires a good enough heuristic cost function to make local decisions.
As an alternative mitigation to this issue, this paper proposes a novel semi-automatic approach: allowing experts to guide the rewrite process by breaking a single infeasible equality saturation process into a sequence of feasible equality saturations. On top of reducing computational cost, this also provides a mechanism for experts to inject insights into the rewrite process, supporting applications such as the theorem proving example from the introduction.

GUIDED EQUALITY SATURATION
This section introduces guided equality saturation. Figure 5 illustrates how a human expert guides the rewrite process towards a final goal by specifying a sequence of intermediate goals that we call guides. By doing so, they break a rewrite problem that is too complex for unguided equality saturation into simpler problems, each sufficiently simple to be solved by equality saturation.
We first present the algorithm for implementing guided equality saturation and demonstrate that it is capable of performing the three-dimensional loop tiling for which unguided equality saturation fails. Then we discuss how to use terms as guides and introduce sketch guides that simplify the specification of guides.

For each equality saturation, an e-graph is initialized from the previous term (Listing 1 line 3). As long as the next goal is not reached (while loop in Listing 1 line 4), rewrites are applied to grow the e-graph as usual. Once the goal has been satisfied, a term satisfying the goal is extracted from the e-graph (Listing 1 line 17) to start the next equality saturation aiming to reach the next goal.

Guided Equality Saturation Algorithm
Listing 1 shows the guided equality saturation algorithm as pseudocode in the style of the egg paper [Willsey et al. 2021]. A sequence of equality saturations is performed (line 2, Figure 6). Each equality saturation initializes a new e-graph (line 3) and performs standard equality saturation iterations as described by Willsey et al. [2021] (lines 7-14). Note that each equality saturation can use a different set of rewrites. We will explore the practical implications of choosing different sets of rewrites for different equality saturations as part of our case studies.
After each equality saturation iteration, the algorithm checks whether the intermediate goal has been reached (line 4). If so, a term satisfying the goal is extracted (lines 16-17) and assigned to the term variable. This term will be used to initialize a new e-graph for the next equality saturation.
An equality saturation step may fail to find an intermediate goal, and if so the process has failed. Section 7 will discuss how users can deal with search failures.
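The control flow described above can be mimicked on a toy problem. The following Python sketch is not Listing 1: it replaces the e-graph by a plain set of strings, uses a single hypothetical rewrite (swapping two adjacent characters), models goals as predicates, and uses `max_terms` as a stand-in for a memory limit. It only illustrates how a sequence of goals splits one infeasible search into several feasible ones.

```python
def swaps(s):
    """All strings reachable from s by swapping one pair of adjacent characters."""
    return {s[:i] + s[i + 1] + s[i] + s[i + 2:] for i in range(len(s) - 1)}

def guided_search(start, goals, max_terms):
    """Run one breadth-first search per goal, restarting from the extracted term.

    Returns the term satisfying the final goal, or None if some search
    saturates or exceeds the max_terms budget before reaching its goal.
    """
    term = start
    for goal in goals:
        seen = {term}        # fresh search space for every goal
        frontier = {term}
        matches = [t for t in seen if goal(t)]
        while not matches:
            frontier = {u for t in frontier for u in swaps(t)} - seen
            seen |= frontier
            if not frontier or len(seen) > max_terms:
                return None  # saturated without success, or out of budget
            matches = [t for t in seen if goal(t)]
        term = min(matches)  # "extraction": pick one term satisfying the goal
    return term

final = lambda t: t == "fedcba"
guides = [lambda t: t.startswith("f"),
          lambda t: t.startswith("fe"),
          lambda t: t.startswith("fed")]

unguided = guided_search("abcdef", [final], max_terms=200)
guided = guided_search("abcdef", guides + [final], max_terms=200)
print(unguided, guided)
```

With a budget of 200 terms, the unguided search gives up before reaching the reversed string, while the guided run succeeds because each of its four searches stays small. The guides are imprecise on purpose: they only constrain a prefix, and the search instantiates the rest.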
Scaling Beyond Equality Saturation. Figure 7 shows the three-dimensional loop tiling example from Section 2.3.1. On the left, we see how unguided equality saturation runs out of memory quickly, while on the right we see how a single intermediate goal as a guide makes the rewrite feasible. Two equality saturations are now performed. The first finishes after 6 iterations reaching the intermediate goal, before the second equality saturation performs 7 further iterations growing a new e-graph. We will discuss guides shortly, and show the sketch guide used here in Section 3.3.

Terms as Goals
Goals, final or intermediate, may be specified either as concrete terms or as imprecise sketches. We first discuss uses of concrete terms, and postpone the discussion of sketches to Section 3.3. When using a concrete term as a goal, the e-graph is initialized (Listing 1 line 3) with the start term and the goal term. This allows rewrites to be applied to both the start term and the goal term when searching for a rewrite sequence that establishes their equivalence. Checking that the goal is reached is straightforward, as we simply check if the goal and start term are in the same e-class.
Using terms as goals is useful in many situations, such as in theorem proving. As an example, let us analyze the process of equational reasoning about mathematical identities through an algorithmic lens. When reasoning about a mathematical term with a piece of paper, it is common to write a proof sketch as a sequence of concrete terms corresponding to a sequence of key reasoning steps. Mathematicians do this to incrementally explore potential proofs and record different attempts at reaching a goal term from an initial term, until successful. On paper, we usually skip simple intermediate steps and only document the key reasoning for non-obvious steps by annotating them with the theorem used. A good proof sketch should be enough to guide an informed reader so that they can fill in the reasoning gaps. Section 4 shows that using terms as intermediate goals replicates this form of reasoning closely by providing key reasoning steps as intermediate goals, and automates the detailed and routine reasoning between them.

Sketches as Goals
For some applications, specifying a concrete intermediate term is tedious, as terms may be large and even a minor syntactic mistake results in the term not being found. Let us again consider the three-dimensional loop-tiling example. In Section 3.1 we showed that a single intermediate goal specified as a guide is sufficient to perform the optimization where unguided equality saturation fails. But how complex is it to write the guide?
An experienced performance engineer can easily come up with the intuition that performing loop-tiling can be broken into two steps: first, splitting the loops, and then reordering them. So specifying a guide where the three nested loops are split into six makes sense, but specifying the concrete term is overly tedious. The engineer must not just specify the six nested loops (represented as maps), but also the precise transformations of the input and output data. To greatly simplify the specification of guides for such applications, we introduce sketches.
We provide an open-source implementation of sketches on top of the egg library, which includes this loop tiling example (https://github.com/Bastacyclop/egg-sketches).
Writing the same guide as a sketch is much simpler, and focuses on the key intuition that the desired term must contain six nested loops.

Informal program snippets are often used by performance engineers to visualize and explain program optimizations. This can be observed in many papers that use rewriting strategies or schedules to specify optimizations, such as [Adams et al. 2019; Anderson et al. 2021; Chen et al. 2018; Ikarashi et al. 2021; Koehler and Steuwer 2021; Ragan-Kelley et al. 2013; Sioutas et al. 2020]. Our sketches can be seen as a formalization of these informal program snippets.
When using a sketch as a goal, the e-graph is initialized only with the starting term, as we cannot add a sketch to the e-graph. As we will see next, sketches are specified as logical predicates. To check if the sketch goal has been reached, we perform an e-class analysis to efficiently check if there are terms in the e-graph that satisfy the sketch predicate. To obtain a term for the next equality saturation we perform a sketch-satisfying extraction that is described in Section 3.3.3. But, first, we formally define sketches.

Defining Sketches.
Sketches are specified in a SketchBasic language with just four constructors. The syntax of SketchBasic and the set of terms that the constructors represent are defined in Figure 8. A sketch $s$ represents a set of terms $R[\![s]\!]$, such that $R[\![s]\!] \subset T$ where $T$ denotes all terms in the language we rewrite. We say that any $t \in R[\![s]\!]$ satisfies sketch $s$.
The ? sketch is the least precise, representing all terms in the language. The $F(s_1, .., s_n)$ sketch represents all terms that match a specific $n$-ary function symbol $F$ from the term language, and whose children satisfy sketches $s_i$. The contains($s$) sketch represents all terms containing a term that satisfies sketch $s$: the least solution to the recursive $R[\![\mathit{contains}(s)]\!]$ equation. Finally, the $s_1 \lor s_2$ sketch represents terms satisfying either $s_1$ or $s_2$.
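The four constructors can be read as a predicate on terms. The following Python sketch makes that reading executable under stated assumptions: the class names `AnyTerm`, `Node`, `Contains` and `Or` and the function `satisfies` are hypothetical, terms are nested tuples `(symbol, child, ...)`, and a term is assumed to contain itself, so contains(s) also holds for a term that directly satisfies s.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class AnyTerm:            # the ? sketch: every term
    pass

@dataclass(frozen=True)
class Node:               # F(s1, .., sn): symbol F with children satisfying s1..sn
    op: str
    args: Tuple = ()

@dataclass(frozen=True)
class Contains:           # contains(s): some sub-term satisfies s
    inner: object

@dataclass(frozen=True)
class Or:                 # s1 ∨ s2
    left: object
    right: object

def satisfies(term, sketch):
    """Decide whether a term (symbol, child, ...) is in the set a sketch denotes."""
    op, children = term[0], term[1:]
    if isinstance(sketch, AnyTerm):
        return True
    if isinstance(sketch, Node):
        return (sketch.op == op and len(sketch.args) == len(children)
                and all(satisfies(c, s) for c, s in zip(children, sketch.args)))
    if isinstance(sketch, Contains):
        return (satisfies(term, sketch.inner)
                or any(satisfies(c, sketch) for c in children))
    if isinstance(sketch, Or):
        return satisfies(term, sketch.left) or satisfies(term, sketch.right)
    raise TypeError(f"unknown sketch: {sketch!r}")

def nested_maps(depth, body):
    """map (map (... body)) with the given nesting depth."""
    for _ in range(depth):
        body = ("map", body)
    return body

def contains_nested_maps(depth):
    """Sketch: contains(map(contains(map(... ?)))) with the given depth."""
    sketch = AnyTerm()
    for _ in range(depth):
        sketch = Contains(Node("map", (sketch,)))
    return sketch

six_loops = contains_nested_maps(6)
print(satisfies(nested_maps(6, ("f",)), six_loops))  # six nested maps are present
print(satisfies(nested_maps(5, ("f",)), six_loops))  # only five nested maps
```

The `six_loops` sketch mirrors the intuition of a guide that only demands six nested loops and leaves everything else open.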

Sketch Precision.
Writing a useful sketch to guide an optimization search requires striking a balance between being too precise and too vague. An overly precise sketch may exclude valid terms with a slightly different structure. An overly vague sketch may lead to finding undesirable terms. This balance also interacts with the set of rewrite rules used, since the terms that may be found by the search are $R[\![s]\!] \cap E[\![t]\!]$, where $E[\![t]\!]$ represents the set of terms that can be discovered to be equivalent to the initial term $t$ according to the given rewrite rules. This means that using a more restricted set of rules generally enables specifying less precise sketches.

Sketch-Satisfying Extraction.
As a sketch may be satisfied by many terms, we require a cost function to extract a single term from the e-graph that satisfies the sketch. The extracted term is used as the starting point for the next equality saturation search. More formally, to extract the best term that satisfies a sketch $s$ from an e-class $e$ of an e-graph $G$, we define a helper function ex($G$, $c$, $s$, $e$), where $c$ is a cost function that must be monotonic and local (Section 2.2.3). The function ex returns the best term together with its cost and, as extraction may fail, the return value is optional. Thus, the return type is Option[($K$, Term)]. For efficiency, ex is memoized. Then, extract uses ex to return a term if possible, failing otherwise. ex is recursively defined over the four SketchBasic cases.
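One possible reading of such a recursive definition is sketched below in Python. It is an assumption-laden illustration and not the paper's definition of ex: the cost is fixed to term size, memoization is replaced by a fixed-point table indexed by (sketch, e-class), a term is assumed to contain itself, and all names (`AnyTerm`, `Node`, `Contains`, `Or`, `extract`) are hypothetical. The e-graph is hand-built and holds a program (A) and its fused form (B) in one root e-class.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class AnyTerm: pass                     # ?
@dataclass(frozen=True)
class Node:                             # F(s1, .., sn)
    op: str
    args: Tuple = ()
@dataclass(frozen=True)
class Contains:                         # contains(s)
    inner: object
@dataclass(frozen=True)
class Or:                               # s1 ∨ s2
    left: object
    right: object

def subsketches(s):
    yield s
    parts = {Node: lambda: s.args, Contains: lambda: (s.inner,),
             Or: lambda: (s.left, s.right)}.get(type(s), lambda: ())()
    for p in parts:
        yield from subsketches(p)

def extract(egraph, sketch, root):
    """Smallest (size, term) in e-class root satisfying sketch, or None."""
    sketches = set(subsketches(sketch)) | {AnyTerm()}
    best = {}                           # (sketch, e-class) -> (size, term)

    def build(op, picks):
        if any(p is None for p in picks):
            return None
        return (1 + sum(k for k, _ in picks), (op, *(t for _, t in picks)))

    def candidates(s, cls):
        if isinstance(s, Or):
            yield best.get((s.left, cls))
            yield best.get((s.right, cls))
        if isinstance(s, Contains):
            yield best.get((s.inner, cls))          # the term itself satisfies s
        for op, children in egraph[cls]:
            if isinstance(s, AnyTerm):
                yield build(op, [best.get((s, c)) for c in children])
            elif isinstance(s, Node) and s.op == op and len(s.args) == len(children):
                yield build(op, [best.get((a, c)) for a, c in zip(s.args, children)])
            elif isinstance(s, Contains):           # one child contains a match
                for i in range(len(children)):
                    yield build(op, [best.get((s if j == i else AnyTerm(), c))
                                     for j, c in enumerate(children)])

    changed = True
    while changed:                      # iterate until no entry improves
        changed = False
        for s in sketches:
            for cls in egraph:
                for cand in candidates(s, cls):
                    key = (s, cls)
                    if cand is not None and (key not in best or cand[0] < best[key][0]):
                        best[key] = cand
                        changed = True
    return best.get((sketch, root))

EGRAPH = {
    "f": [("f", ())], "g": [("g", ())], "t": [("transpose", ())],
    "mf": [("map", ("f",))], "mmf": [("map", ("mf",))],
    "mg": [("map", ("g",))], "mmg": [("map", ("mg",))],
    "tmmg": [("comp", ("t", "mmg"))], "fg": [("comp", ("f", "g"))],
    "mfg": [("map", ("fg",))], "mmfg": [("map", ("mfg",))],
    "root": [("comp", ("mmf", "tmmg")), ("comp", ("mmfg", "t"))],
}
keeps_transpose_first = Contains(Node("comp", (Node("transpose"), AnyTerm())))
print(extract(EGRAPH, AnyTerm(), "root"))
print(extract(EGRAPH, keeps_transpose_first, "root"))
```

The two calls show the effect of a sketch on extraction: without constraints the size-7 fused program is chosen, whereas a sketch demanding a sub-term of the form transpose ∘ ? forces the size-9 program, the only one in this e-graph that satisfies it.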
The lowest cost alternative is selected.

The next two sections present two case studies applying guided equality saturation to theorem proving (Section 4) and program optimization (Section 5). Each provides distinct challenges for equality saturation and demonstrates how guidance provides a practical way to tackle them.

CASE STUDY: THEOREM PROVING
Machine-checked theorem proving has seen several milestone achievements in mathematics, like the Kepler conjecture [Hales et al. 2017] or the four-color theorem [Gonthier et al. 2008]. These, however, have been more sparse and not always at the cutting edge of research in mathematics. Of central importance to these machine-checked proofs are interactive theorem provers (ITPs). These provide an interactive environment that can guide a user step-by-step in constructing a proof. In particular, the Lean theorem prover [de Moura and Ullrich 2021] has garnered much attention from the mathematics community. Recent applications have reached fields like number theory [Buzzard et al. 2020], addressing research-level questions in that space. Hence we use Lean 4 for this study. ITPs like Coq or Isabelle also have a long history of successful uses in computer science and verification, and could have been used in our study.
E-graphs and congruence closure have been used in ITPs with considerable success. The key idea is that an e-graph can be used to keep metadata about how it was constructed, enabling the construction of a proof witness: a sequence of rewrites that prove equivalence by transitivity [Corbineau 2006; Nieuwenhuis and Oliveras 2005; Selsam and de Moura 2016]. Many modern provers like Coq, Isabelle or Lean also feature rich sets of so-called tactics, meta-programs that enable partial proof automation by operating on the intermediate proof state. Both Coq and Isabelle have tactics that use congruence closure, as do older versions of Lean.

Equational Reasoning in Lean
Consider the Lean 4 proof of the example from Section 1, written with the rw tactic. The rw tactic allows us to write the proof as a series of rewrite steps, which are sequentially applied to the current goal. The ← character denotes that the rewrite should be applied from right to left, and arguments like g⁻¹⁻¹ or g instantiate universally-quantified theorems to concrete rewrites. The rewrites are specified by names, while the intermediate states are implicit. However, most ITPs provide tactics to make this calculation more explicit; in Lean this tactic is called calc. With it, the proof looks more like in the introduction:

) * g := by rw [mul_assoc]
_ = 1 * g := by rw [inv_mul_self]
_ = g := by rw [one_mul]

Note that in this case, the correct rewrite arguments are inferred through unification [Dowek 2001]. However, in calc, steps have to be given individually, which usually becomes tedious and error-prone. We would prefer to write a proof sketch like the one outlined in Section 1. This section describes a prototype Lean 4 tactic based on egg [Willsey et al. 2021], which enables guided equality saturation in Lean 4 and allows writing proofs in the style of such a sketch.
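To make the contrast between the two styles concrete, here is a self-contained Lean 4 sketch. It is not the paper's listing: the identity being proven is chosen for illustration, and the group axioms are passed as explicit hypotheses (with hypothetical names mirroring the lemma names above) so that no library is required.

```lean
-- Rewrite-list style: the intermediate goals stay implicit.
example {G : Type} (mul : G → G → G) (inv : G → G) (one : G)
    (mul_assoc : ∀ a b c : G, mul (mul a b) c = mul a (mul b c))
    (inv_mul_self : ∀ a : G, mul (inv a) a = one)
    (one_mul : ∀ a : G, mul one a = a)
    (g : G) : mul (inv g) (mul g g) = g := by
  rw [← mul_assoc, inv_mul_self, one_mul]

-- calc style: every intermediate term is written out explicitly.
example {G : Type} (mul : G → G → G) (inv : G → G) (one : G)
    (mul_assoc : ∀ a b c : G, mul (mul a b) c = mul a (mul b c))
    (inv_mul_self : ∀ a : G, mul (inv a) a = one)
    (one_mul : ∀ a : G, mul one a = a)
    (g : G) : mul (inv g) (mul g g) = g :=
  calc mul (inv g) (mul g g)
    _ = mul (mul (inv g) g) g := by rw [mul_assoc]
    _ = mul one g := by rw [inv_mul_self]
    _ = g := by rw [one_mul]
```

In the first proof the direction marker ← is needed because associativity is used from right to left; in the calc version each step states its target term, so rw can rewrite the right-hand side and close the step by reflexivity.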

Implementing Guided Equality Saturation in Lean 4
To integrate this method into a theorem prover like Lean 4 [de Moura and Ullrich 2021], we use tactic metaprogramming to build the rewrites and the e-graph, saturate it, and extract a proof out of it. This section outlines the main parts of the implementation. Lean 4 has a foreign-function interface that allows calling libraries like egg, but our prototype uses IO and operating system pipes instead.
The implementation first gathers the equalities passed to the tactic, as well as the equality we want to prove, l = r, and concretely its left- and right-hand sides, l and r. These are used to instantiate an e-graph using egg representing the two terms l and r, and the given rewrites. This implements the algorithm with terms as guides, as described in Section 3.2. As soon as the equivalence classes of the left-hand side and right-hand side get merged, the procedure stops. In this case, we learn that l and r belong to the same equivalence class, and we return the sequence of rewrites that witness l = r.
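The control flow of this procedure (search until both sides meet, then return the witnessing rewrites, optionally passing through guide terms) can be sketched in Python. Note the substitution: this sketch uses plain breadth-first term rewriting as a stand-in for e-graph saturation, so it shares no data structures with egg. The term encoding, rule names and helper functions are all hypothetical.

```python
from collections import deque

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def match(pat, term, env):
    """Match a pattern against a term, extending env; return None on failure."""
    if is_var(pat):
        if pat in env:
            return env if env[pat] == term else None
        return {**env, pat: term}
    if isinstance(pat, str) or isinstance(term, str):
        return env if pat == term else None
    if len(pat) != len(term) or pat[0] != term[0]:
        return None
    for p, t in zip(pat[1:], term[1:]):
        env = match(p, t, env)
        if env is None:
            return None
    return env

def subst(pat, env):
    """Instantiate the pattern variables of pat using env."""
    if is_var(pat):
        return env[pat]
    if isinstance(pat, str):
        return pat
    return (pat[0],) + tuple(subst(p, env) for p in pat[1:])

def rewrites(term, rules):
    """Yield (rule name, new term) for every single rewrite anywhere in term."""
    for name, lhs, rhs in rules:
        env = match(lhs, term, {})
        if env is not None:
            yield name, subst(rhs, env)
    if not isinstance(term, str):
        for i in range(1, len(term)):
            for name, new in rewrites(term[i], rules):
                yield name, term[:i] + (new,) + term[i + 1:]

def search(start, goal, rules, limit=10_000):
    """Breadth-first search; returns the witness as a list of (rule, term) steps."""
    parent = {start: None}
    queue = deque([start])
    while queue and len(parent) <= limit:
        term = queue.popleft()
        if term == goal:
            steps = []
            while parent[term] is not None:
                name, prev = parent[term]
                steps.append((name, term))
                term = prev
            return steps[::-1]
        for name, new in rewrites(term, rules):
            if new not in parent:
                parent[new] = (name, term)
                queue.append(new)
    return None

def guided(start, guides, goal, rules):
    """Chain searches start -> guide_1 -> ... -> goal and concatenate the witnesses."""
    proof, current = [], start
    for target in [*guides, goal]:
        steps = search(current, target, rules)
        if steps is None:
            return None
        proof += steps
        current = target
    return proof

RULES = [
    ("mul_assoc_rev", ("*", "?a", ("*", "?b", "?c")), ("*", ("*", "?a", "?b"), "?c")),
    ("mul_assoc", ("*", ("*", "?a", "?b"), "?c"), ("*", "?a", ("*", "?b", "?c"))),
    ("inv_mul_self", ("*", ("inv", "?a"), "?a"), "1"),
    ("one_mul", ("*", "1", "?a"), "?a"),
]
```

Calling `search(("*", ("inv", "g"), ("*", "g", "g")), "g", RULES)` returns a three-step witness, and `guided` with the guide `("*", "1", "g")` splits the same task into two smaller searches whose witnesses are concatenated.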
In fact, rather than directly sending terms such as l and r, we send their internal encoding in Lean's Expr datatype. The Expr datatype represents terms in the Lean kernel and has constructs for constants, free variables, lambda abstractions, let-bindings, and function applications. These are serialized, sent to egg, and deserialized into the Expr datatype in Lean.
Rewrites are usually universally quantified. For example, associativity is expressed as forall a b c : a * (b * c) = (a * b) * c, i.e. universally quantified over the variables a, b, c. When looking through the passed arguments for terms with equalities, we first consider a term that begins with a universal quantifier. Since Lean uses a locally nameless approach [Charguéraud 2012], we instantiate these quantifiers, which converts the quantified variables into free variables in an extended context. This exposes the bare equality that was behind the universal quantification. We keep track of which free variables stand for universally quantified variables that have been instantiated in this way. Then, when performing the encoding into egg, we communicate that these variables are, in fact, metavariables.
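As a rough illustration of the last step, the following Python sketch renders an equality as a pair of s-expression patterns in which the tracked (formerly quantified) variables are marked with a leading `?`, the notation egg uses for pattern variables. The nested-tuple term encoding and the function names are assumptions; the tactic's actual serialization of Expr is not shown in the text.

```python
def to_pattern(term, quantified):
    """Render a nested-tuple term as an s-expression; quantified names become ?vars."""
    if isinstance(term, str):
        return f"?{term}" if term in quantified else term
    head, *args = term
    return "(" + " ".join([head, *(to_pattern(a, quantified) for a in args)]) + ")"

def to_rewrite(quantified, lhs, rhs):
    """Turn `forall quantified, lhs = rhs` into a (lhs pattern, rhs pattern) pair."""
    return to_pattern(lhs, quantified), to_pattern(rhs, quantified)

# forall a b c : a * (b * c) = (a * b) * c
ASSOC = to_rewrite(
    {"a", "b", "c"},
    ("*", "a", ("*", "b", "c")),
    ("*", ("*", "a", "b"), "c"),
)
```

Variables that are not in the quantified set, such as a fixed group element g from the local context, are left as plain constants and can therefore only match themselves.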
Finally, if egg finds a series of rewrites that relate both sides of the equation, we can use a reconstruction algorithm to obtain a witness of the equality from the e-graph [Flatt et al. 2022; Nieuwenhuis and Oliveras 2005]. Once we have constructed the series of rewrites with the corresponding instances of the applications, we can apply them to our goal within the Lean tactic. Crucially, Lean will then type check this and ensure the series of rewrites is sound. Hence, neither our tactic nor egg is part of the Trusted Computing Base (TCB), as a bug in either will only result in a failed tactic application, and Lean will report that the goal was not proven. In practice, this also means that our tactic works as a proof checker for egg by translation validation [Pnueli et al. 1998]: our tactic translates the egg data structures into a proof that is verified by the Lean kernel.

Failing Gracefully with Simplification.
If the equality saturation procedure fails to find an equivalence between the left- and right-hand sides, all is not lost. For each side, we use the extraction mechanism to select a term with the smallest (AST) size in the corresponding equivalence class, along with the proof of equivalence between the smaller terms and their parents. This allows us to simplify the goal, potentially making progress towards a proof. In fact, it is not even necessary to have a goal of the form l = r. We can apply our technique to simplify a single term, much like the simplification tactics simp and autorewrite of Lean and Coq do, with all the additional rewrite capabilities from equality saturation. These simplification results may, for example, be used as guides in a guided equality saturation, to then continue the manual part of the search from there.
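The fallback can be pictured with a few lines of Python. This is a minimal sketch under strong assumptions: each equivalence class is given as an explicit set of nested-tuple terms (a real e-graph stores them compactly), and the function names are hypothetical.

```python
def ast_size(term):
    """Number of nodes in a nested-tuple term; atoms are strings."""
    return 1 if isinstance(term, str) else 1 + sum(ast_size(a) for a in term[1:])

def smallest(eclass_terms):
    """Pick a smallest representative; ties are broken by repr for determinism."""
    return min(eclass_terms, key=lambda t: (ast_size(t), repr(t)))

def simplify_goal(lhs_class, rhs_class):
    """When l = r could not be proven, return the simplified goal l' = r'."""
    return smallest(lhs_class), smallest(rhs_class)
```

If saturation found that the left-hand side is equivalent to a smaller term but never reached the right-hand side, `simplify_goal` still returns that smaller term, which can then serve as the new goal or as a guide.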

Limitations.
The success of rewrite-based tactics like ours, or the greedy simp, hinges largely on being able to find rewrites that witness the equality of the left-hand and right-hand sides of the equation we are trying to prove. In fact, tactics like simp or Coq's autorewrite build entire tagging systems just to tag rewrites that might be relevant for them to use in simplification, instead of explicitly passing them. It is possible to go one step further and try to find a matching rewrite among the available theorems; Sledgehammer in Isabelle does something similar, for example [Blanchette et al. 2011; Böhme and Nipkow 2010]. Our prototype implementation takes the rewrites as explicit arguments, and extending it with these kinds of systems is an orthogonal task.
Another limitation of our tactic is that it only supports the fragment of Lean 4 terms characterized by the following syntax:

e ::= named-constant | app e e | free-variable    (type of terms we support)
eq ::= Eq e e | forall (x : α), eq

Here, app denotes function application and Eq denotes the equality type (with a single constructor rfl a a for all a : α). This encoding is parametric over α, but only one type can be instantiated at a time. The types of the terms, in particular, are also part of our encoding, as they are in Lean.
We currently do not add types to the metavariables to guard the rewrites. This cannot compromise soundness, as Lean's type checker rejects a rewrite that is applied to a term of the wrong type, and nothing fundamentally prevents us from encoding our terms as typed expressions to avoid this issue in the future. Similarly, we do not rewrite under lambda expressions, nor support generalized rewriting. All of these are future work, but orthogonal to demonstrating guided equality saturation. There is in fact a work-in-progress port of the Lean 3 congruence closure tactic to Lean 4, independent of this work, that will address many of these issues.

Evaluating Guided Equality Saturation for Proving Theorems in Lean 4
To evaluate the tactic, we consider examples of increasing complexity.

Proving the Knuth-Bendix Lemmas for Groups.
We first consider the example from group theory discussed in Section 1, as well as the rest of the additional lemmas from the Knuth-Bendix completion for groups. These are simple to evaluate, not requiring a large buildup of theory, yet useful as they provide a decision procedure for the word problem in (free) groups. Table 1 shows the comparison of different methods available in Lean to prove these lemmas with rewriting. We compare with simp and with aesop, the state of the art in proof automation in Lean [Limperg and From 2023]. simp uses greedy rewriting; it does not follow an explicit cost function, but applies rewrites tagged as 'simp' until no further rewrite applies. aesop uses a tree search to find tactic sequences that prove a goal. aesop relies on simp for the equational rewriting and does not expose rewrite choices and locations through its search tree. As a result, both tactics can prove only two of the four lemmas. Unguided equality saturation can apply rewrites in both directions and thus proves an additional one, one_inv. Both guided equality saturation and manual rewriting prove all lemmas, as they are provided with human insights. The main difference, however, is that in guided equality saturation only the key insights have to be provided, as motivated in Section 1. For example, we can prove inv_inv with a single guide, rather than 5 manual rewrite steps. We also compare to Isabelle's Sledgehammer [Blanchette et al. 2013; Böhme and Nipkow 2010], which also closes all lemmas. This is not very surprising, since Sledgehammer is much more powerful, combining multiple semi-decision procedures and including congruence closure as one of them.

Larger Use Case: "Freshman's Dream".
The previous examples are simple and require only a handful of rewrites each. To increase the complexity of the reasoning, without invoking esoteric mathematics, we consider examples from ring theory, where there are two operations (addition and multiplication) and thus significantly more rewrites available. For example, consider the theorem (x + y)^2 = x^2 + y^2, sometimes dubbed "freshman's dream", which holds in a commutative ring with characteristic 2. The proof using our tactic looks like this: [...] et al. 2013] are able to reconstruct a proof. The code for these two versions, as well as the ones for groups, can be found in the supplementary material. Figure 9a compares the runtime, and the number of steps required, for the manual version of rewrites, the guided equality saturation version above, and an unguided equality saturation version. We report the median runtimes of 10 executions, on an AMD Ryzen 9 3900X 12-Core machine with 132 GB of memory. The runtimes include the startup time of Lean, parsing, and the serialization and deserialization times when communicating with the egg-based library. As a baseline to measure the overhead, we also compare the manual version with a guided equality saturation version that uses all 15 steps from the manual proof as guides. The manual version took 0.534 ± 0.019 s, while this baseline version using all guides with equality saturation took 0.540 ± 0.015 s: barely slower, and within a standard deviation of each other. For reference, the textbook-guided version took 0.534 ± 0.031 s, basically identical to the manual version. We see that the overhead, even with this prototype implementation, is negligible for these cases. Unguided equality saturation, on the other hand, takes about 4 minutes to prove this equality. This shows that while unguided equality saturation is powerful in principle, it is much less useful in practice, as it fails to prove simple theorems in a short period of time.
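For reference, the informal derivation behind this theorem is short. The following worked equation is a standard textbook calculation in a commutative ring of characteristic 2, not the guide sequence used in the evaluation:

```latex
\begin{align*}
(x + y)^2 &= (x + y)(x + y) \\
          &= x^2 + xy + yx + y^2 && \text{(distributivity)} \\
          &= x^2 + 2xy + y^2     && \text{(commutativity)} \\
          &= x^2 + y^2           && \text{(characteristic 2, so } 2xy = 0\text{)}
\end{align*}
```

Each line hides several applications of associativity, commutativity and distributivity, which is where the 15 individual rewrite steps of the manual proof come from.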

4.3.3 Proving (x + y)^3 = x^3 + x * y^2 + x^2 * y + y^3 for Rings.
We can scale this up by considering the next power of the binomial, (x + y)^3 = x^3 + x * y^2 + x^2 * y + y^3, in characteristic 2. Adding two more guides, for multiplying out the remaining (x + y), we still stay well below a second; in contrast, the unguided version takes over 20 minutes. For techniques like these to be useful in practice in a proof assistant, they need to be responsive and finish within seconds, at most.
To evaluate the effect of the guides, we take this proof sketch and remove the 5 intermediate guides, testing all 32 possible combinations of them. Figure 9b shows the results, where multiple points exist for each number of guides k, corresponding to the different ways of selecting k out of 5 guides. We see that, as expected, more guides generally mean less time for finding a proof. Crucially, the right step can significantly speed everything up. This theorem can be proven in less than a second with a single well-chosen guide, but choosing a bad guide can mean almost no speedup at all. This corresponds to our intuition: some steps are easier to follow than others.
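The 32 configurations are simply all subsets of the 5 guides. A minimal Python sketch of the enumeration (the guide names are placeholders, and running the prover on each subset is not shown):

```python
from itertools import combinations

GUIDES = ["guide1", "guide2", "guide3", "guide4", "guide5"]  # placeholder names

def guide_subsets(guides):
    """All ways of keeping k of the guides, for k = 0 .. len(guides), in order."""
    return [
        subset
        for k in range(len(guides) + 1)
        for subset in combinations(guides, k)
    ]

SUBSETS = guide_subsets(GUIDES)
POINTS_PER_K = [sum(1 for s in SUBSETS if len(s) == k) for k in range(6)]
```

`POINTS_PER_K` gives the number of points plotted for each number of guides: one configuration with no guides (the unguided search), five with a single guide, and so on.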
In summary, guided equality saturation allows us to write proof sketches that look like a textbook's, skipping steps and leaving out details, yet still reconstructing a full formal proof quickly.

Comparison to an Actual Textbook.
We now compare against an actual textbook containing this result, where it is proven as a special case of a more general theorem, using the binomial theorem [Rotman 2006]. Figure 10 shows the comparison between the Lean code and the textbook proof. We see that while the syntax is not as polished, it has essentially the same information as the textbook. Unfortunately, guided equality saturation does not complete this sketch if just given the ring axioms, like in the examples above. However, that is also not how we reconstruct the proof from our intuition: consider the second step, which rewrites (1/(n − k + 1) + 1/k) to ((k + n − k + 1)/(k(n − k + 1))). When reconstructing this step, we are implicitly using a lemma that tells us how to add fractions: a/b + c/d = (ad + cb)/(bd). This lemma does not hold for all a, b, c, d, though: the textbook also uses the fact that n − k + 1 ≠ 0, which follows non-trivially from k ≤ n in the integers, and the canonical ring homomorphism from them. If we give a handful of additional lemmas like this to our guided equality saturation tactic, with the preconditions satisfied, it can reconstruct the full equational proof from this textbook as well. That is to say, if we manually prove that n − k + 1 > 0 and then instantiate 1/(n − k + 1) + 1/k = (k + n − k + 1)/(k * (n − k + 1)) explicitly for these concrete n and k, then we can reason about the equality using this manually-proven rewrite. However, our tactic cannot reason about these non-trivial preconditions yet. In future work we can explore the use of guarded rewrites (cf. [Willsey et al. 2021]) to achieve this.
Guided equality saturation thus allows us to prove theorems with a semi-automatic approach that we could not prove with unguided equality saturation. For many cases where the full equality saturation does work, a few well-chosen guides can take a proof from several minutes down to under a second. Overall, this allows us to match the intuition often found in textbooks, where proof sketches skip steps and omit details, by reconstructing the full formal proof using these proof sketches as guides, at least when these work with purely equational reasoning.

CASE STUDY: PROGRAM OPTIMIZATION
Equality saturation has found many applications in program optimization [Smith et al. 2021; Tate et al. 2009; VanHattum et al. 2021; Wang et al. 2020; Yang et al. 2021]. Many applications tightly couple language and rewrite rule design with the equality saturation technique to mitigate e-graph growth, and often deliver impressive results.
This case study explores the limits of equality saturation for optimizing parallel linear algebra code in the RISE functional language, and how to overcome those limitations. First, we introduce RISE in Section 5.1. Then, in Section 5.2, we describe how to encode the full language including variable bindings, which are often avoided for equality saturation. Finally, in Section 5.3, we optimize matrix multiplication as a case study, performing seven complex program optimizations that required tens of thousands of manual rewrite steps in [Hagedorn et al. 2020], beyond the scope of what prior applications of equality saturation have attempted.

Rewriting the RISE Functional Array Language
RISE [Hagedorn et al. 2020] is a functional array programming language. It is a spiritual successor of Lift [Steuwer et al. 2015, 2017], which demonstrated performance portability across hardware by automatically applying semantics-preserving rewrite rules to optimize programs from various domains, including scientific code [Hagedorn et al. 2018] and convolutions [Mogers et al. 2020].

5.1.1 The RISE Language.
RISE is well suited for rewrite-based optimizations as a functional, side-effect-free language. It is based on typed lambda calculus and thus provides lambda abstraction (λx. b), function application (f x), identifiers and literals. RISE expresses data-parallel computations as compositions of high-level computational patterns over dense multidimensional arrays (a.k.a. tensors). map applies a function to each element of an array. reduce combines all elements of an array to a single value given a binary reduction operator. split, join, transpose, zip, unzip and slide reshape arrays in various ways.
High-level programs, such as the matrix multiplication in the top left of Figure 11, specify computations without committing to a particular implementation. Implementation choices are explicitly encoded in RISE programs by applying rewrite rules that introduce low-level patterns that directly correspond to a particular implementation. For example, reduceSeq is a sequential reduction, and multiple low-level map-like patterns correspond to different sequential and parallel implementations. After the rewriting of a RISE program is complete, it is translated to low-level imperative code such as C or OpenCL for execution. The RISE program at the bottom of Figure 11 shows an optimized version of matrix multiplication. A common loop blocking optimization, which improves data locality and hence memory usage, has been introduced by rewriting.

Rewriting RISE Programs.
RISE is complemented by a second language, Elevate [Hagedorn et al. 2020], that allows programmers to manually describe complex optimizations as compositions of rewrite rules, called rewriting strategies. The performance of the code generated by RISE and Elevate is comparable with state-of-the-art compilers, e.g. with the TVM deep learning compiler [Chen et al. 2018] for matrix multiplication [Hagedorn et al. 2020], and with the Halide image processing compiler [Ragan-Kelley et al. 2012] for the Harris corner detection [Koehler and Steuwer 2021].

Limitations of Manual Rewriting with Strategies.
Although Elevate enables the development of abstractions that help with conciseness, strategies remain challenging to write. Fundamentally, Elevate delegates the problem of ordering thousands of rewrites to the human expert. Hagedorn et al. [2020] and Koehler and Steuwer [2021] estimate spending between two and five person-weeks developing the Elevate strategies for their matrix multiplication and image processing case studies. This case study explores how to reduce the manual effort required to perform these optimizations by using guided equality saturation. No good heuristic to make local rewriting decisions is known for RISE, which is why we do not explore greedy rewriting and its variations [Kourta et al. 2022].

Rewriting RISE with Equality Saturation
Rewriting RISE with equality saturation requires encoding RISE terms and rewrite rules in an equality saturation framework like egg. This poses challenges as RISE is based on the lambda calculus, and its rewrite rules involve name bindings, substitutions, and freshness predicates.
Figure 12 shows the familiar fusion and fission rules, as well as the standard η- and β-reduction rules of lambda calculus. In contrast to the rules of Section 2.1, name bindings are used instead of function composition. Dealing with name bindings in equality saturation is an open challenge [Willsey et al. 2021]. Lambda calculus programs with name bindings can be encoded as combinatory logic terms, but the translation results in a term of size O(n^3) in the worst case [Lachowski 2018]. The language and rewrite rules could be redesigned to avoid name bindings, as in Smith et al. [2021], but here we wish to minimize changes to RISE and the rewrite rules.
Here we adopt the techniques explored by Koehler [2022] to efficiently encode a polymorphically typed lambda calculus. In practice, these techniques reduce the runtime and memory consumption of equality saturation by orders of magnitude when optimizing RISE programs. We now give a quick overview of these techniques.

Dealing with Substitution.
The β-reduction rule requires substituting b[e/x]. Standard term substitution cannot be used directly during equality saturation, as the b and e pattern variables are not matched by terms, but by e-classes.
A simple way to address this is to use explicit substitution as in egg's lambda calculus example [Willsey et al. 2021]: a syntactic constructor is added representing substitutions, along with rewrite rules to encode its small-step behaviour. Unfortunately, explicit substitution adds all intermediate steps to the e-graph, quickly exploding its size.
To avoid this effect, extraction-based substitution is used, an approximation that works as follows: (1) extract a term for each e-class involved in the substitution (i.e. b and e); (2) perform standard term substitution; (3) add the resulting term to the e-graph.
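The three steps above can be sketched in Python. The model is deliberately crude and every name is hypothetical: an e-class is an explicit set of nested-tuple terms, extraction picks a smallest term, and substitution assumes that bound names never clash with free names of the argument (so capture cannot occur).

```python
def size(term):
    return 1 if isinstance(term, str) else 1 + sum(size(c) for c in term[1:])

def extract(eclass):
    """Step 1: pick one representative (smallest term, ties broken by repr)."""
    return min(eclass, key=lambda t: (size(t), repr(t)))

def substitute(body, var, arg):
    """Step 2: standard term substitution body[arg/var], assuming no capture."""
    if isinstance(body, str):
        return arg if body == var else body
    if body[0] == "lam":
        _, bound, inner = body
        if bound == var:  # var is shadowed below this binder
            return body
        return ("lam", bound, substitute(inner, var, arg))
    return (body[0],) + tuple(substitute(c, var, arg) for c in body[1:])

def beta_by_extraction(body_class, var, arg_class, target_class):
    """Step 3: add the single substituted term to the target e-class."""
    result = substitute(extract(body_class), var, extract(arg_class))
    target_class.add(result)
    return result
```

The approximation is visible in the usage: if the body class holds both `("add", "x", "x")` and `("mul", "x", "2")`, only the extracted representative is substituted, so the substituted form of the other term is not added by this step.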

Dealing with Name Bindings.
During equality saturation, inappropriate handling of name bindings leads to serious efficiency issues. Consider rules like map fusion that create a new lambda abstraction on their right-hand side. What name should be introduced when they are applied? Generating a fresh name using a global counter (a.k.a. gensym) is a common solution in standard term rewriting [Augustsson et al. 1994]. However, such an approach quickly burdens the e-graph with many α-equivalent terms. Instead, De Bruijn indices [de Bruijn 1972] are used to represent lambda terms without naming the bound variables. The more user-friendly name-based rewrite rules are automatically translated to the index-based rules used internally [Bonelli et al. 2000]. Translations to use indices are computationally inexpensive: they are linear in term size, and performed outside of the guided search hot loop. It is only necessary to convert the starting term and the sets of rewrite rules before starting the guided search, and to convert the final term once the search completes.
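The effect of switching to De Bruijn indices can be seen in a short Python sketch. The term encoding is an assumption made for illustration (strings for names, `("lam", name, body)` for abstraction, any other tuple for an operator applied to arguments); it is not the RISE implementation.

```python
def to_de_bruijn(term, bound=()):
    """Replace bound names by indices; `bound` lists binders, innermost first."""
    if isinstance(term, str):
        # Index 0 refers to the innermost enclosing lambda; free names are kept.
        return ("var", bound.index(term)) if term in bound else term
    if term[0] == "lam":
        _, name, body = term
        return ("lam", to_de_bruijn(body, (name,) + bound))
    return (term[0],) + tuple(to_de_bruijn(c, bound) for c in term[1:])

# Two alpha-equivalent terms, as they might arise from generating fresh names.
FUSED_A = ("lam", "x1", ("app", "f", ("app", "g", "x1")))
FUSED_B = ("lam", "x2", ("app", "f", ("app", "g", "x2")))
```

With names, `FUSED_A` and `FUSED_B` are distinct terms that would occupy separate e-nodes; after conversion they are syntactically identical, so the e-graph stores them once.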

Dealing with Freshness Predicates.
Handling predicates is also non-trivial in equality saturation. The η-reduction rule has the side condition "if x not free in f", but in an e-graph f is an e-class and not a term. Following egg's lambda calculus example [Willsey et al. 2021], we only apply the rule if ∀ t ∈ f. x not free in t. This method is efficient but is an approximation. In our case study, we have observed neither the substitution nor the freshness predicate approximation to be an issue.
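A minimal Python sketch of this check, again modelling an e-class as an explicit set of terms and treating every atom as a variable (both are simplifying assumptions, and the function names are hypothetical):

```python
def free_vars(term):
    """Free names of a term; ("lam", name, body) binds name in body."""
    if isinstance(term, str):
        return {term}
    if term[0] == "lam":
        _, name, body = term
        return free_vars(body) - {name}
    return set().union(*(free_vars(c) for c in term[1:]))

def eta_applicable(var, f_class):
    """Allow eta-reduction only if var is free in no term of the e-class."""
    return all(var not in free_vars(t) for t in f_class)
```

The check is conservative: for an e-class such as `{"0", ("mul", "x", "0")}`, the rule is blocked for `x` even though the representative `"0"` does not mention `x`, which is one way to see why the method is an approximation.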

Further Considerations.
Types are also important for RISE rewrite rules. We therefore embed types in the e-graph, associating a type with each e-class that all of its e-nodes must satisfy. The more user-friendly partially-typed RISE rewrite rules are automatically translated to the explicitly typed rules used internally. Types are inferred by first inferring the types on the left-hand side, before checking that the right-hand side is well-typed for any well-typed left-hand side.
Finally, certain program properties are required in RISE to obtain a valid low-level program from which imperative code can be generated. We fully automate enforcing such properties in a final cleanup phase. For example, sequential loops and memory copies are inserted where required, and let expressions are hoisted as much as possible.

Evaluating Guided Equality Saturation for Optimizing Programs in RISE
This section compares guided and unguided equality saturation performing complex RISE program optimizations. We evaluate seven typical compiler optimizations, including loop blocking, loop permutation, vectorization, and multithreading, described in the TVM manual. They have been reproduced by Hagedorn et al. [2020] using manually specified Elevate strategies that express the optimizations as compositions of rewrites achieving the same high performance as TVM.
We compare the runtime and memory requirements for unguided and for guided equality saturation. In both cases the final optimization goal is specified as a sketch, allowing for slightly different programs to be found. We validate the performance of the optimized code by checking that the generated C code is equivalent, modulo variable names, to the code obtained using the manual Elevate strategies. The generated C code is provided in the paper's supplementary material.

Experimental Setup.
We have implemented an equality saturation engine inspired by egg in Scala, allowing close integration with the existing RISE codebase, instead of interfacing directly with egg via Rust-Scala interoperability. The standard Java utilities are used for measurements: System.nanoTime() to measure search runtime, and the Runtime API to approximate maximum heap memory residency with regular sampling.
Platforms. The experiments are performed on two platforms. For manual Elevate strategies and our sketch-guided equality saturation, we use a less powerful AMD Ryzen 5 PRO 2500U with 4 GB of RAM available to the JVM. For unguided equality saturation, we use a more powerful Intel Xeon E5-2640 v2 with 60 GB of RAM available to the JVM.
Rewrite Rule Scheduling. By default, the egg library uses a BackoffScheduler, preventing specific rules from being applied too often and reducing e-graph growth in the presence of "explosive" rules such as associativity and commutativity. Our experience with RISE optimization is that using the BackoffScheduler is counterproductive, as the desired optimization depends on some explosive rules. For this reason, and to make result analysis easier, we do not use a rewrite rule scheduler.
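To give an idea of what a backoff-style scheduler does, here is a simplified Python sketch. It is not egg's implementation: the class, its parameters and their default values are assumptions chosen for illustration. A rule that produces more matches than its threshold is banned for a number of iterations, and both the threshold and the ban length double each time the rule is banned.

```python
class BackoffSchedulerSketch:
    """Illustrative backoff scheduling; parameter values are placeholders."""

    def __init__(self, match_limit=1000, ban_length=5):
        self.match_limit = match_limit
        self.ban_length = ban_length
        self.stats = {}  # rule name -> (times_banned, banned_until)

    def allow(self, rule, iteration, num_matches):
        """Return True if the rule's matches may be applied in this iteration."""
        times_banned, banned_until = self.stats.get(rule, (0, 0))
        if iteration < banned_until:
            return False  # still serving an earlier ban
        if num_matches > self.match_limit << times_banned:
            # Too many matches: ban the rule, doubling threshold and ban length.
            ban = self.ban_length << times_banned
            self.stats[rule] = (times_banned + 1, iteration + ban)
            return False
        return True
```

Under such a policy, an explosive rule like associativity is repeatedly skipped for several iterations, which is precisely what hurts a search whose goal depends on applying that rule.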

Matrix Multiplication Optimizations.
We investigate seven increasingly complicated matrix multiplication optimization goals. Each goal incrementally adds more optimizations.
• The baseline goal uses 3 straightforward nested loops to perform the matrix multiplication.
• The blocking goal adds a blocking (or tiling) optimization for improved data locality, resulting in 6 nested loops where the 3 innermost ones process 4 × 32 × 32 blocks.
• The vectorization goal adds parallelism by vectorizing the innermost loop over 32 elements.
• The loop-perm goal changes the order of the 6 nested loops, for improved data locality.
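The difference between the baseline and blocking goals can be illustrated in plain Python. This is only a sketch of the loop structures, not the C code generated from RISE; in particular, the assignment of the block sizes 32, 32 and 4 to the row, column and reduction dimensions is an assumption.

```python
def matmul_baseline(a, b):
    """Three straightforward nested loops."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c

def matmul_blocked(a, b, bi=32, bj=32, bp=4):
    """Six nested loops: three over blocks, three inside a bi x bj x bp block."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, bi):
        for j0 in range(0, m, bj):
            for p0 in range(0, k, bp):
                # The min(...) bounds handle sizes that are not block multiples.
                for i in range(i0, min(i0 + bi, n)):
                    for j in range(j0, min(j0 + bj, m)):
                        for p in range(p0, min(p0 + bp, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c
```

Both functions compute the same product; the blocked version only changes the order in which the scalar multiplications are visited, which is what a loop-perm style goal then rearranges further.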

Runtime and Memory Consumption of (Un)Guided Equality Saturation.
Unguided Equality Saturation. Table 2 shows the runtime and memory consumption required to find the optimization goals with unguided equality saturation. The search terminates when the sketch describing the optimization goal is found in the e-graph.
The 5 most complex optimization goals are not found before exhausting the 60 GB of available memory. Only the baseline and blocking goals are found, and the search for blocking requires more than 1 h and about 35 GB of RAM. Millions of rewrite rules are applied, and the e-graph contains millions of e-nodes and e-classes. More complex optimizations involve more rewrite rules, creating a richer space of equivalent programs but exhausting memory faster. As examples, vectorization and loop-perm use vectorization rules, while array-packing, cache-blocks, and parallel use rules for optimizing memory storage.
Sketch-Guided Equality Saturation. Table 3 shows the runtime and memory consumption for sketch-guided equality saturation with 1, 2 or 3 sketch guides.
All optimizations are found in less than 10 s using less than 0.5 GB of RAM. Interestingly, the number of rewrite rules applied by sketch-guided equality saturation is in the same order of magnitude as for the manual Elevate strategies reported in [Hagedorn et al. 2020]. On one hand, equality saturation applies more rules than necessary because of its explorative nature. On the other hand, Elevate strategies apply more rules than necessary because they re-apply the same rule to the same sub-expression and do not necessarily orchestrate the shortest possible rewrite path. The e-graphs contain on the order of 10^4 e-nodes and e-classes, two orders of magnitude fewer than the 10^6 required for blocking without sketch guidance.

E-Graph Evolution in (Un)Guided Equality Saturation.
Figure 13 plots the growth of the e-graphs during unguided and sketch-guided searches for the blocking and parallel optimization goals from Tables 2 and 3. The e-graphs produced by unguided equality saturation grow exponentially with each search iteration. The e-graph contains millions of e-nodes and e-classes after applying millions of rules within fewer than 10 iterations. Such rapid growth limits the scalability of unguided search: for example, in the 7th iteration of the parallel search, the e-graph exhausts 60 GB of memory.
While the e-graphs produced with sketch guidance typically also grow exponentially with each iteration, each sketch is satisfied within a few iterations thanks to an appropriate selection of sketches. The number of rewrites and the maximum e-graph size are three orders of magnitude smaller than for unguided search: no more than 11 K in our example searches. Once a program satisfying a sketch guide is found, a new search is started for the next sketch using that program, growing a fresh e-graph. Sketch guidance enables scaling to more complex optimizations, such as parallel, by factoring optimizations into as many sketch-guided searches as necessary.
The search for the final parallel sketch goal shows linear rather than exponential growth, as the rewrite rules selected for the search have little interaction.

Human Guidance Provided.
Sketch-Guides. Table 4 shows how each optimization goal is achieved in logical steps, each corresponding to a sketch describing the program after the step is applied. It transpires that a split sketch similar to the one shown in Section 3.3 is a useful first guide for all goals. While the sketch sizes range from 7 to 12 operators, programs are of size 90 to 124 operators, showing that each sketch elides around 90% of the RISE program. Even when 4 sketches must be written, the total sketch size is still small: the largest total being 38 operators. The paper's supplementary material contains all handwritten sketches as well as examples of discovered RISE programs. Some intricate program aspects never need to be specified in the sketches, for example, array reshaping patterns such as split, join and transpose.
Choice of Rules and Cost Model. Besides the sketches, programmers also specify the rules used in each search and a cost model. For the split sketch, 8 rules explain how to split map and reduce. The reorder sketches require 9 rules that swap various nestings of map and reduce. The store sketch requires 4 rules and the lower sketches 10 rules, 6 rules for vectorization, 1 rule for loop unrolling and 1 rule for loop parallelization. If we naively use all rules for the blocking search, the search runtime increases by about 25×, still finding the goal in minutes but showing the importance of selecting a small set of rules.
A simple cost model that minimizes weighted term size is used. For example, we increase the weight of mapPar to avoid implicit multi-threading. This cost model is only effective when combined with the guides requesting loop splitting, temporary storage, and parallelization. On its own, this cost model is a poor greedy rewriting heuristic, as the optimizations increase weighted term size. Rules and cost models may be reused and packaged into libraries for recurring logical steps.
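A weighted term size of this kind takes only a few lines. In the following Python sketch the weight table, the default weight and the term encoding are illustrative assumptions, not the values used in the evaluation:

```python
DEFAULT_WEIGHT = 1
WEIGHTS = {"mapPar": 10}  # hypothetical: penalize implicit multi-threading

def weighted_size(term, weights=WEIGHTS):
    """Sum of per-operator weights over all nodes of a nested-tuple term."""
    if isinstance(term, str):
        return weights.get(term, DEFAULT_WEIGHT)
    head_cost = weights.get(term[0], DEFAULT_WEIGHT)
    return head_cost + sum(weighted_size(c, weights) for c in term[1:])

def cheapest(candidates, weights=WEIGHTS):
    """Pick the candidate with the lowest weighted size."""
    return min(candidates, key=lambda t: weighted_size(t, weights))
```

Given the two candidates `("mapSeq", "f", "xs")` and `("mapPar", "f", "xs")`, extraction prefers the sequential variant unless a sketch explicitly requests `mapPar`, which matches the role of the guides described above.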

RELATED WORK
Equality Saturation. Equality saturation [Tate et al. 2009; Willsey et al. 2021] has been proposed as an optimization technique based on e-graphs. E-graphs were originally designed for efficient congruence closure [de Moura and Bjørner 2008; Nelson 1980]. Besides theorem proving and program optimization, e-graphs are also useful in other settings, such as program synthesis and semantic code search [Premtoon et al. 2020]. In this paper, we propose guided equality saturation as a semi-automatic technique that mitigates scaling problems of equality saturation. Alternative approaches have been used to scale equality saturation in practice, as discussed in Section 2.3.2.
Sketches. While we propose using incomplete program sketches to guide rewriting, sketching has also been used for synthesizing programs. In [Lezama 2008], sketches are used with counterexample-guided inductive synthesis that combines a synthesizer with a validation procedure. Our approach differs as we use sketches for program rewriting rather than program synthesis. We use sketches as program patterns to filter a set of equivalent programs generated via equality saturation, and as a result do not require a validation procedure.
An automatic optimization procedure for TVM [Zheng et al. 2020] generates sketches before sampling low-level details, but does not use sketches as a vector for human insight.
Also related are Wiedijk [2003]'s formal proof sketches, which are in line with our motivation for the theorem proving use-case of guided equality saturation. The scope and vision of these formal proof sketches is much larger, of course, as guided equality saturation is limited to equational reasoning. Recently, Jiang et al. [2023]'s "draft, sketch, prove" also used a related notion of proof sketches aided with machine learning to close the gap between full formal proofs and informal ones, using formal proof sketches as an intermediate. Our guides can be seen in this context as the sketches, and guided equality saturation is an efficient technique to prove sketches of this kind.
Interactive Theorem Proving. Equational reasoning is important in Interactive Theorem Proving. Tactic languages [Gordon et al. 1979] allow experts to specify proofs manually. Most interactive theorem provers have a form of automatic greedy rewriting and usually also congruence closure [Corbineau 2006; Gjørup and Spitters 2020; Nieuwenhuis and Oliveras 2005; Selsam and de Moura 2016]. The idea of verifying an external solver is also established in provers, with integrations like SMTCoq [Ekici et al. 2017] or Isabelle's Sledgehammer [Blanchette et al. 2013]. Sledgehammer [Böhme and Nipkow 2010] integrates multiple automatic theorem provers, like SPASS [Weidenbach et al. 2009], Vampire [Riazanov and Voronkov 1999] or Zipperposition [Bentkamp et al. 2023]. Sledgehammer also has heuristics for finding appropriate lemmas, and even counter-examples when writing a wrong statement [Blanchette et al. 2011]. Such heuristics would also be very useful in the context of our Lean tactic, as outlined in Section 4. Indeed, our design is heavily influenced by Sledgehammer and SMTCoq. A concrete difference is that we do not reconstruct the proofs at the source level [Paulson and Susanto 2007] like Sledgehammer does, which effectively means the external solvers just work to prune the space of lemmas required to find a proof [Böhme and Nipkow 2010]. Instead, we reconstruct the proof from the series of rewrites that witness the equivalence [Flatt et al. 2022; Nieuwenhuis and Oliveras 2005].
To the best of our knowledge, none of these tools integrate equality saturation, the specific focus here. However, many of the tools share principles and even data structures with equality saturation and are also useful for equational reasoning. With these tools, it may be feasible to provide guidance for congruence closure, superposition, SMT, or other operations with a view to obtaining similar benefits. However, the technical challenges and trade-offs are likely to be different. For example, it is not clear how to simplify terms (cf. Section 4.2.1), or how to use sketch guides without the term extraction from equality saturation.
There is a sense in which our approach can be compared more to structured proof languages, like the calculational proof style in Isar [Nipkow 2002] or equational reasoning in Agda (cf. [Kokke et al. 2020]). Indeed, by writing proof steps with the sketches in Isar's "calculational style" and calling sledgehammer on each subgoal, we can manually reproduce something very similar to the Lean tactic described in this paper. This is also enough to prove the examples from ring theory in Section 4 where using sledgehammer alone did not work. While there are syntactic differences between the two, these techniques are very similar. The different backends of sledgehammer are probably more powerful in general, and will prove a larger set of theorems automatically. Guided equality saturation, on the other hand, allows you to optimize terms (our tactic does this for simplification). In future work we may use sketch-guides instead of terms for more flexible proof sketches, and guarded rewrites to deal with preconditions. These are unique to sketch-guided equality saturation.
Program Optimization. Historically, programmers had either to explicitly write optimized code, e.g., explicit vectorization or loop ordering, or to entrust optimization to a black box compiler.
Black box compilers use fully automatic optimization techniques such as greedy rewriting [Jones et al. 2001; Lattner and Adve 2004], equality saturation [Tate et al. 2009; Yang et al. 2021] or heuristic searches [Mullapudi et al. 2015; Steuwer et al. 2015]. Although full automation can sometimes yield high performance, it is not always feasible or even desirable, as it may result in poor performance or may be too time-consuming [Maleki et al. 2011; Parello et al. 2004]. In particular, greedy rewriting is used in compilers like GHC and LLVM because it is fast. However, ordering greedy rewriting phases to maximize performance is a difficult problem, known as the phase ordering problem [Touati and Barthou 2006]. Moreover, creating effective optimization heuristics is a challenging task that the community seeks to automate [Cummins et al. 2017; Stephenson et al. 2003].
When automatic optimization is unsatisfactory, programmers often fall back to manual optimization in order to achieve their performance goals [Lemaitre et al. 2017; Niittylahti et al. 2002].
More recently, programmers may control optimization through transformation scripts [Chen et al. 2008; Cohen et al. 2005], rewriting strategies [Hagedorn et al. 2020; Koehler and Steuwer 2021; Visser et al. 1998] or scheduling APIs [Chen et al. 2018; Ragan-Kelley et al. 2012]. As discussed in Section 5.1.3, precisely controlling optimization is challenging, quickly requiring excessive programmer effort. Guided optimization is a middle ground that aims to combine the productivity of automatic optimization with the flexibility of controlled optimization. Programmers may guide optimization within tools for the polyhedral framework [Zinenko 2016] or for scheduling APIs [Ikarashi et al. 2021]. To the best of our knowledge, this paper proposes the first guided optimization technique that uses program sketches as guides.
7 DISCUSSION: USER EXPERIENCE
Creating appropriate (sketch) guides is the key creative step in using guided equality saturation, be it for optimization, proof, or something else. This is akin to explaining program optimizations using incomplete program snippets, or explaining proofs using incomplete proof sketches. To understand such explanations, human readers must have the implicit background that enables them to fill in the gaps. With guided equality saturation, this background is explicitly encoded as sets of rewrite rules and cost functions.
Proposed Methodology. Guided equality saturation users can explore using different guides, rewrite rules and cost functions with a view to achieving their rewrite goal. They may well start with their rewrite goal and include all potentially useful rewrites. At first the search may fail by:
(1) Exhausting resources, either time or memory: there is insufficient guidance to make the search feasible, or there is no solution.
(2) Saturating without finding a solution: the goal is unreachable with the given set of rules.
(3) Finding an unexpected solution: the goal is too vague or the rules too permissive.
When a search fails, the user can iteratively apply some of the following steps:
(1) Adding more guides. This splits the search into simpler, more achievable, searches.
(2) Removing misguides. An incorrect guide may obstruct the search. One way to detect misguides is to try finding all guides at once: if a later guide is found more easily, then the earlier guide is likely a misguide.
(3) Adding rewrite rules. Forgetting to explicitly encode background commonly leads to failures.
(4) Removing unnecessary rewrite rules. This both speeds up the search and reduces the size of the e-graph (Section 5.3.5). Moreover it allows the use of less precise sketch guides and cost functions.
(5) Changing sketch precision. An overly precise sketch excludes valid terms with a slightly different structure. An overly vague sketch includes undesirable terms (Section 3.3.2).
(6) Changing the cost function. If a search succeeds but extracts a different term than expected, using a better cost function may solve the problem.
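The failure modes and the effect of adding a guide can be illustrated with a toy Python sketch. It is not equality saturation: a bounded breadth-first exploration of string rewrites stands in for e-graph growth, and the first matching term is returned instead of the cheapest one. The names `search`, `guided_search`, `RULES` and the `max_terms` budget are assumptions made for this example.

```python
def search(start, goal, rules, max_terms=1000):
    """Explore terms reachable from `start` until one satisfies `goal`.

    Returns ("found", term), ("saturated", None) when no new terms can be
    produced, or ("exhausted", None) when the term budget is used up.
    """
    seen = {start}
    frontier = [start]
    while frontier:
        next_frontier = []
        for term in frontier:
            if goal(term):
                return "found", term
            for lhs, rhs in rules:
                i = term.find(lhs)
                while i != -1:
                    new = term[:i] + rhs + term[i + len(lhs):]
                    if new not in seen:
                        if len(seen) >= max_terms:
                            return "exhausted", None
                        seen.add(new)
                        next_frontier.append(new)
                    i = term.find(lhs, i + 1)
        frontier = next_frontier
    return "saturated", None


def guided_search(start, goals, rules, max_terms=1000):
    """Run one search per goal, restarting each search from the term found last."""
    term = start
    for index, goal in enumerate(goals):
        status, found = search(term, goal, rules, max_terms)
        if status != "found":
            return status, index, term  # report which goal failed
        term = found
    return "found", len(goals), term


# A single toy rule that moves a 'b' one position to the left.
RULES = [("ab", "ba")]

# Failure mode (1): the budget is too small for the unguided search.
unguided = search("aabb", lambda t: t == "bbaa", RULES, max_terms=4)
# Failure mode (2): the goal is unreachable with the given rules.
unreachable = search("aabb", lambda t: t == "abc", RULES)
# Step (1): an intermediate guide makes both searches fit the same budget.
guided = guided_search(
    "aabb",
    [lambda t: t.startswith("b"), lambda t: t == "bbaa"],
    RULES,
    max_terms=4,
)
```

In this toy setting the guide helps because each search restarts from a single term with a fresh budget, which is analogous to initializing a new e-graph from the term extracted for the previous guide.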
Improving the User Experience. To be useful, guided equality saturation must require less human effort than manual term rewriting, e.g. with rewriting strategies or tactics. While our intuition and case studies suggest that this is true in many scenarios, we have not attempted to quantify and contrast the human effort required. User studies could provide evidence of reduced effort, and insights into how to further improve user experience. Future work may investigate how to build interactive tools for guided equality saturation, and how to provide effective feedback: e.g. visualizing the cost and result of the searches, or suggesting one of the iterative steps previously listed using machine learning.
Additional features might also improve user experience. For example, manipulating a tree of guides instead of a sequence may encourage more experimentation. SketchBasic includes the constructs that we found useful for our case studies. Different applications, however, may benefit from other constructs like named holes or sketch intersection. This paper does not present a universal sketch language that would serve a wide range of applications, as expressivity and extraction cost must be considered for a given application. As with regular expressions or database queries, more powerful constructs tend to require more expensive algorithms: this is a problem worth studying on its own.
In some domains developing a library of guides, sets of rewrite rules, and cost functions could also improve user experience. The supplementary material illustrates how we have defined sketch abstractions and reused them across all of the guides in Table 4. Moreover we find that some guides are useful across different goals, and discuss how four sets of rewrite rules are reused across searches. Future work could extend this to investigate the design of generic guides, sets of rewrite rules, and cost functions for reuse across multiple searches.

CONCLUSION
This paper explores an effective trade-off between the painstaking manual control of rewriting and automated, but often unsuccessful, rewriting. The motivating intuition is that humans often explain rewriting using (potentially incomplete) terms that identify the key rewrite stages, omitting a large number of intermediate routine rewrites. Specifically, we propose guided equality saturation, a semi-automatic rewriting technique, to scale beyond fully-automatic rewriting with limited human effort. For problems requiring long rewrite sequences, or some specific insight, such that fully automated equality saturation fails, a human provides guides. The guides are intermediate rewrite goals and may be either complete terms or incomplete sketches. With guidance, rewriting scales as it is decomposed into a sequence of equality saturations, each of feasible cost (Section 3).
We demonstrate the generality and effectiveness of guided equality saturation using case studies in theorem proving and program optimization. Using guided equality saturation as a novel tactic in the Lean 4 proof assistant allows proofs to be written in the style of textbook proof sketches, i.e., as a series of calculations that omit details and skip steps. With the use of guides, we find more proofs than unguided equality saturation. When both find a proof, the guides reduce search time from minutes to fractions of a second (Section 4). Using guided equality saturation to optimize programs in the RISE functional array language enables optimizations that cannot be achieved with unguided equality saturation within an hour and using 60 GB of memory. Using no more than 3 guides, the same optimizations are achieved within seconds and using less than 1 GB of memory (Section 5).
We believe that guidance will increase the scale of the problems that can be solved with equality saturation across application domains. For theorem proving, we hope that equality saturation can bring scalable sketch-guided proof-search into day-to-day proof development. For program optimization, we envision performance engineers guiding compilers to generate better code using program sketches. In both domains, we expect problems to be resolved more effectively due to a powerful interplay between human intuition and scalable rewriting.
Fig. 1. Growing an e-graph for the term ( ( )) • ( • ( ( ))). An e-graph is a set of e-classes themselves containing equivalent e-nodes. The dashed boxes are e-classes, and the solid boxes are e-nodes. New e-nodes and e-classes are shown in red.

Fig. 2. The congruence invariant simplifies the e-graph on the left by merging two identical e-nodes into a single e-node, as shown on the right.

Fig. 4. Memory footprint during equality saturation performing loop tiling. While one- and two-dimensional tiling is performed easily, three-dimensional loop tiling fails as the e-graph requires too much memory.

Fig. 5. Where unguided equality saturation (top) fails to reach its rewrite goal, guided equality saturation (bottom) succeeds. The human expert guides the rewrite process by providing intermediate goals (guides), and with them breaks the single infeasible equality saturation into a sequence of feasible equality saturations.

Fig. 6. Performing a single equality saturation as part of the overall guided equality saturation process. An e-graph is initialized with the term extracted by the prior equality saturation (listing 1 line 3). As long as the next goal is not reached (while loop in listing 1 line 4), rewrites are applied to grow the e-graph as usual. Once the goal has been satisfied, a term satisfying the goal is extracted from the e-graph (listing 1 line 17) to start the next equality saturation aiming to reach the next goal.
Fig. 7. Memory footprint for performing three-dimensional loop tiling for unguided equality saturation on the left and guided equality saturation on the right. Where unguided equality saturation runs out of memory, a single guide is sufficient for guided equality saturation to succeed. The guide is described in Section 3.3.
Different numbers of guides, power 3 variant.

Fig. 9. Comparing runtimes and number of steps required for proving variants of the "freshman's dream" equality in Lean 4.

Fig. 11. A blocking optimization for matrix multiplication performed via rewriting in RISE. In the initial program (top), a dot product is computed between each row of a (aRow) and column of b (bCol). In the final program (bottom), a blocking optimization has been applied. Loops characteristic of the optimization are shown on the right of the | symbols, and are not part of the RISE program. The remaining program regions reshape input arrays, initialise arrays, compute with scalars, and reshape output arrays.
Fig. 13. The evolution of the e-graph, and the number of rewrite rules applied, during searches for two optimization goals. Sketch guides are depicted with purple vertical lines. Note that the scale of the y-axes for unguided graphs (a) and (b) is millions, while for guided graphs (c) and (d) it is thousands.

Table 1. Comparing different methods for proving the Knuth-Bendix Lemmas for groups.
[Hurd 2003; Paulson and Susanto 2007] suitable syntactic sugar - corresponds roughly to what you would expect to see in a textbook proof. For comparison, if we use the same intermediate steps and try to prove the individual equalities with greedy rewriting, this does not yield a proof. On the other hand, a manual version that does all the rewriting explicitly requires 15 steps for this proof and is tediously verbose. Unguided equality saturation also finds a proof, but has a significantly longer runtime. Interestingly, while sledgehammer's calls to external solvers find proofs, neither metis [Hurd 2003; Paulson and Susanto 2007] nor Isabelle's smt tactics [Blanchette

Table 2. Runtime and memory consumption for unguided equality saturation with efficient lambda calculus encoding. Only the baseline and blocking optimization goals are found, with other optimizations exceeding 60 GB.

Table 3. Runtime and memory consumption for sketch-guided equality saturation with efficient lambda calculus encoding. All optimizations are found in seconds using less than 0.5 GB of memory and requiring at most 3 sketch guides.

Table 4. Decomposition of each optimization goal into logical steps. A sketch is defined for each logical step. In this table, sketch size counts operators such as containsMap, program size counts operators such as map, lambdas, variables and constants: not applications.