Efficient Bottom-Up Synthesis for Programs with Local Variables

We propose a new synthesis algorithm that can efficiently search programs with local variables (e.g., those introduced by lambdas). Prior bottom-up synthesis algorithms are not able to evaluate programs with free local variables, and therefore cannot effectively reduce the search space of such programs (e.g., using standard observational equivalence reduction techniques), making synthesis slow. Our algorithm can reduce the space of programs with local variables. The key idea, dubbed lifted interpretation, is to lift up the program interpretation process, from evaluating one program at a time to simultaneously evaluating all programs from a grammar. Lifted interpretation provides a mechanism to systematically enumerate all binding contexts for local variables, thereby enabling us to evaluate and reduce the space of programs with local variables. Our ideas are instantiated in the domain of web automation. The resulting tool, Arborist, can automate a significantly broader range of challenging tasks more efficiently than state-of-the-art techniques including WebRobot and Helena.


INTRODUCTION
Web automation automates web-related tasks such as scraping data and filling web forms. While a growing number of people have found it useful [Chasins 2019; Katongo et al. 2021; UiPath 2022], it is notoriously difficult to create web automation programs [Krosnick and Oney 2021]. Let us consider the following example, which we also use as a running example throughout the paper.
Example 1.1. https://haveibeenpwned.com/ is a website where one can check whether or not an email address has been compromised (i.e., "pwned"). Figures 1 and 2 show the webpage DOMs for an email with no pwnage detected and for a pwned email, respectively; both figures are simplified from the original DOMs solely for presentation purposes. Consider the task of scraping the pwnage text for each email from a list of emails. Figure 3 shows an automation program for this task. While the program's high-level logic is rather simple, implementing it turns out to be very difficult. First, one must implement the right control-flow structure, such as the loop with three instructions. Second, each instruction must use a generalizable selector to locate the desired DOM element for all emails across all iterations. For example, line 4 from Figure 3 uses a generalizable selector that utilizes the aria-expanded attribute. This attribute is necessary for generalization, since both messages (pwned and not pwned) are always in the DOM, and which one to render is determined by this attribute's value. On the other hand, the full XPath expression /html/body/div/div/div/div/h2 locates the element with "Good news - no pwnage found!" in both Figure 1 and Figure 2, which is not desired. In general, one has to try many candidate selectors before finding a generalizable one; in other words, this is fundamentally a search problem.
Synthesizing web automation programs. In general, implementing web automation programs requires both creating the desired control-flow structure (potentially with arbitrarily nested loops) and identifying generalizable selectors for all of its instructions; this is very hard and time-consuming.
WebRobot [Dong et al. 2022] is a state-of-the-art technique that allows non-experts to create web automation programs from a short trace a of user actions (e.g., clicking a button, scraping text). A key underpinning idea is its trace semantics: given a program P, the trace semantics outputs the sequence a′ of actions that P executes, by resolving any free variables (such as the local loop variable from Figure 3). Then, one can check whether P satisfies a by checking whether P's output trace a′ matches a.
Key challenge: synthesizing programs with local variables. While trace semantics significantly bridges the gap between programming-by-demonstration (PBD) and programming-by-example (PBE) by enabling a "guess-and-check" style synthesis approach for web automation [Chen et al. 2023; Dong et al. 2022; Pu et al. 2022, 2023], state-of-the-art synthesis algorithms unfortunately fail to scale to challenging web automation tasks, due to the heavy use of local variables in those tasks. Consider the enormous space of potential loop bodies to be searched. All these bodies use loop variables and therefore must be evaluated under an extended context that also binds such local variables to values. Conventional observational equivalence (OE) from the program synthesis literature [Albarghouthi et al. 2013; Udupa et al. 2013] fundamentally cannot reduce this space: it tracks bindings only for input variables, not for local variables. As a result, conventional techniques cannot build equivalence classes for loop bodies, and hence fall back to enumeration. The only existing work (to the best of our knowledge) that pushed the boundary of OE is RESL [Peleg et al. 2020]. Briefly, its idea is to "infer" bindings for local variables given a higher-order sketch, which enables applying OE over the space of lambda bodies. A fundamental problem, however, is the data dependency across iterations (for functions like fold). Its implication for synthesis is succinctly summarized by RESL as a "chicken-and-egg" problem: we need the output values of programs in order to apply OE for more efficient synthesis, but we need the programs first in order to obtain their output values. This cyclic dependency fundamentally limits all prior work (such as RESL, among others [Feser et al. 2015; Smith and Albarghouthi 2016]) to sketch-based approaches with crafted binding-inference rules, which still resort to enumeration in many cases. The general problem of how to reduce the space of programs with local variables remains open [Peleg et al. 2020].
Our idea: lifted interpretation. The same problem occurs in our domain: as we will show shortly, loops in web automation use local variables and exhibit data dependency across iterations. In this work, we propose a new algorithm that can apply OE-based reduction to any programs, without requiring binding inference. We build upon the OE definition from RESL [Peleg et al. 2020]: two programs belong to the same equivalence class if they share the same context and yield the same output. Notably, "context" here is a binding context over all free variables, including both input and local variables. Our key insight can be summarized as follows.
We can compute an equivalence relation of programs based on OE under all reachable contexts, by creating equivalence classes simultaneously while evaluating all programs from a given tree grammar (with respect to a given input).Furthermore, we can use (a generalized form of) finite tree automata to compactly store the equivalence relation.
Here, a reachable context is one that emerges during the execution of at least one program from the grammar, for a given input. Furthermore, programs rooted at the same grammar symbol share reachable contexts. For instance, the loop from Figure 3 introduces the same binding context for any loop body that can be put inside it. However, computing reachable contexts requires program evaluation, which in turn requires reachable contexts; this is the aforementioned "chicken-and-egg" problem described in RESL [Peleg et al. 2020]. To break this cycle, our key insight is to simultaneously evaluate all programs top-down from the grammar, during which we construct equivalence classes of programs bottom-up based on their outputs under their reachable contexts. This idea essentially lifts an interpreter from evaluating one single program at a time to simultaneously evaluating all programs from a grammar, with respect to a given input. This lifted interpretation process allows us to systematically enumerate all reachable contexts, and hence build equivalence classes for all programs, including those with local variables.
General idea, instantiation, and evaluation. In the rest of this paper, we first illustrate how our idea works in general (Section 3), using a small functional language. Then, in Section 4, we present an instantiation of our approach in the domain of web automation. We implement this instantiation in a tool called Arborist. Our evaluation results show that Arborist can solve more challenging benchmarks using much less time, significantly advancing the state of the art for web automation.

PRELIMINARIES
In this section, we review the standard concepts of observational equivalence (OE) and finite tree automata (FTAs) from the literature, focusing on their application to program synthesis.

Synthesis using Observational Equivalence
Observational equivalence (OE) was originally proposed by Hennessy and Milner [1980] to define the semantics of concurrent programs, and has been used widely within the programming languages community. Intuitively, two terms are observationally equivalent whenever they are interchangeable in all observable contexts. In the field of programming-by-example (PBE), OE has been utilized to reduce the search space of programs, typically in bottom-up synthesis algorithms.
Bottom-up synthesis. Bottom-up algorithms synthesize programs by first constructing smaller programs, which are later used as building blocks to create bigger ones. Specifically, the algorithm begins with an initial set E containing all atomic programs of size 1 (e.g., input variables, constants), and then iteratively grows E by adding new programs of larger sizes that are composed of those already in E. The algorithm terminates when E has a program P that meets the given specification.
For instance, for a language that includes the variable x and the integer constants 1 and 2, E is initially {x, 1, 2} but later contains more terms such as x + 1 and x + 2, assuming the + operator is allowed by the language. If the specification is given as an input-output example pair (1, 3), meaning "return 3 when x = 1", then x + 2 is a correct program whereas x + 1 is not.
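This search loop can be sketched in a few lines of Python. The following is a minimal illustration for the toy language above; the term representation and function names are ours, not from the literature:

```python
# Bottom-up enumerative synthesis for a toy language: terms are "x",
# the constants 1 and 2, and ("add", t1, t2). Programs are nested tuples.

def evaluate(term, x):
    """Evaluate a term under the binding for input variable x."""
    if term == "x":
        return x
    if isinstance(term, int):
        return term
    op, lhs, rhs = term
    assert op == "add"
    return evaluate(lhs, x) + evaluate(rhs, x)

def bottom_up_synthesize(example_in, example_out, max_size=3):
    """Grow a set E of programs by size until one satisfies the example."""
    E = ["x", 1, 2]                      # all atomic programs of size 1
    for _ in range(max_size - 1):
        for p in E:
            if evaluate(p, example_in) == example_out:
                return p
        # compose larger programs from those already in E
        E += [("add", a, b) for a in list(E) for b in list(E)]
    for p in E:
        if evaluate(p, example_in) == example_out:
            return p
    return None

result = bottom_up_synthesize(1, 3)      # "return 3 when x = 1"
# -> ("add", "x", 2), i.e., the program x + 2
```

Without any pruning, E grows combinatorially with program size, which is exactly what the reduction technique below addresses.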
Observational equivalence reduction. Bottom-up synthesis often uses observational equivalence to reduce the program space and thereby improve search efficiency. The key idea is to not add a new program P to E if there already exists some P′ ∈ E that behaves the same as P observationally.
In particular, existing PBE work [Albarghouthi et al. 2013; Udupa et al. 2013] defines two programs P_1, P_2 to be observationally equivalent if they yield the same output on each input example. This idea keeps only programs that are observationally distinct (given the input examples), thereby reducing the size of E and accelerating the search. RESL [Peleg et al. 2020] further generalizes OE to consider an extended context that also includes local variables: two programs are observationally equivalent if they yield the same output given a shared context (which may include local variables). However, conventional bottom-up synthesis algorithms no longer work under this OE definition, as they cannot evaluate programs with free local variables before knowing their binding context.

Synthesis using Finite Tree Automata
OE essentially defines an equivalence relation of programs, which can be stored using finite tree automata (FTAs) [Wang et al. 2017a].
Finite tree automata. Finite tree automata (FTAs) [Comon et al. 2008] deal with tree-structured data: they generalize standard finite (word) automata by accepting trees rather than words/strings.
Definition 2.1 (Finite Tree Automata). A (bottom-up) finite tree automaton (FTA) over alphabet Σ is a tuple A = (Q, Σ, Q_f, Δ), where Q is a set of states, Q_f ⊆ Q is a set of final states, and Δ is a set of transitions of the form f(q_1, ⋯, q_n) → q where q_1, ⋯, q_n, q ∈ Q and f ∈ Σ. A term t is accepted by A if t can be rewritten to a final state according to the transitions (i.e., rewrite rules). The language of A, denoted L(A), is the set of terms accepted by A.
Notations. We also use A = (Q_f, Δ) as a simpler notation, since Q and Σ can be determined by Δ. We use SubFTA(q, A) to denote the sub-FTA of A that is rooted at state q.
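To make Definition 2.1 concrete, here is a small executable sketch in our own encoding: transitions are stored as a dictionary from (symbol, argument states) to the resulting state, and acceptance is checked by rewriting a term bottom-up. The even/odd automaton is a hypothetical example, not one from the text:

```python
# A bottom-up finite tree automaton: terms rewrite to states according to
# transitions of the form f(q1, ..., qn) -> q; a term is accepted if it
# rewrites to a final state.

def rewrite(term, transitions):
    """Rewrite a term (nested tuples; leaves are nullary symbols) to a state."""
    if not isinstance(term, tuple):
        return transitions.get((term, ()))
    f, *args = term
    states = tuple(rewrite(a, transitions) for a in args)
    return transitions.get((f, states))

def accepts(fta, term):
    final_states, transitions = fta        # the simplified (Q_f, Delta) notation
    return rewrite(term, transitions) in final_states

# An FTA accepting terms that evaluate to an even number over {0, 1, plus}:
delta = {
    ("0", ()): "q_even",
    ("1", ()): "q_odd",
    ("plus", ("q_even", "q_even")): "q_even",
    ("plus", ("q_odd", "q_odd")): "q_even",
    ("plus", ("q_even", "q_odd")): "q_odd",
    ("plus", ("q_odd", "q_even")): "q_odd",
}
fta = ({"q_even"}, delta)
```

For example, the term plus(1, 1) rewrites to q_even and is accepted, whereas plus(1, 0) rewrites to q_odd and is rejected.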
Program synthesis using FTAs. Given a tree grammar G defining the syntax of a language and an input-output example E = (E_in, E_out), we can construct an FTA A = (Q, Σ, Q_f, Δ) such that L(A) contains all programs from G (up to a finite size) that satisfy E. In particular, the alphabet Σ consists of all operators from G. We have a state q_s^c ∈ Q if there exists a program rooted at symbol s from G that outputs c given input E_in. We have a transition f(q_s1^c1, ⋯, q_sn^cn) → q_s^c if applying operator f to the values c_1, ⋯, c_n yields c, per a production s → f(s_1, ⋯, s_n) in G. A state q_s^c is final if c matches E_out and s is a start symbol of G. Once A is constructed, one can extract a program P from L(A) heuristically (e.g., smallest in size) and return P as the final synthesized program [Wang et al. 2017a].

Figure 4. A simple functional language. Here, x is the input variable, which is a list of integers. We simplify the standard fold operator to use a default seed of 0 (which is implicit and not shown as an argument). Note that fold introduces two local variables: acc is the accumulator, and elem is bound to each element from x; the lambda body may use both acc and elem. The add and mult operators are standard addition and multiplication.

Remarks. Every state q_s^c ∈ Q represents an equivalence class of all programs rooted at grammar symbol s that produce the same value c on E_in. In other words, q_s^c stores all observationally equivalent programs. To the best of our knowledge, all existing FTA-based synthesis techniques [Miltner et al. 2022; Wang et al. 2017a,b, 2018b; Yaghmazadeh et al. 2018] are based on the notion of OE that considers only input variables. In other words, all existing techniques resort to enumeration for programs with free local variables [Peleg et al. 2020].

LIFTED INTERPRETATION
This section illustrates how the general idea of lifted interpretation works on a simple functional language. Section 4 will later describe a full-fledged instantiation in the domain of web automation.

A Simple Programming-by-Example Task
Example 3.1. Given the simple functional language from Figure 4, let us consider the following programming-by-example (PBE) task: synthesize a program that returns 7 given the input list [1, 2, 4]. Suppose the intended program is

  P_1 : fold x, (acc, elem) ⇒ add(acc, elem)

which calculates the sum of all elements from the input list x. Consider another program:

  P_2 : fold x, (acc, elem) ⇒ add(mult(acc, 2), 1)

which returns the same output 7 as P_1, given the example input [1, 2, 4]. Notably, while the lambda bodies in P_1 and P_2 are different, they share the same context-output behaviors (or footprint) for the given input list. Table 1 shows their local variable bindings and corresponding output values across all iterations. In what follows, we illustrate how to synthesize P_1 and P_2 from the input-output example [1, 2, 4] ↦ 7, using our lifted interpretation idea.

Table 1. Local variable bindings and lambda-body outputs across iterations.

  local variable bindings    P_1 : add(acc, elem)    P_2 : add(mult(acc, 2), 1)
  acc ↦ 0, elem ↦ 1          1                       1
  acc ↦ 1, elem ↦ 2          3                       3
  acc ↦ 3, elem ↦ 4          7                       7
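We can verify this shared footprint directly. The sketch below (our own encoding, not the paper's) replays fold's iterations on [1, 2, 4] and records each binding-to-output behavior for both lambda bodies:

```python
# Footprints of two lambda bodies under fold's reachable contexts.
# fold uses an implicit seed of 0; acc and elem are the local variables.

def fold_footprint(body, xs, seed=0):
    """Return the list of behaviors (context |-> output) across iterations."""
    footprint, acc = [], seed
    for elem in xs:
        out = body(acc, elem)
        footprint.append(({"acc": acc, "elem": elem}, out))
        acc = out                 # dependency: next context uses this output
    return footprint

p1_body = lambda acc, elem: acc + elem            # add(acc, elem)
p2_body = lambda acc, elem: (acc * 2) + 1         # add(mult(acc, 2), 1)

fp1 = fold_footprint(p1_body, [1, 2, 4])
fp2 = fold_footprint(p2_body, [1, 2, 4])
# Both bodies output 1, 3, 7 under bindings (0,1), (1,2), (3,4), so the
# two lambda bodies land in the same equivalence class for this input.
```

The `acc = out` line is where the cross-iteration data dependency shows up: a body's output fixes the next reachable context, which is why the contexts cannot be known before evaluation.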

FTAs based on Observational Equivalence
Let us first present a new FTA-based data structure that our lifted interpretation approach utilizes to succinctly encode equivalence classes of programs. The main ingredient is its generalization of the FTA state definition from prior work [Wang et al. 2017a]: our state includes a context, which contains the information (e.g., all variable bindings) needed to evaluate programs with free local variables. In particular, we define an FTA state q as a pair (s, Ω). Here, s is a grammar symbol, and Ω is a footprint, which maps a context c_i to an output v_i. Each entry c_i ↦ v_i is called a behavior; so a footprint is a set of behaviors. The remaining definitions are relatively standard. Our alphabet Σ includes all operators from the programming language. A transition τ ∈ Δ is of the form f(q_1, ⋯, q_n) → q, which connects multiple states to one state. However, because our state definition is more general, the condition under which to include a transition now differs from prior work [Wang et al. 2017a]. Specifically, A includes a transition τ = f(q_1, ⋯, q_n) → q if, for every behavior c ↦ v in q's footprint, we have behaviors c_i ↦ v_i in q_i's footprint Ω_i (for all i ∈ [1, n]) such that, according to f's semantics and given the outputs v_1, ⋯, v_n under the contexts that f induces from c, applying f yields v under c.

Illustrating Lifted Interpretation
Now we are ready to explain how our lifted interpretation idea works for Example 3.1.
Setup. First, we build an FTA A_g = ({q_1}, Δ_g) for the grammar in Figure 4 (see Figure 5). We will later apply lifted interpretation to A_g. Each state in A_g is annotated with a grammar symbol and an empty footprint. Notice the cyclic transitions around q_3, due to the recursive add and mult productions. These induce an infinite space of programs; our approach finitizes the grammar by bounding the size of its programs, as is standard in the literature [Wang et al. 2017a].
Applying lifted interpretation. Then, we use lifted interpretation to "evaluate" A_g under an initial context, which eventually produces another FTA A_e = ({q_11}, Δ_e) (a part of it is shown in Figure 6). Different from A_g, A_e clusters programs (including all sub-programs) into equivalence classes based on OE. Figure 7 shows the annotations (i.e., grammar symbols and footprints) for A_e.
At a high level, lifted interpretation traverses A_g systematically, computes reachable contexts on the fly during the traversal, and, most importantly, constructs the equivalence classes simultaneously given these reachable contexts. In particular, given an initial context c and an FTA A_g = (Q_f, Δ) with final states Q_f and transitions Δ, lifted interpretation returns an FTA A_e = (Q′_f, Δ′) with final states Q′_f and transitions Δ′, such that if two programs from L(A_g) yield the same output under c, then they belong to the same (final) state in A_e.

Figure 5. The FTA A_g for the grammar from Figure 4, with states q_1 : (P, ∅), q_2 : (L, ∅), and q_3 : (E, ∅) and transitions including x and fold. Each FTA state is annotated with a grammar symbol and a footprint (which maps a reachable context to a value).
Figure 6. A part of A_e after applying lifted interpretation to A_g from Figure 5: states q_4 through q_11, with transitions including x, fold, acc, elem, add, mult, 1, and 2. Annotations on the FTA states can be found in Figure 7.
Figure 7. Annotations (i.e., grammar symbols and footprints) for all states in A_e from Figure 6.
Key judgment. The key judgment that drives the lifted interpretation process is of the form c ⊢ q; Δ ⇝ (q′, Δ′). Note that this judgment is non-deterministic; that is, a state may evaluate to multiple states.
Inference rules. The key inference rule that implements this judgment is the Transition rule. It says: evaluating a state q boils down to evaluating each of q's incoming transitions τ. For instance, to evaluate q_1 in A_g (see Figure 5) under c = {x ↦ [1, 2, 4]}, we would evaluate fold(q_2, q_3) → q_1 under c. Two further inference rules, Fold-1 and Fold-2, describe how to actually evaluate a fold transition. Let us take fold(q_2, q_3) → q_1 from A_g as an example and explain how these two rules work.
The Fold-1 rule first evaluates q_2 from A_g to q_4 in A_e, which, as mentioned above, boils down to evaluating q_2's incoming transition x → q_2. This is done using the Input-Var rule.
Specifically, Input-Var first obtains the value ℓ (which is [1, 2, 4]) that x binds to in c. Then it creates a footprint Ω′ that includes all existing behaviors plus the new behavior c ↦ ℓ. Finally, a new state q′ is created, with grammar symbol L and footprint Ω′; here, MkState is simply a state constructor. In our example, Input-Var yields q_4, whose footprint maps {x ↦ [1, 2, 4]} to [1, 2, 4]. Popping back up to Fold-1: given q_4, it retrieves the output ℓ for q_4 given context c, notably, by looking up q_4's footprint. It then uses an auxiliary rule, Fold-2, to evaluate q_3 from A_g, which corresponds to the lambda bodies, given c and ℓ. This yields q_7 in A_e, among potentially other states that are not shown in Figure 6. Again, the guarantee is that all programs in L({q_7}, Δ_e) share the same footprint. Now, given q_7, Fold-1 finally creates q_11 in A_e, as well as the transition fold(q_4, q_7) → q_11.
The key is to compute q_11's footprint Ω′, which inherits everything from q_1's footprint Ω but additionally includes the behavior c ↦ v_|ℓ|. Here, v_|ℓ| (which is v_3 in our example, as |ℓ| = 3) is the output of the fold operation under the context c = {x ↦ [1, 2, 4]}, the input example we are concerned with. We note that the computation of v_|ℓ| is based on looking up q_7's footprint. Now let us briefly explain how the Fold-2 rule evaluates q_3 to q_7. The evaluation is an iterative process that follows fold's semantics. It begins with a context c_0 that binds the accumulator acc to the default seed 0 and binds elem to the first element of ℓ; it then recursively evaluates q′_0 (i.e., q_3 in our example) under c_0, which yields q′_1 (not shown in Figure 6); finally, it obtains v_1 (to which acc should be bound in the next iteration) by (again) looking up the footprint of q′_1. Note that q′_1 is an intermediate state whose footprint has one behavior, since we have seen only c_0 so far. The second iteration repeats the same process, but for q′_1 and using v_1, which yields q′_2, whose footprint has two behaviors. This process continues until we reach the end of ℓ, eventually yielding q′_|ℓ|; q_7 in A_e is one such state. As shown in Figure 7, q_7 has three behaviors in its footprint.
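The Fold-2 iteration can be approximated in code. The sketch below is a deliberate simplification in our own notation: it operates on an explicit set of lambda bodies rather than on an FTA, evaluating all candidate bodies at once and clustering them by their footprint after each iteration. Each cluster shares its reachable context for the next iteration, which is what breaks the chicken-and-egg cycle:

```python
# Lifted interpretation of fold's lambda bodies: evaluate ALL candidate
# bodies simultaneously, clustering them by footprint after each iteration.
# Programs in one cluster share the same accumulator, hence the same
# reachable context for the next iteration.

def lifted_fold(bodies, xs, seed=0):
    """Map each footprint (a tuple of behaviors) to its class of bodies."""
    classes = {(): list(bodies)}          # start: one class, empty footprint
    for elem in xs:
        next_classes = {}
        for footprint, members in classes.items():
            # all members of a class share accumulator = last output (or seed)
            acc = footprint[-1][1] if footprint else seed
            for body in members:
                out = body(acc, elem)
                key = footprint + (((acc, elem), out),)
                next_classes.setdefault(key, []).append(body)
        classes = next_classes
    return classes

bodies = [
    lambda acc, elem: acc + elem,          # add(acc, elem)
    lambda acc, elem: acc * 2 + 1,         # add(mult(acc, 2), 1)
    lambda acc, elem: elem,                # elem
]
classes = lifted_fold(bodies, [1, 2, 4])
# Two classes remain: the first two bodies share one footprint; elem forms
# a class of its own once its outputs diverge.
```

Note that no binding inference is needed: the contexts are produced by the evaluation itself, and clustering keeps the number of evaluations proportional to the number of equivalence classes rather than the number of programs.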
We skip the discussion of the other rules, which are used to construct all the other states in A_e, and refer readers to the appendix of our paper for a complete list of rules. In the end, given A_e, we mark the states whose footprints satisfy the specification as final states, and extract a program from A_e. In our example, q_11 is final, and both P_1 and P_2 are in L({q_11}, Δ_e).
Figure 8. Web automation language. Left: the syntax, whose atoms include strings, integers, HTML tags, and HTML attributes. Right: a subset of the trace semantics rules; the complete set of rules can be found in the appendix.

INSTANTIATION TO WEB AUTOMATION
This section presents a full-fledged instantiation of the lifted interpretation idea to web automation.

Web Automation Language
Figure 8 presents our web automation language. The syntax is slightly different from the one in WebRobot [Dong et al. 2022]. First, our syntax looks more "functional": for example, loop bodies are presented as lambdas. This is solely for the purpose of making it easier to present our approach later. Second, the language is also slightly more expressive: ForSelectors allows starting from the i-th child/descendant with i ≥ 1, whereas WebRobot's syntax requires i = 1. This extension is motivated by an observation we made when curating new benchmarks: many tasks require this more relaxed form of loop. We refer interested readers to the WebRobot work for more details about the syntax, but in brief, a web automation program is always a sequence of statements, and the language supports different types of statements. For example, Click clicks a DOM element located by a selector expression. We use an XPath-like syntax for selector expressions: a child selector (/) gives the i-th child that satisfies a given tag and predicate, whereas a descendant selector (//) considers descendants instead. x is an input variable that is bound to a user-provided data source (like the list of emails from Example 1.1), whereas loop variables are local variables introduced by and internal to the program. ForData is a loopy statement that iterates over a list of data entries (such as the emails from Example 1.1) given by a data expression, binds its loop variable to each of them, and executes the loop body. ForSelectors is quite similar, but it loops over a list of selector expressions. While handles pagination: it repeatedly clicks the "next page" button located by a selector and executes the loop body, until that selector no longer matches an element on the webpage. Figure 8 also presents a subset of the trace semantics rules, which are cleaner than WebRobot's. The evaluation judgment is of the form Π, Γ ⊢ P ⇝ a′, Π′, which reads: given a context, consisting of a DOM trace Π and a binding environment Γ (which binds all free variables in scope), evaluating program P yields an action trace a′ and a DOM trace Π′. This evaluation does not execute P in a browser; instead, it simulates the execution by "replaying" P given Π. We refer interested readers to the WebRobot paper [Dong et al. 2022] for the design rationale. Here, we briefly explain a few representative rules. The Seq rule is standard: it evaluates the two statements in sequence, and concatenates the resulting action traces. The Click rule is more interesting. It first evaluates the selector expression (which may use a variable) under Π's first DOM and Γ, yielding a selector that is used to form the output action. Then, it removes the first DOM from Π to obtain the resulting DOM trace Π′. In other words, the program under evaluation and the DOM trace are always "in sync": the first action to be executed always corresponds to the first DOM. The ForData rules are perhaps the most interesting. ForData-1 first evaluates the data expression to a list ℓ, and then invokes ForData-2, a helper rule that executes all iterations of the loop until termination. The key observation is: similar to the fold function from Example 3.1, ForData also performs dependent iterations; that is, an iteration has to be executed under a context computed by its previous iteration. In particular, the input DOM trace carries the dependency. This data-dependent feature is actually not specific to ForData: all loops in our language are data-dependent, and in fact, Seq is too. Prior work cannot reduce the space of our web automation programs. Our lifted interpretation idea is able to reduce this space, and we present how it works next.

Algorithm 1. The top-level synthesis algorithm. Its main loop runs while A is not saturated and the timeout has not been reached; a key step is A_e := EvaluateFTA(A_s, GetContexts(q_0)), where GetContexts gives all contexts in q_0's footprint.
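The replay behavior of the Seq and Click rules can be illustrated with a small sketch (our own simplification: selector expressions are modeled as Python functions, and only these two statement forms are handled):

```python
# Simulated replay ("trace semantics") for a tiny fragment: Seq and Click.
# Evaluating a statement consumes the first DOM from the DOM trace, so the
# program and the trace stay "in sync"; no browser is involved.

def replay(prog, dom_trace, env):
    """Return (action_trace, remaining_dom_trace) for a program."""
    kind = prog[0]
    if kind == "seq":                       # Seq: evaluate both parts in order,
        a1, dom_trace = replay(prog[1], dom_trace, env)
        a2, dom_trace = replay(prog[2], dom_trace, env)
        return a1 + a2, dom_trace           # concatenating the action traces
    if kind == "click":                     # Click: resolve the selector under
        selector = prog[1](env, dom_trace[0])   # the first DOM, then pop it
        return [("Click", selector)], dom_trace[1:]
    raise ValueError(f"unknown statement {kind}")

# A hypothetical selector expression that mentions a local variable v:
sel = lambda env, dom: f"//li[{env['v']}]"
prog = ("seq", ("click", sel), ("click", sel))
actions, rest = replay(prog, dom_trace=["dom1", "dom2", "dom3"], env={"v": 2})
```

Each Click pops exactly one DOM, so after two clicks one DOM remains, mirroring the paper's requirement that the DOM trace has one more DOM than the number of actions.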

Top-Level Synthesis Algorithm
Algorithm 1 shows the top-level algorithm that synthesizes web automation programs from demonstrations. It shares the same interface as WebRobot's algorithm and thus can be directly integrated with WebRobot's front-end UI. At a high level, it takes as input an action trace a, a DOM trace Π, and input data D. It returns a program P that generalizes a; i.e., given Π and Γ = {x ↦ D}, evaluating P using the trace semantics produces a′ such that a is a strict prefix of a′. This generalization is possible because we require Π to have one more DOM than the number of actions in a.
To synthesize P, we first find all programs that reproduce a (i.e., whose output trace a′ has a as a prefix); notably, this process compresses a large number of programs into an FTA A. Then, line 8 picks a smallest program P from A that generalizes a. If no such P exists in A, the algorithm returns null. Now let us dive into the algorithm a bit more, though more details will be described in subsequent sections. Line 1 initializes A based on the input traces, such that A stores all loop-free programs that are guaranteed to reproduce a. The reason we base our synthesis on the input DOM trace Π is that selector expressions are not given a priori; they become known only when Π is available. The input action trace a is used to further guide the synthesis. The following example briefly illustrates what an initial FTA looks like; we defer the more detailed explanation to Section 4.4.
Figure 9 shows the abstract syntax tree for a, and Figure 10 gives the corresponding initial FTA A. Here, we write s_i as a shorthand for the selector in action a_i. Note that a and A in the two figures are pretty much "isomorphic", except that each action in A has multiple candidate selectors as "leaf transitions" (see each dashed circle), whereas each action in a has only one selector. Section 4.4 presents more details on how A is constructed.
Given the initial A, we then enter a loop (lines 2-7) that iteratively adds loopy programs to A, until no new programs can be found (i.e., A saturates) or a predefined timeout is reached. In each iteration, we non-deterministically pick 2n consecutive Seq transitions τ_1, ⋯, τ_2n (line 4), and then perform three key steps: (1) SpeculateFTA guesses an FTA A_s based on these 2n transitions, (2) EvaluateFTA performs lifted interpretation over A_s, yielding another FTA A_e, and (3) MergeFTAs merges A_e into A. The resulting A at line 7, compared to A before the merge, includes the new loopy programs synthesized during this iteration, with the same guarantee that all programs in A still reproduce a. The generalization step takes place in (1), where we reroll a slice of statements from a program in A into a loop, which is then stored in A_s. This loop-rerolling step, however, is speculative, meaning that some rerolled loops may not be correct. Therefore, in step (2), we use EvaluateFTA to check all loops and retain only those that can indeed reproduce a; this EvaluateFTA algorithm (i.e., lifted interpretation) is the key contribution of this paper.
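Structurally, this speculate-validate-merge loop resembles the following skeleton (a deliberately simplified sketch over plain sets of programs, with hypothetical function names; the real algorithm operates on FTAs):

```python
# Skeleton of the speculate/evaluate/merge loop: repeatedly propose
# candidate programs, keep only validated ones, and merge them in,
# until saturation (no new programs) or a budget runs out.

def synthesize_loop(initial, speculate, validate, budget=100):
    """Grow a set of programs until saturation or the budget is exhausted."""
    current = set(initial)
    for _ in range(budget):
        candidates = speculate(current)          # SpeculateFTA analogue
        new = {p for p in candidates if validate(p)} - current
        if not new:                              # saturated: nothing new
            break
        current |= new                           # MergeFTAs analogue
    return current

# Toy instantiation: grow numbers by +1, with validation capping at 5.
result = synthesize_loop({0}, lambda c: {p + 1 for p in c}, lambda p: p <= 5)
# -> {0, 1, 2, 3, 4, 5}
```

The validation step is where lifted interpretation slots in: in the real algorithm, `validate` is EvaluateFTA checking that a speculated loop reproduces the action trace.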
In what follows, we explain how each step works in more detail. We begin with EvaluateFTA in Section 4.3, assuming A_s and its corresponding input contexts are given. Then, in subsequent sections, we explain how FTA initialization, speculation, merging, and ranking work, respectively.

Lifted Interpretation
As mentioned in Section 3, our lifted interpretation uses the following key judgment.
c ⊢ q; Δ ⇝ (q′, Δ′)

In our domain, a context c consists of a DOM trace Π and a binding environment Γ. Figure 12 shows the EvaluateFTA rules that implement lifted interpretation; we suggest that readers consult Figure 11 at the same time. Rule (1) reduces multi-context evaluation to single-context evaluation. We note that the evaluation under c_i relies on the previous evaluation result for c_{i-1}. To evaluate a state q under a single context c, Rule (2) further reduces it to evaluating each of q's incoming transitions. The remaining rules evaluate the various types of transitions; they share the same principle, as illustrated in Figure 11. In particular, given a transition τ = f(q_1, ⋯, q_n) → q and a context c: (1) First, we recursively evaluate each argument state q_i. The input context under which to evaluate each q_i, and the order in which to evaluate them, depend on f's semantics. (2) Given each resulting state q′_i for q_i, we obtain the output v of τ for c. Note that v is computed compositionally, from the corresponding outputs associated with the q′_i and per f's semantics.
(3) Finally, we construct the state q′ that τ evaluates to. In particular, q′ includes c ↦ v as a behavior in its footprint, as well as all behaviors from q's footprint Ω. Let us examine the rules in detail. Rule (3) is a base case for a nullary transition with no argument states: it directly evaluates the selector expression and yields a state q′ whose new behavior maps the context (Π, Γ) to the resulting selector. Rule (4) is more interesting. We first evaluate q_1 to q′_1 under the context (Π, Γ): note that here we use τ's context to evaluate q_1, due to Click's semantics. Then, given q′_1, we obtain its output, which is later used to construct the output trace a′ for τ. Finally, we construct q′ with footprint Ω′, which includes the new behavior we just created based on q′_1 and Click's semantics, as well as everything from Ω. Rule (5) considers a Seq transition with two argument states. We first evaluate q_1 to q′_1. However, before evaluating q_2, we need to obtain Π′_1 from q′_1 to form the context under which to evaluate q_2. The output action traces a′_1, a′_2 for q′_1, q′_2 form the output action trace for τ. Finally, we construct q′ with Ω′, as in the previous rules. Rule (6) is another base case, for skip. Rule (7) also concerns a base case: if the input DOM trace is empty, it yields the sub-FTA rooted at q. Note that, in general, q would also include the behavior ([], Γ) ↦ ([], []) in its footprint; Rule (7) does not show this explicitly, because we assume all states implicitly have this behavior. Now let us look at the most exciting rules, those for loops. Consider ForData: its first argument is a data expression that yields a list ℓ, and its second argument (i.e., the loop body) is evaluated with the loop variable bound to each element from ℓ. To evaluate q with an incoming ForData transition, Rule (8) first evaluates q_1 to q′_1 and obtains ℓ from q′_1. Then, we evaluate the "loop body" state q_2 under a context with ℓ: Rule (9) presents how this evaluation works. Intuitively, Rule (9) performs a series of |ℓ| evaluations, one per loop iteration.

Figure 12. A subset of rules for EvaluateFTA; the complete set can be found in the appendix.
The final state is q′_|ℓ|, which encapsulates information from all |ℓ| iterations. Popping back up to Rule (8): given q′_2, we obtain the output traces a′_i for all iterations, which are then concatenated to form the output trace in q′. We skip the discussion of the other loop types, as they are very similar to ForData.
Example 4.2. Consider the task from Example 4.1. Suppose A_s shown in Figure 13 is the FTA speculated from the initial FTA in Figure 10. Section 4.5 will later explain how SpeculateFTA generates this A_s, but in brief, it contains ForData loops inferred from the initial FTA. Each state in A_s is annotated with a grammar symbol and an empty footprint (see a few examples in Figure 13).
Given a context consisting of a DOM trace [d1, · · · , d6] and a binding environment Γ (both of which are from Example 4.1), EvaluateFTA applies lifted interpretation to A_s: the process is very similar to that in Example 3.1, except that now we use trace semantics for a different syntax. Figure 14 illustrates a part of the FTA A′ returned by EvaluateFTA. Here, we show two states q1 and q2, with different footprints, that the final state evaluates to. In particular, [a1, · · · , a6] is the desired action trace (see Example 4.1), where a3 scrapes "Oh no - pwned" and a6 scrapes "Good news - no pwnage found". The action a′3, however, scrapes "Good news - no pwnage found", which is undesired. The reason we can have a′3 is that programs in ({q2}, Δ′) use an undesired selector expression (such as the full selector expression described in Example 1.1) that does not generalize. More specifically, evaluating the corresponding state from A_s (with a full selector expression in ScrapeText) can yield state q6 in A′: that is, for both emails, the full selector always scrapes "Good news - no pwnage found". This in turn means that q0 from A_s can evaluate to q4 in A′. Finally, this leads to the aforementioned state q2.

FTA Initialization
Now let us circle back and describe the FTA initialization procedure, which we note is specific to PBD and web automation. Figure 15 shows the key initialization rule, which constructs the initial FTA from the action trace. In particular, all states q0, · · · , qn have grammar symbol P. In addition to Seq, we also have transitions for statements (like Click and ScrapeText) and selectors; Figure 15 also shows how to construct transitions for Click and its candidate selectors from the action trace.

Algorithm 2. Algorithm for SpeculateForData that speculates an FTA of ForData loops. (Its pseudocode creates fresh states and transitions via anti-unification and parametrization, and returns an FTA with a distinguished final state.)

Speculating FTAs
Section 4.3 assumed a given speculated FTA A_s. In this section, let us unpack the SpeculateFTA algorithm that infers A_s. Here, A_s is an FTA that contains candidate loops (potentially nested).
Figure 17 illustrates how to speculate candidate ForData loops from two consecutive transitions in the initial FTA (Figure 16). Algorithm 2 describes the algorithm more formally. We suggest readers simultaneously consult Figures 16 and 17 and Algorithm 2. The first step is anti-unification (line 1): we synchronously traverse the sub-FTAs rooted at corresponding pairs of states, and compute a set of anti-unifiers. For ForData, an anti-unifier is simply a data expression that ForData may iterate over. The second step is parametrization (line 2): given the anti-unifiers, we traverse each sub-FTA and build a new FTA with a fresh final state and transitions. The third and last step is to construct the final speculated FTA A_s (lines 3-4).

Figure 18. Anti-unification rules.
Each state in A_s is annotated with an empty footprint, because we have yet to evaluate any programs. In other words, A_s is purely a syntactic compression, without using any OE at all. We refer interested readers to the appendix for a more complete description of the speculation algorithm, which handles other types of loops. In what follows, we explain our anti-unification and parametrization algorithms.
Anti-unification. Given two states q1, q2 from FTA A with transitions Δ, anti-unification traverses expressions in SubFTA(q1, A) and SubFTA(q2, A) in a synchronized fashion, and returns a set of anti-unifiers that may be iterated over by the speculated loops. In particular, an expression belongs to this set if it is an anti-unifier for q1, q2 given Δ. Figure 18 presents the detailed rules. Rule (1) says that, to anti-unify q1, q2 with the same incoming transition, we anti-unify their argument states. Rule (2) does the same but for loops: we anti-unify the expressions (i.e., the first argument) being looped over. Rules (3) and (4) concern the anti-unification of data expressions; the idea is to look for an increment pattern beginning with 1. Rules (5) and (6) anti-unify selector expressions, using a more flexible pattern that allows starting from any index k (not necessarily k = 1).
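The increment pattern of Rules (3)-(4) can be sketched as follows, assuming (purely for illustration; these encodings are not from the paper) that a data expression is a tuple of the form ("Nth", list_name, index):

```python
def anti_unify_data(e1, e2):
    """Two data expressions anti-unify if they index the same list at
    consecutive positions starting from 1; the anti-unifier abstracts the
    index into a variable (here the string "i")."""
    op1, lst1, k1 = e1
    op2, lst2, k2 = e2
    if op1 == op2 == "Nth" and lst1 == lst2 and k1 == 1 and k2 == k1 + 1:
        return ("Nth", lst1, "i")
    return None  # no anti-unifier for this pair
```

Rules (5)-(6) for selector expressions would relax the `k1 == 1` check to allow an arbitrary starting index.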
Parametrization. Given all anti-unifiers and a state q from A with transitions Δ, Parametrize constructs from SubFTA(q, A) a fresh FTA A′ (with transitions Δ′ and final state q′) in two steps. First, we make a fresh copy of each state from SubFTA(q, A), keeping the same grammar symbol but resetting the footprint to empty. This results in a mapping that maps every state in SubFTA(q, A) to a state in A′, and hence each transition in SubFTA(q, A) to a transition in A′. Then, we add new transitions labeled with parametrized selector/data expressions to A′, given each anti-unifier, the mapping, and the state q from A with transitions Δ. In other words, the actual parametrization takes place in this step. Figure 19 presents our parametrization rules. Rule (1) parametrizes any transition in Δ that uses a selector expression with the anti-unifier as a prefix; this rule adds a new transition in which the prefix is replaced by the loop variable. Rule (2) parametrizes data expressions in a similar manner.
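Rule (1) of parametrization can be sketched as below, under the assumption (for illustration only) that a selector is a tuple of path steps and the anti-unifier is a prefix of such steps:

```python
def parametrize_selector(selector, anti_unifier, var="x"):
    """If the selector uses the anti-unifier as a prefix (the form u ⊕ s),
    return a new selector with that prefix replaced by the loop variable;
    otherwise the rule does not apply."""
    n = len(anti_unifier)
    if tuple(selector[:n]) == tuple(anti_unifier):
        return (var,) + tuple(selector[n:])
    return None
```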
Constructing A_s. Finally, lines 3-4 connect the created states and transitions, as shown in Figure 17.

Merging FTAs
Let us explain the last two procedures in Algorithm 1.

Merging A_s into A. Intuitively, MergeFTAs (line 7) incorporates loops from A_s into SubFTA(q0, A). Figure 20 illustrates how this works. We first construct a state q′ and a Seq transition that connects a final state q of A_s and q′ to q0. In particular, q′ has grammar symbol P, and its footprint is constructed based on the footprints of q0 and q, according to Seq's semantics. Note the constraint a′1 ++ a′2 = a′: we only keep loops in SubFTA(q, A_s) that yield a prefix a′1 of a′, because otherwise the remaining context is not reachable for q′, at least not in this case. For each such final state q of A_s, we create the aforementioned q′ and the Seq transition, and add them together with all transitions in A_s to the set of transitions of A. Observe that, if q′ already exists in A, then MergeFTAs effectively connects q and an existing state q′ in A to q0, which is also in A.
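The prefix constraint used by MergeFTAs can be sketched as a simple check over plain lists (a simplification; the actual traces are sequences of actions recorded in footprints):

```python
def yields_reachable_context(loop_trace, full_trace):
    """A speculated loop is kept only if its output trace is a prefix of the
    overall output trace, i.e., loop_trace ++ rest == full_trace for some
    rest; otherwise the remaining context is unreachable."""
    return full_trace[:len(loop_trace)] == loop_trace
```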
Ranking. The Ranking procedure (line 8) is fairly straightforward. It first runs EvaluateFTA on A using Π and the binding environment as the context. Note that the last DOM d_{n+1} in Π does not have a demonstrated action in the input action trace, because our goal is to use the synthesized program to automate the unseen actions. EvaluateFTA gives an FTA A′ containing programs with different predicted actions a_{n+1}. We pick a program with the smallest size and return it as the final synthesized program.
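The ranking step itself is easy to sketch: given candidate (program, predicted action) pairs produced by evaluation, pick a smallest program. All names below are illustrative, not from the paper.

```python
def rank(candidates, size):
    """candidates: list of (program, predicted_action) pairs.
    Return the pair whose program has the smallest size; min() breaks
    ties by keeping the first such pair encountered."""
    return min(candidates, key=lambda pair: size(pair[0]))

# Toy example with programs encoded as strings and size measured by len.
best = rank([("Seq(a, Seq(b, c))", "click"), ("ForData(L, p)", "scrape")], len)
```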

Soundness and Completeness
Theorem 4.5. Given an action trace, a DOM trace Π, and input data, our synthesis algorithm always terminates. Moreover, if there exists a program in our grammar (shown in Figure 8) that generalizes the action trace (given Π and the input data) and satisfies the conditions that (1) every loop has at least two iterations exhibited in the trace and (2) its final expression is a loop, then our synthesis algorithm (shown in Algorithm 1) returns a program that generalizes the action trace upon FTA saturation.

Interactive Synthesis with Incremental FTA Construction
Interactive programming-by-demonstration. As with WebRobot [Dong et al. 2022], our synthesis technique can also be applied in an interactive setting: given an action trace, we synthesize a program and use it to predict the next action; if the predicted action is not intended, the user manually demonstrates a correct one. This leads to new action and DOM traces, both of which are fed to our algorithm again. This process repeats until the user obtains an intended program.
Incremental FTA construction. We incrementalize our Algorithm 1 based on this interactive setup. Given the new traces in iteration i+1, and given the FTA constructed in iteration i, we still build the new FTA with the same guarantee as in Algorithm 1, but not from scratch. Our key insight is that we can re-use the validated loops from the previous FTA, although we need to evaluate them against the new DOM trace, as some of them may not reproduce the new action trace. Essentially, our incremental algorithm is still based on guess-and-check, but we have a second type of speculation that directly takes sub-FTAs from the previous FTA as speculated FTAs; this generates high-quality speculated FTAs more efficiently. The new FTA is still initialized in the same way, but this time using the new traces. We still perform SpeculateFTA and EvaluateFTA on the new FTA, but in a much smaller scope this time: each speculated loop must involve the newly demonstrated action, because otherwise the previous FTA had already considered it. The MergeFTAs and Ranking procedures remain the same.

EVALUATION
This section describes a series of experiments designed to answer the following questions:
• RQ1: Can Arborist efficiently synthesize programs for challenging web automation tasks? How does it compare against state-of-the-art techniques?
• RQ2: How necessary is it to use an expressive language in order to have a generalizable program?
In particular, is it important to consider a large space of candidate selectors?
• RQ3: How does Arborist scale with respect to the number of candidate selectors considered?
• RQ4: How useful are various ideas proposed in this work?
Web automation tasks. To answer these questions, we construct a collection of web automation tasks from two different sources. First, we include all 76 tasks that were used to evaluate WebRobot; details about these tasks can be found in [Dong et al. 2022]. Second, we curate 55 new tasks, each of which has an English description of the task logic over one or more websites. All of our new tasks are curated based on real-life problems (e.g., those from the iMacros forum) and involve modern, popular websites (such as Amazon, UPS, Craigslist, IMDb) with complex webpages, whereas many of WebRobot's benchmarks involve legacy websites. These new websites have deeply nested DOM structures, which require searching a significantly larger space of candidate selectors. In particular, all 131 tasks involve data extraction, 45 of them involve data entry, 80 require navigation across webpages, and 48 involve pagination. Some of these tasks involve multiple types: for instance, 31 of them involve data entry, data extraction, and webpage navigation.
Ground-truth Selenium programs. For each task, we obtain a ground-truth automation program P_gt (using the Selenium WebDriver framework). In particular, we reuse the 76 ground-truth Selenium programs from WebRobot, and manually write 55 programs for our new tasks. On average, these Selenium programs consist of 43 lines of code, with a max of 147 lines. In general, it takes about half an hour to a few hours for us to write an automation program, depending on the complexity of the webpages and the task logic.
Benchmarks. In our evaluation, a benchmark is defined as a tuple consisting of an action trace, a DOM trace Π, and input data. We obtain one benchmark for each task by running the corresponding P_gt in the browser (given the input data if P_gt involves data entry); during execution, we record the action trace and the DOM trace, where each action a_i is performed on DOM d_i. Note that selectors in actions are recorded as full XPath expressions.
Candidate selectors. For any full XPath selector s in an action a_i, we also record a set of candidate selectors for s. In particular, a candidate selector c is a concrete selector expression (i.e., one without the loop variable) from our grammar (see Figure 8) that locates the same DOM element on d_i as s; in other words, c evaluates to the same element given d_i. As mentioned in Example 1.1 and Section 4.2, it is important to consider candidate selectors, since full XPath expressions typically do not generalize. Candidate selectors also affect the overall search space, as explained below.
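The candidate-selector criterion can be sketched as follows, under assumptions not from the paper: the "DOM" is a toy dict mapping selector strings directly to element ids, and `evaluate` is a hypothetical stand-in for the selector interpreter.

```python
def candidate_selectors(universe, evaluate, dom, target_elem):
    """A selector is a candidate for an action iff it locates the same DOM
    element as the action's recorded full XPath."""
    return [c for c in universe if evaluate(c, dom) == target_elem]

# Toy "DOM": selector strings mapped directly to element ids, with a
# trivial lookup interpreter.
dom = {
    "/html/body/div/h2": "e1",
    "//h2[@aria-expanded='true']": "e1",
    "//h2[@class='x']": "e2",
}
cands = candidate_selectors(list(dom), lambda c, d: d.get(c), dom, "e1")
```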
3 We terminate P_gt after every loop from P_gt has been executed for three full iterations, or when the recorded action trace reaches 500 actions, whichever yields a longer action trace. This essentially serves as a "timeout" to avoid running P_gt for an unnecessarily long time.
Program space. In this work, the search space of programs for a benchmark with a given action trace is defined jointly by the grammar from Figure 8 and the candidate selectors for all actions in the trace. This is because Figure 8 does not specify a priori the space of selectors and predicates. Instead, they are defined once the action and DOM traces are given: e.g., the space of selectors includes all candidate selectors. It is very hard (if not impossible) to define this space a priori without the DOMs. Obviously, the overall search space of programs grows as we consider more candidate selectors.
In this section, we evaluate Arborist against three metrics: (i) how many tasks it can successfully synthesize intended programs for, (ii) how much synthesis time it takes, and (iii) how many user-demonstrated actions it requires in order to synthesize intended programs. In other words, we evaluate Arborist's effectiveness, efficiency, and generalization power. We also compare Arborist against the state of the art, especially on our new tasks, which are more challenging.
Setup. We use the same setup as the WebRobot paper [Dong et al. 2022]. In particular, given a benchmark whose action and DOM traces have n actions and DOMs respectively, we create n − 1 tests; the i-th test consists of the first i actions and the corresponding DOM trace. For each action, we consider all candidate selectors in our grammar with at most 3 predicates; a selector with more than three predicates is not considered, even if it can also locate the same DOM element. We run Arborist in a way that simulates an interactive PBD process, same as how WebRobot was evaluated. That is, we feed all tests to Arborist in sequence: we run Arborist on the i-th test, obtain a synthesized program P_i, and check if P_i predicts a_{i+1} (i.e., P_i yields a_{i+1} given the DOM trace). If not, a_{i+1} is counted as a user-demonstrated action; otherwise, a_{i+1} can be correctly predicted and thus is not counted. We always count the first action a_1 as user-demonstrated. Furthermore, for each test we use a 1-second timeout and record the time it takes to return P_i. We run Arborist incrementally (as described in Section 4.8) in this experiment: for each test, it resumes synthesis based on FTAs from previous tests/iterations. Finally, we inspect whether the program synthesized in the last iteration is an intended program: if so, the corresponding benchmark is counted as solved; otherwise, unsolved.
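The counting protocol above can be sketched as follows, where `synthesize_and_predict` is a hypothetical stand-in for running the synthesizer on a prefix of the trace and predicting the next action:

```python
def count_demonstrated(actions, synthesize_and_predict):
    """Replay the trace: action i+1 counts as user-demonstrated iff the
    program synthesized from the first i actions fails to predict it.
    The first action always counts as demonstrated."""
    demonstrated = 1
    for i in range(1, len(actions)):
        if synthesize_and_predict(actions[:i]) != actions[i]:
            demonstrated += 1
    return demonstrated

# Toy predictor that always predicts a repeat of the last observed action:
# only the transition from "a" to "b" needs to be demonstrated.
n = count_demonstrated(["a", "a", "b", "b"], lambda prefix: prefix[-1])
```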
Baselines. Among the three baselines, we focus on the following two.
• WebRobot, which is the original tool from the WebRobot work [Dong et al. 2022]. We note that, while its underlying algorithm is complete in theory, the implementation is not. For example, it restricts the number of parametrized selectors (to five) when parametrizing actions during speculation. These heuristics seemed to help avoid excessively slowing down the search, without severely hindering completeness for the tasks considered in the WebRobot work.
• WebRobot-extended, which is an adapted version of WebRobot that uses the extended language from Figure 8 (which allows ForSelectors with start index k ≥ 1). In this experiment, we range k from 1 to 3. This baseline still keeps all heuristics from WebRobot (i.e., its search is incomplete).
We use the same space of candidate selectors as Arborist for these two baselines. The third baseline is Helena [Chasins 2019], which is also a PBD-based web automation tool.
RQ1 take-away:
• Arborist can synthesize intended programs for 93.9% of our benchmarks.
• It typically takes Arborist subseconds to synthesize programs from demonstration.
• Arborist uses a median of 12 user-demonstrated actions to generalize.
• Arborist can solve more benchmarks using less time than state-of-the-art techniques.
Figure 21(a) reports how many benchmarks each tool solves, and Figure 21(b) reports the distribution of their corresponding synthesis times. Note that for both figures, we show the aggregated data across all 131 benchmarks, as well as separate data for prior (76) and new (55) tasks. Let us inspect Figure 21(a) first. Across all 131 benchmarks, Arborist can synthesize an intended program for 93.9% (i.e., 123) of them, whereas baselines solve at most 68.7% (i.e., 90). This is a large gap, because baselines solve significantly fewer new tasks (which are very challenging): in particular, Arborist can solve 53 out of 55 (i.e., 96.4%), which is 2.5x more than the best baseline (i.e., 21, or 38.2%). These new tasks involve complex webpages and task logics, which require using nested loops and searching for selectors with more predicates in a larger space. This makes them significantly more challenging than prior tasks; as a result, WebRobot's underlying enumeration-based algorithm fundamentally cannot scale to this level of complexity. For the 76 prior tasks from the WebRobot paper (which baselines were developed and engineered on), Arborist still outperforms baselines by one benchmark. It turns out this benchmark involves two loops whose start indices must be 2 and 3, which cannot be expressed in WebRobot's original language. WebRobot-extended, however, solves this benchmark, since it uses a richer language. Arborist uses the same (extended) language, and therefore is able to synthesize an intended program as well.
In addition to solving a strict superset of the benchmarks solved by baselines, Arborist is also significantly faster. Figure 21(b) reports statistics of the synthesis times. In particular, for each solved benchmark, we record the maximum synthesis time across all tests, and report the distribution of these times. For each tool, Figure 21(b) presents the quartile statistics of synthesis times. Across all 123 benchmarks solved by Arborist, 120 of them do not even use up the 1-second timeout. In contrast, baselines solve fewer benchmarks and time out on 15 benchmarks. Recall that both baselines have internal heuristics that unsoundly prune the search space for faster search. We tested variants of them with these heuristics removed: the best one can solve 55 benchmarks, with 46 timeouts. In other words, for many benchmarks, baselines cannot exhaust the entire search space, although they stumbled upon a correct program before timeout. With a longer timeout of 10 seconds, baselines can only solve 9 more benchmarks. On the other hand, with a 1-second timeout, Arborist times out on 3 benchmarks; they all require doubly nested loops and are among the most challenging ones. Arborist solves them all (albeit reaching timeout), while baselines solve two.
Arborist uses a median of 12 user-demonstrated actions, which is in line with that for WebRobot. This is reasonable, as we use the same search space for all tools in this experiment. While Arborist searches more programs than baselines (which oftentimes time out and hence search a small subset), our simple ranking heuristics seem to be quite effective at selecting generalizable programs.
Discussion. Arborist failed to solve 8 benchmarks, including 6 from prior work (due to limitations of the web automation language, as also explained in the WebRobot paper) and 2 new ones (which can be solved using a 10-second timeout). Careful readers may wonder why WebRobot-extended solves fewer benchmarks than WebRobot, despite using a richer language. This is due to the poor performance of its underlying enumeration-based algorithm: using a 1-second timeout, WebRobot-extended cannot even find intended programs for some benchmarks that WebRobot solves.
Detailed results. Recall that Arborist internally has two key modules, namely SpeculateFTA and EvaluateFTA, which typically use most of the running time. Among them, on average, the former takes 20% of the time and the latter 80%, across all benchmarks. The final FTA has an average of 1714 states. The final synthesized programs on average have 6 expressions, and the largest one has 20. Among these programs, 76 use at least one doubly-nested loop, and 12 involve at least a three-level loop.
Arborist vs. Helena. Among the 76 WebRobot benchmarks, Helena's PBD technique was able to synthesize intended programs from demonstrations (provided by us manually) for 13 benchmarks. For the 55 new tasks, Helena solved 11. By contrast, Arborist solved 70 and 53 respectively.

RQ2: How Important Is It To Consider Many Candidate Selectors?
The search space considered in RQ1 uses the grammar from Figure 8 with candidate selectors of size up to three (measured by the number of predicates). This is a quite expressive language, containing at least one generalizable program for over 93% of our tasks (as Arborist was able to solve them). However, one might ask: is this high expressiveness really necessary? This is an important question, because if not, advanced synthesis techniques like Arborist may not be necessary in practice.
Setup. In this section, we investigate the impact of candidate selectors on the expressiveness of the resulting search space: that is, given a set S of candidate selectors, whether or not the corresponding program space has at least one generalizable program. We choose to focus on candidate selectors in this experiment, because prior work [Dong et al. 2022] has already shown the necessity of the operators from the grammar in Figure 8.
More specifically, given a benchmark with an action trace, for each action a_i with full XPath selector s, we use the following ways to construct the set S of candidate selectors for a_i.
• S includes all candidate selectors for s up to a certain size. This is perhaps the simplest heuristic one can design; RQ1 uses 3 as the max size, which was shown to be sufficient for most of our benchmarks. Therefore, in this experiment, we vary the max size from 1 to 3, and investigate how it impacts expressiveness. For each size, we run Arborist using the corresponding S, and record the number of benchmarks solved. Here, "solved" means an intended program can be found by Arborist, which is a witness that the corresponding search space is expressive enough.4
• S contains candidate selectors sampled (uniformly at random) from all candidate selectors of size up to 3. We vary |S| from 1 to 1500. For each |S|, we run Arborist with the corresponding S and record the number of benchmarks solved; we repeat this 8 times. This gives us a finer-grained, more continuous view of the impact of candidate selectors.
While one could certainly craft other heuristics for constructing S, we do not consider them, because (i) it is impossible to enumerate all heuristics in the first place, and (ii) human-crafted heuristics from prior work [Chasins et al. 2018; Dong et al. 2022] do not work well for challenging tasks. Another note is that, since we use Arborist merely as a means to check if the search space has a generalizable program, synthesis time is not relevant in this experiment.
For each benchmark with a given S, we run Arborist incrementally until it reaches the end of the action trace, same as in RQ1. However, in RQ2, we use 10 seconds as the timeout per iteration, rather than 1 second as in RQ1; this allows Arborist to exhaustively search the program space, such that we can more confidently conclude on its expressiveness. If the final synthesized program is intended, we count it as solved (meaning the corresponding search space is expressive enough).

RQ2 take-away:
In general, an expressive search space should consider a large number of selectors (at least at the level of hundreds) for each webpage.
Results. If using only the full XPath expressions from the input action trace, Arborist manages to solve 46 benchmarks (out of 131 total), while terminating on the remaining 85 without synthesizing an intended program. That is, the search spaces of those 85 benchmarks are exhausted before timeout. This confirms that full XPath expressions typically do not generalize, and we need to consider more candidate selectors in order to solve more tasks. If we additionally include all candidate selectors of size 1, the number of solved benchmarks bumps up to 87, which is 66% of all benchmarks. Arborist does not reach timeout on any of the rest, indicating the need to consider even more selectors. Further including all candidate selectors of size 2 allows Arborist to solve 122 benchmarks, which is quite close to our results in RQ1. Again, no timeout is observed on the remaining benchmarks. We note that the selector space at this point is already quite large: on average, we have 305 selectors of size up to two (in our grammar) per DOM across our benchmarks; the median and max are 321 and 385 respectively. In other words, using a simple size-based heuristic, we necessarily need to consider multiple hundreds of candidate selectors per DOM in order to automate a decent number of tasks. Finally, when considering all candidate selectors of size up to 3 (same as in RQ1), 125 benchmarks are confirmed to admit intended programs; no timeouts are observed. The remaining 6 benchmarks, upon manual inspection, cannot be solved within Arborist's language (to the best of our knowledge). While encouraging, this high expressiveness comes at the cost of searching among an average of 7627 selectors per DOM, with the median and max being 8066 and 9649, across all our benchmarks.
Figure 22 presents a finer-grained view of how the expressiveness increases as we range the number of candidate selectors from 0 to 1500 using a small increment. The x-axis is the number of selectors randomly sampled from the universe of all candidate selectors of size up to three. The y-axis is the percentage of benchmarks (out of all 131) solved by Arborist; that is, their corresponding search spaces are confirmed to contain at least one desired program. For each x, the figure gives the max, min, and mean of the percentage of solved benchmarks across 8 runs. In total, there are only 6 benchmarks for which Arborist times out without returning an intended program. As also mentioned earlier, this is due to the limitation of the web automation language, rather than Arborist not being able to exhaust the search space. Therefore, we believe Figure 22 precisely describes all benchmarks whose program spaces contain a desired program.
Let us inspect Figure 22 more closely. First, the percentage grows quickly as x goes up from 0 to 100. At x = 100, across 8 runs, an average of 73 benchmarks are solved, while the max and min are 78 and 70 respectively. There are 41 benchmarks not solved in any of the runs; notably, for all of these 41 benchmarks, Arborist terminates before timeout without returning a generalizable program. This confirms that with (only) 100 (randomly sampled) selectors, the corresponding search spaces of these 41 benchmarks do not contain any generalizable programs. Furthermore, 62 of the 90 benchmarks solved in at least one run (using 100 selectors) are from prior work. Our observation is that these 62 benchmarks are relatively "easier" compared to our newly curated tasks: their task logics are simpler and their solutions use smaller selectors.
On the other hand, the growth slows down significantly after x = 100. We observe a pretty long tail of benchmarks that require multiple hundreds, or even more than a thousand, selectors to be solved. For instance, increasing x from 100 to 200 only grows the solved percentage from 56% to 71% (in terms of the average). To get another 15% bump, we need at least 400 selectors. Notably, benchmarks solved in the [100, 1000] range mostly come from our new tasks: these problems have to be solved with complex selectors, chosen from a larger space, that are used in complex nested loops.
Finally, we seem to reach a plateau after x = 1400, beyond which increasing the number of selectors does not seem to grow the number of solved benchmarks anymore. The maximum number of solved benchmarks we observed in this experiment (based on randomly sampling selectors) is 123 (i.e., 93.9% of 131 total). (In the previous experiment, which includes all candidate selectors based on their size, we observed 125 benchmarks solved when using all selectors up to size 3.)

RQ3: How Does Arborist Scale against Number of Candidate Selectors?
Following up on RQ2, one may wonder how Arborist would scale with more candidate selectors than those considered in RQ1 and RQ2. In this experiment, we stress test Arborist's search algorithm given a very large number of candidate selectors.
Setup. We define search efficiency as the amount of time for a search algorithm to exhaust a given search space. For Arborist, this means: given a benchmark with an action trace and candidate selectors for each action, how long it takes for the FTA A to saturate (i.e., until no more new programs can be found) before timeout. We choose to use the exhaustion time, rather than the time to discover a generalizable program (i.e., the synthesis time), because the synthesis time oftentimes depends on the particular search order used, while the exhaustion time is more stable. Also note that the exhaustion time does not reflect the synthesis time; the latter tends to be much shorter. This experiment does not evaluate Arborist's synthesis time; see RQ1 for its synthesis times.
Specifically, we sample a set S of candidate selectors from a universe of all candidate selectors of size up to four. This universe is extremely large, with an average of 139,886 selectors per DOM; the median and max are 135,856 and 291,385 respectively. For each benchmark, we vary |S| from 1000 to 10,000. Given each |S|, we run Arborist incrementally on each benchmark, until reaching the end of its action trace. We use the same 10-second timeout per iteration as in RQ2. However, in this experiment, we count the number of benchmarks that are exhausted; that is, those for which Arborist terminates before reaching the timeout in every iteration. Given |S|, we run Arborist on each benchmark 5 times, and record the max, min, and mean number of exhausted benchmarks across all 5 runs. In addition, for each exhausted benchmark, we record its max exhaustion time among all iterations. We also report the distribution of these times across all exhausted benchmarks and across all runs, as a way to quantify Arborist's search efficiency.

RQ3 take-away:
Arborist can search very efficiently: in particular, it can exhaust program spaces that consider multiple thousands of selectors within at most a few seconds.
Results. Figure 23(a) shows, for each |S|, how many benchmarks Arborist can successfully exhaust in 10 seconds; since we have multiple runs for each |S|, we report the max, median, and min across all runs. Figure 23(b) presents the distribution of exhaustion times (same as in RQ1, we report quartile statistics here in RQ3) across all exhausted benchmarks for each |S|. The key take-away message is clear: Arborist scales quite well as the number of selectors increases. For example, with 1,000 candidate selectors, Arborist is able to exhaust the program space for 95% of all 131 benchmarks, with a median exhaustion time of about 0.1 seconds. If we further increase |S| from 1,000 to 5,000, we observe a small drop from 95% to 79% in the percentage of benchmarks that can be exhausted, while the median exhaustion time stays under 1 second. Finally, at the extreme of 10,000 selectors, 68% of benchmarks are exhausted, with a median exhaustion time of 1 second.
While exhaustion is in general fast, it takes even less time to discover a generalizable program, as also mentioned earlier. For example, among those 68% (i.e., 89) exhausted benchmarks, Arborist can discover an intended program for 61 within 1 second (and for 43 under 0.5 seconds). In contrast, WebRobot's enumeration-based algorithm cannot exhaust more than 10 benchmarks (using the same 10-second timeout), even when fed with only 100 selectors. Furthermore, using 10,000 selectors, WebRobot's median solving time (i.e., time to return an intended program) is 10 seconds. These data points again highlight that Arborist's underlying search algorithm is highly efficient.

RQ4: Ablation Studies
Impact of observational equivalence. We consider a variant of Arborist with the OE capability disabled. In other words, this ablation has to enumerate loop bodies (which use loop variables) and does not allow sharing across FTA states that correspond to loop bodies. We evaluate this variant under the RQ1 setup: it solves (i.e., generates an intended program for) 56 benchmarks, with a median synthesis time of 1 second among those solved. In contrast, Arborist solves 123 benchmarks with a median running time of 0.02 seconds across those solved. This again highlights the importance of OE for speeding up the search.
Impact of incremental FTA construction. This ablation builds the FTA A from scratch for each new input trace, without reusing previous FTAs; that is, the incremental FTA construction optimization from Section 4.8 is disabled. We run this ablation using the RQ1 setup. It solves (i.e., returns an intended program for) 105 benchmarks; among these, 40 reach the 1-second timeout, and the median running time is 0.3 seconds. Arborist solves 18 more benchmarks with only 3 timeouts in total and a significantly lower median time of 0.02 seconds.

Case Study: Large Language Models
Given the recent advances in large language models (LLMs) and the exploding interest in applying them to program synthesis, we conduct a case study where we use LLMs to generate web automation programs from demonstrations. This is mainly a sanity check, and we refer interested readers to the appendix for more details. In summary, our key take-away is that LLMs (in particular, GPT-3.5 [OpenAI 2022]) fail to generate semantically correct programs, even for some of the simplest benchmarks. The model can also produce unstable results, claiming a benchmark is unsolvable in one run while outputting a program in another. While these results are poor, it is well known that LLMs are sensitive to the prompting strategy [Si et al. 2022], and there may be a better way to prompt the model that we have not tried. Nevertheless, we believe these results indicate that our benchmarks are quite hard for state-of-the-art LLMs and that further research is needed to better solve these problems.

RELATED WORK
In this section, we briefly discuss some closely related work.
Observational equivalence (OE). OE is a very general concept that captures the indistinguishability of multiple entities based on their observable behaviors. Hennessy and Milner [1980] proposed OE to define the semantics of concurrent programs, where two terms are observationally equivalent whenever they are interchangeable in all observable contexts. The idea of OE has also been adopted in programming-by-example (PBE) [Albarghouthi et al. 2013; Peleg et al. 2020; Udupa et al. 2013] to reduce a large search space of programs, thereby boosting synthesis efficiency. Building upon this concept, our work extends OE-based reduction to programs with local variables.
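The OE-based reduction used in PBE can be made concrete with a small sketch. Below is a minimal bottom-up enumerator in Python over a toy arithmetic grammar; the grammar, function names, and size bound are our own illustration, not Arborist's implementation. Programs are grouped by their output vectors on the example inputs, and only one representative per class is kept.

```python
# Minimal sketch of OE reduction in bottom-up enumerative synthesis.
# The grammar (input variable x, constants 1 and 2, add, mult) and all
# names here are illustrative, not from any tool's actual codebase.

def bottom_up_with_oe(inputs, max_size):
    """Enumerate expressions by AST size, keeping one representative
    per equivalence class of outputs on the given inputs."""
    seen = {}             # output tuple -> representative program (string)
    by_size = {1: []}
    # Size-1 programs: the input variable and small constants.
    for prog, fn in [("x", lambda x: x), ("1", lambda x: 1), ("2", lambda x: 2)]:
        key = tuple(fn(i) for i in inputs)
        if key not in seen:
            seen[key] = prog
            by_size[1].append((prog, fn))
    for size in range(2, max_size + 1):
        by_size[size] = []
        for ls in range(1, size - 1):      # left subterm size; 1 for the operator
            rs = size - 1 - ls
            for lp, lf in by_size.get(ls, []):
                for rp, rf in by_size.get(rs, []):
                    for op, f in [("add", lambda a, b: a + b),
                                  ("mult", lambda a, b: a * b)]:
                        prog = f"{op}({lp},{rp})"
                        fn = (lambda lf, rf, f: lambda x: f(lf(x), rf(x)))(lf, rf, f)
                        key = tuple(fn(i) for i in inputs)
                        if key not in seen:  # OE pruning: drop equivalent programs
                            seen[key] = prog
                            by_size[size].append((prog, fn))
    return seen

table = bottom_up_with_oe([1, 2, 3], 4)
# e.g. add(x,x) and mult(2,x) agree on all inputs, so only the first
# enumerated representative of that class is kept.
```

Note that this works only because every candidate here is closed over the input variable; as the surrounding discussion explains, the technique does not directly apply once subterms contain free local variables.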
Synthesis of programs with local variables. Programs with local variables are evaluated under non-static contexts. To the best of our knowledge, there are no principled approaches to effectively reduce the space of such programs. Prior work [Chen et al. 2020, 2021; Feser et al. 2015; Peleg et al. 2020; Smith and Albarghouthi 2016; Wang et al. 2017b] typically falls back to some form of brute-force enumeration or utilizes domain-specific reasoning to prune the search space of programs with local variables (such as lambda bodies). We propose a principled approach, lifted interpretation, to reduce the space of such programs, thereby speeding up their search.
RESL [Peleg et al. 2020] is especially related to our work: it uses an extended context (same as ours) when searching for lambdas; however, it does not present a general approach to reducing the space of such programs. It clearly articulated the key obstacle to applying OE in general: computing reachable contexts and evaluating programs depend on each other. Our work presents a new algorithm that computes contexts and evaluates programs simultaneously, constructing the equivalence relation over programs while evaluating them, which in turn facilitates the computation of reachable contexts. In contrast, RESL utilizes (manually provided) rules to infer reachable contexts for lambda bodies, given a higher-order sketch. For data-dependent functions (like reduce and fold), RESL falls back to enumeration. Our paper addresses the "infeasible hypothesis" from RESL (see D.3 in its appendix). We believe our work also opens up new ways to further study such program synthesis problems.
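The circular dependency can be seen concretely for fold. The sketch below (plain Python, with illustrative names of our own) records which (accumulator, element) contexts reach a candidate lambda body: different candidate bodies induce different context sets, so the reachable contexts cannot be precomputed independently of the search.

```python
# Sketch of the chicken-and-egg problem for fold-like operators:
# the contexts reaching the lambda body (accumulator values) depend
# on the body itself. Names and encoding are illustrative only.

def fold_contexts(body, xs, seed=0):
    """Return the (accumulator, element) contexts the body is
    evaluated under when folding over xs with the given seed."""
    contexts, acc = [], seed
    for x in xs:
        contexts.append((acc, x))
        acc = body(acc, x)
    return contexts

xs = [1, 2, 3]
# Two candidate bodies already see different contexts after step one:
ctx_a = fold_contexts(lambda a, y: a + y, xs)      # [(0, 1), (1, 2), (3, 3)]
ctx_b = fold_contexts(lambda a, y: a + 2 * y, xs)  # [(0, 1), (2, 2), (6, 3)]
```

Lifted interpretation, as described above, sidesteps this by discovering contexts and evaluating all candidate bodies in the same pass, rather than fixing one body at a time.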
Lifted interpretation. Our lifted interpretation idea can be viewed as a bidirectional approach: it traverses the grammar top-down to generate reachable contexts, during which it builds up programs bottom-up given those contexts. Different from prior work [Gulwani et al. 2011; Lee 2021; Phothilimthana et al. 2016] that enumerates programs bidirectionally, we intertwine the enumeration of contexts and programs. Rosette [Torlak and Bodik 2014] is related in that it also lifts interpretation from concrete programs to symbolic programs (e.g., defined by a program sketch). A key distinction is that our work directly performs program synthesis and uses finite tree automata to succinctly encode the program space, instead of reducing the search problem to SMT solving.
Finite tree automata (FTAs) for program synthesis. During lifted interpretation, programs are clustered into equivalence classes, succinctly compressed in an FTA. Compared to prior work [Handa and Rinard 2020; Koppel et al. 2022; Miltner et al. 2022; Wang et al. 2017a,b, 2018a,b; Yaghmazadeh et al. 2018], states in our FTAs encode context-output behaviors, rather than input-output behaviors, of programs. The lifted interpretation idea is not tied to FTAs: our algorithm is developed using FTAs, but we believe other data structures (such as VSAs [Gulwani 2011] or e-graphs [Willsey et al. 2021]) or an enumeration-based approach [Peleg et al. 2020] could also be leveraged.
Program synthesis for web automation. Our instantiation presents a new program synthesis algorithm for web automation. This is an important domain with a long line of work [Barman et al. 2016; Chasins et al. 2015, 2018; Chen et al. 2023; Dong et al. 2022; Fischer et al. 2021; Leshed et al. 2008; Lin et al. 2009; Little et al. 2007; Pu et al. 2022, 2023] in both human-computer interaction and programming languages. The most related work is WebRobot [Dong et al. 2022]: building upon its trace semantics, we develop a novel synthesis algorithm that can automate a significantly broader range of more challenging tasks much more efficiently.
Programming-by-demonstration (PBD). In particular, our algorithm is a form of programming-by-demonstration that synthesizes programs from a user-demonstrated trace of actions. Different from prior PBD work [Chasins et al. 2018; Dong et al. 2022; Lau et al. 2003; Lieberman 1993; Mo 1990] that is based on either brute-force enumeration or heuristic search, Arborist uses observational equivalence to reduce the search space and leverages finite tree automata to succinctly represent all equivalence classes of programs.

CONCLUSION
We proposed lifted interpretation, a general approach to reduce the space of programs with local variables, thereby accelerating program synthesis. We illustrated how lifted interpretation works on a simple functional language, and presented a full-fledged instantiation that performs programming-by-demonstration for web automation. Evaluation results in the web automation domain show that lifted interpretation allows us to build a synthesizer that significantly outperforms state-of-the-art techniques.
Finally, by completeness of EvaluateFTA, the program is also represented by a final state in the evaluated FTA, whose footprint records the corresponding context. By soundness of EvaluateFTA, the evaluated FTA is well-annotated, so the footprint follows the evaluation up to an index that marks the end of loop evaluation (or the end of the trace). Combining everything together, we have the following main theorem:

Theorem 4.5. Given an action trace, a DOM trace Π, and input data, our synthesis algorithm always terminates. Moreover, if there exists a program in our grammar (shown in Figure 8) that generalizes the action trace (given the DOM trace and input data) and satisfies the conditions that (1) every loop has at least two iterations exhibited in the trace and (2) its final expression is a loop, then our synthesis algorithm (shown in Algorithm 1) returns a program that generalizes the action trace upon FTA saturation.
Proof. Our algorithm always terminates since there are only finitely many ways to reroll a trace. Moreover, if there exists a program that generalizes the trace and satisfies conditions (1) and (2), such a program also reproduces the trace, and its last loop expression produces at least one action (since any loop expression has at least two exhibited iterations). By completeness (Theorem C.8) of our algorithm, this program is represented by the FTA. As a result, the set of generalizing programs is not empty, and Rank will heuristically pick such a program of smallest size. By soundness (Theorem C.5) of our algorithm, the returned program is always correct with respect to the given DOM trace and input data. □

Figure 4. A simple functional language. Here, the input variable is a list of integers. We simplify the standard fold operator to use a default seed of 0 (which is implicit and not shown as an argument). Note that fold introduces two local variables: one bound to the accumulator, and one bound to each element of the input list. The lambda body may use both local variables. The "add" and "mult" operators are standard addition and multiplication.
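To make the semantics described in the Figure 4 caption concrete, here is a hedged Python rendering of an interpreter for this language. The tuple-based AST encoding and the variable names for the input list ("l"), accumulator ("x"), and current element ("y") are our own illustrative choices, since the figure's symbols are not reproduced here.

```python
# Illustrative interpreter for the simple functional language of Figure 4:
# an input list variable, add, mult, and fold with an implicit seed of 0.
# AST encoding (nested tuples) and variable names are our assumptions.

def interp(e, env):
    """Evaluate expression e under env, a dict mapping variable names to values."""
    if isinstance(e, str):                    # variables: "l", "x", "y"
        return env[e]
    op = e[0]
    if op == "add":                           # ("add", e1, e2)
        return interp(e[1], env) + interp(e[2], env)
    if op == "mult":                          # ("mult", e1, e2)
        return interp(e[1], env) * interp(e[2], env)
    if op == "fold":                          # ("fold", body, list_expr)
        acc = 0                               # implicit seed of 0, per Figure 4
        for elem in interp(e[2], env):
            # The body may use both locals: "x" (accumulator), "y" (element).
            acc = interp(e[1], {**env, "x": acc, "y": elem})
        return acc
    raise ValueError(f"unknown operator: {op}")

# Example: summing the input list.
prog = ("fold", ("add", "x", "y"), "l")
# interp(prog, {"l": [1, 2, 3]}) evaluates to 6.
```

The fold case shows exactly where the local-variable contexts of the paper arise: each iteration evaluates the body under a fresh binding of the accumulator and element.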

Figure 5. FTA constructed for the grammar from Figure 4. Each FTA state is annotated with a grammar symbol and a footprint (which maps each reachable context to a value).

Figure 9. AST for the action trace from Example 4.1.

Figure 13. Illustration of one potential FTA given by SpeculateFTA for the initial FTA from Figure 10.

Figure 14. Illustration of a part of the FTA returned by EvaluateFTA for the FTA from Figure 13.
Example 4.4. Consider the initial FTA A in Figure 10.
Subfigure captions: "Synthesis times for solved benchmarks"; "Distribution of exhaustion times against each number of sampled candidate selectors."

Figure 23. RQ3 results. Selectors are sampled uniformly at random from all candidate selectors of size up to four. Arborist exhausts a benchmark's program space if it terminates before the timeout (i.e., 10 seconds).

Table 1. Footprints of the two lambda bodies, respectively, across all iterations.
Intuitively, if there exists a program rooted at grammar symbol s that evaluates to value v_i under context c_i for all i ∈ [1, n], then our FTA A has a state (s, {c_1 ↦ v_1, ..., c_n ↦ v_n}), and vice versa. A context is reachable if it can actually emerge when executing programs from a given grammar on a given input. Given a finite grammar, if all programs terminate, then the number of reachable contexts is finite. Our work assumes a finite number of reachable contexts, which we believe is a reasonable assumption for program synthesis.
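Such a state can be modeled, under our own illustrative encoding (not Arborist's actual data structure), as a grammar symbol paired with a hashable footprint, so that programs with identical context-output behavior collapse into one state:

```python
# Illustrative encoding of an FTA state as described above: a grammar
# symbol paired with a footprint mapping each reachable context to the
# value produced under it. The frozenset encoding is our own choice,
# made so states can be compared and used as dictionary keys.

def make_state(symbol, footprint):
    """footprint: dict mapping a hashable context to a value."""
    return (symbol, frozenset(footprint.items()))

# Two programs with identical footprints yield the same state, so they
# share one equivalence class regardless of insertion order:
s1 = make_state("E", {("x=0", "y=1"): 1, ("x=1", "y=2"): 3})
s2 = make_state("E", {("x=1", "y=2"): 3, ("x=0", "y=1"): 1})
assert s1 == s2
```

Because a state records context-output behavior (rather than the input-output behavior of prior FTA-based synthesizers), programs containing free local variables can still be clustered once their reachable contexts are known.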
That is, if any final state in A "evaluates to" a state under the context, then the resulting state is a final state of the evaluated FTA. This process also yields a set of transitions for each such final state, which are added as transitions to the evaluated FTA.
Algorithm fragment. Output: a program that generalizes the trace, or null if no such program can be found.
Example fragment. Consider the anti-unifier from Example 4.3. To parametrize a state, we make a copy of its sub-FTA (via SubFTA) from A, given the anti-unifier, and then invoke Rule (2) from Figure 19 with EnterData; this yields a new transition that indeed appears in the evaluated FTA in Figure 13.
Figure 22. RQ2 results. Given X candidate selectors (sampled uniformly at random from all candidate selectors of size up to 3), we have Y benchmarks whose search space is confirmed to contain an intended program.