Programming-by-Demonstration for Long-Horizon Robot Tasks

The goal of programmatic Learning from Demonstration (LfD) is to learn a policy in a programming language that can be used to control a robot’s behavior from a set of user demonstrations. This paper presents a new programmatic LfD algorithm that targets long-horizon robot tasks which require synthesizing programs with complex control flow structures, including nested loops with multiple conditionals. Our proposed method first learns a program sketch that captures the target program’s control flow and then completes this sketch using an LLM-guided search procedure that incorporates a novel technique for proving unrealizability of programming-by-demonstration problems. We have implemented our approach in a new tool called PROLEX and present the results of a comprehensive experimental evaluation on 120 benchmarks involving complex tasks and environments. We show that, given a 120 second time limit, PROLEX can find a program consistent with the demonstrations in 80% of the cases. Furthermore, for 81% of the tasks for which a solution is returned, PROLEX is able to find the ground truth program with just one demonstration. In comparison, CVC5, a syntax-guided synthesis tool, is only able to solve 25% of the cases even when given the ground truth program sketch, and an LLM-based approach, GPT-Synth, is unable to solve any of the tasks due to the environment complexity.


Introduction
Learning From Demonstration (LfD) is an attractive paradigm for teaching robots how to perform novel tasks in end-user environments [Argall et al. 2009]. While most classical approaches to LfD are based on black-box behavior cloning [Ho and Ermon 2016; Ly and Akhloufi 2021], recent work has argued for treating LfD as a program synthesis problem [Holtz et al. 2020a; Porfirio et al. 2023; Xin et al. 2023]. In particular, programmatic LfD represents the space of robot policies in a domain-specific language (DSL) and learns a program that is consistent with the user's demonstrations.
Although this programmatic approach has been shown to offer several advantages over black-box behavior cloning in terms of data efficiency, generalizability, and interpretability [Holtz et al. 2021; Lipton 2018], existing work in this space suffers from two key shortcomings: First, most prior techniques focus on simple Markovian policies that select the next action based only on the current state. As a result, the target programs have a simple decision-list structure, and the main difficulty lies in inferring suitable predicates for each branch. Second, most existing techniques have only been applied to restricted domains with limited object and interaction types, such as robot soccer playing, where the entities of interest are known a priori and comprise a small set.
Our goal in this work is to develop a programmatic LfD approach for long-horizon tasks that commonly arise in service mobile robot settings, e.g., putting away groceries in a domestic setting, or proactively providing tools and parts to a mechanic in assistive manufacturing. Long-horizon robotics tasks are inherently more challenging, as the robot needs to reason about the interactions between a long sequence of actions (e.g., that the dishes must be cleared from a table before it can be wiped down) and the effect of specific environmental states on sequences of actions (e.g., a robot tasked with dusting a shelf must first remove all items from the shelf if it is not empty vs. directly dusting it if there are no items on it). This is in stark contrast to control tasks, such as motion control, where the policy just needs to select the next action for a single time step.
Recognizing the importance of long-horizon tasks, recent work [Porfirio et al. 2023] has proposed a multi-modal user interface (combining natural language with hand-drawn navigation paths) to facilitate programmatic LfD in this setting. This paper makes another stride towards that goal, but in an orthogonal direction, by learning more complex programs from demonstration traces. In particular, the approach that we propose in this paper aims to (a) handle tasks that require complex control flow (such as nested loops with multiple conditionals) and (b) scale to demonstrations performed in complex environments with hundreds of objects and a large number of relationships to consider between those objects.
Our proposed approach tackles these two challenges using a novel program synthesis algorithm that is illustrated schematically in Figure 1. First, given a set of demonstration traces, our approach infers a control flow sketch of the target program. To do so, our method abstracts each demo trace as a string over a finite alphabet and then learns a set of simple regular expressions that "unify" all of the demonstrations. As there is an obvious correspondence between regex operators and control flow structures (e.g., loops as Kleene star; conditionals as disjunction), our method can quickly infer the control flow structure of the target program from a small number of demonstrations. Furthermore, because the sketch learner prefers small regular expressions over complex ones, this application of the Occam's razor principle introduces inductive bias towards control structures that are more likely to generalize to unseen traces.
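As a concrete illustration of this Occam's razor preference, the following sketch filters a candidate pool down to the regexes that match every abstracted trace and keeps the smallest one. The single-letter alphabet is a hypothetical stand-in for abstracted actions (e.g., X = goto-room, O = open-bin, G = grab-sheet, P = put-sheet, C = close-bin), and a real sketch learner would enumerate candidates rather than receive them as input:

```python
import re

def pick_sketch_regex(candidates, traces):
    """Among candidate regexes matching every abstracted trace, prefer the
    smallest one (approximated here by character count)."""
    viable = [r for r in candidates
              if all(re.fullmatch(r, t) for t in traces)]
    return min(viable, key=len) if viable else None

traces = ["XOGPCXOGPGPC"]        # one abstracted demonstration
candidates = ["XOGPCXOGPGPC",    # trivial regex: the trace itself
              "(XO(GP)*C)*"]     # smaller, looping generalization
print(pick_sketch_regex(candidates, traces))  # -> (XO(GP)*C)*
```

The looping regex wins because it is smaller, even though both candidates match the trace; it is exactly the smaller regex that induces control flow likely to generalize.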
Given a program sketch capturing the underlying control flow structure, the second, sketch completion phase of our algorithm tries to find a complete program that is consistent with the given traces. This algorithm is based on top-down enumerative search, meaning that it starts by considering all DSL programs as part of the search space and gradually refines it until it contains a single program. However, because the size of the search space is exponential with respect to the number of entities in the environment, such a search strategy does not scale to complex environments. Our approach deals with this challenge using two key ideas, namely (1) guiding the search procedure using a large language model (LLM), and (2) proving unrealizability of synthesis sub-problems.

Proc. ACM Program. Lang., Vol. 8, No. POPL, Article 18. Publication date: January 2024.
LLM-guided refinement. As shown in Figure 1 (and as standard in the literature), our sketch completion procedure represents the search space as a partial program (i.e., a program containing holes), so the refinement step involves filling one of the holes in this partial program with a concrete expression. In our setting, these holes need to be instantiated with objects (or object types) in the environment, as well as properties of, and relationships between, those objects. However, because the demonstration may be performed in complex environments with many such objects and properties, each hole typically has a very large number of possible completions.
To address this problem, our method consults an LLM to perform refinement; intuitively, this serves two purposes: First, when the target program contains object types that do not explicitly occur in the demonstrations (a very common scenario), LLM guidance allows the synthesizer to propose new entities by reasoning about commonalities between objects that do occur in the demonstrations. Second, by conditioning the current prediction on previous ones, the synthesizer can avoid generating programs that do not "make sense" from a semantic perspective.
Proving unrealizability. However, even with LLM guidance, the search procedure may end up constructing partial programs that have no valid completion with respect to the demonstrations. Our approach tries to avoid such dead-ends in the search space through a novel procedure for proving unrealizability. In particular, given a partial program P, our approach performs static analysis to construct a suitable abstraction, in the form of a regex, that represents all possible traces of all completions of P for the demonstration environment. Given such a regex r, proving unrealizability of a synthesis problem boils down to proving that the demonstration trace cannot possibly belong to the language of r.
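The overall shape of this check can be sketched as follows; the single-letter trace alphabet and the concrete regexes are hypothetical stand-ins for the output of the static analysis. Note the one-sided soundness: because the regex over-approximates the traces, a failed match proves unrealizability, while a successful match proves nothing:

```python
import re

def provably_unrealizable(overapprox_regex, demo_string):
    """The regex over-approximates all traces of all completions of a partial
    program; if the demonstration is not in its language, no completion can
    reproduce the demonstration."""
    return re.fullmatch(overapprox_regex, demo_string) is None

demo = "XOGPCXOGPGPC"      # abstracted demonstration, one letter per action
r_live = "(XO(GP)*C)*"     # over-approximation that still admits the demo
r_dead = "(XOGPC)*"        # forces exactly one grab per room

print(provably_unrealizable(r_live, demo))  # False: cannot prune
print(provably_unrealizable(r_dead, demo))  # True: dead end, backtrack
```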
We have implemented the proposed LfD technique in a tool called prolex and evaluate it on a benchmark set containing 120 long-horizon robotics tasks involving household activities. Given a 2-minute time limit, our approach can complete 80% of the synthesis tasks and can handle tasks that require multiple loops with several conditionals, as well as environments with up to thousands of objects and dozens of object types. Furthermore, for 81% of the learning tasks that prolex is able to complete within the 2-minute time limit, prolex learns a program that matches the ground truth from just a single demonstration. To put these results in context, we compare our approach against two relevant baselines, CVC5, a state-of-the-art SyGuS solver, and GPT-Synth, a neural program synthesizer, and experimentally demonstrate the advantages of our approach over these alternatives. CVC5 is only able to solve 25% of the tasks even when given the ground-truth sketch. GPT-Synth, on the other hand, is unable to solve any of the tasks due to environment complexity, even when the environment is simplified to include only a small fraction of the objects in addition to the required ground-truth entities. Furthermore, we report the results of a series of ablation studies and show that our proposed ideas contribute to successful synthesis.
In summary, this paper makes the following contributions:

• We propose a novel programming-by-demonstration (PBD) technique targeting long-horizon robot tasks. Our approach can learn programs with complex control flow structures, including nested loops with conditionals, from a small number of traces and in complex environments with thousands of objects.

• We propose a new (reusable) method for proving unrealizability of synthesis problems in the PBD setting. Our method constructs a regular expression r that over-approximates all traces of a partial program in a given environment and also represents the demonstration in that environment as a string s over the same alphabet. Then, if string s is not accepted by regular expression r, the search has reached a dead end, allowing our algorithm to backtrack.

The remainder of this section presents further details of our sketch generation and completion methodology.
Sketch generation. We now give a brief overview of our sketch generation approach. Our method first abstracts the user's demonstrations as a set of strings. For instance, in our running example, the user's sole demonstration d is abstracted as the following string, α(d):

x_room a_open,bin a_grab,sheet a_put,sheet,bin a_close,bin x_room a_open,bin a_grab,sheet a_put,sheet,bin a_grab,sheet a_put,sheet,bin a_close,bin    (α(d))

where x and a denote goto and act in the demonstration, and the subscripts indicate the type of their arguments. For example, even though the demonstration specifies that the user grabbed specific sheets (namely, sheet1, sheet3, and sheet4), the string abstraction omits such details. This idea of converting the demonstration to a more abstract form is a crucial first step towards generalizing the specific demonstration into a reusable robot execution policy.
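A minimal sketch of this abstraction step, assuming a hypothetical tuple encoding of demonstration steps and a type map for the environment objects:

```python
# Hypothetical object-to-type map for the running example's environment.
obj_type = {"r1": "room", "r2": "room", "bin1": "bin", "bin2": "bin",
            "sheet1": "sheet", "sheet3": "sheet", "sheet4": "sheet"}

def alpha(demo):
    """Abstract a demonstration: keep the primitive (goto/act) and the
    argument *types*, dropping concrete object identities."""
    out = []
    for prim, op, *args in demo:
        letter = "x" if prim == "goto" else "a"
        parts = ([obj_type[op]] if prim == "goto"
                 else [op] + [obj_type[a] for a in args])
        out.append(letter + "_" + ",".join(parts))
    return " ".join(out)

demo = [("goto", "r1"), ("act", "open", "bin1"),
        ("act", "grab", "sheet1"), ("act", "put", "sheet1", "bin1"),
        ("act", "close", "bin1"),
        ("goto", "r2"), ("act", "open", "bin2"),
        ("act", "grab", "sheet3"), ("act", "put", "sheet3", "bin2"),
        ("act", "grab", "sheet4"), ("act", "put", "sheet4", "bin2"),
        ("act", "close", "bin2")]

# Both rooms abstract to the same symbols even though the concrete
# objects (sheet1 vs. sheet3/sheet4) differ.
print(alpha(demo))
```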
Next, our approach utilizes existing techniques to synthesize a regex that matches the string encoding of the demonstrations. This regex learning serves two key roles: First, when the user provides multiple demonstrations, it facilitates unifying them under a common syntactic pattern that can be translated into the target program's control flow. Second, even when the user provides a single demonstration d, regex learning facilitates generalization because we can find simpler regexes than α(d) that match the string α(d). Thus, even in the single-demonstration setting, regex learning essentially forces the algorithm to generalize the demonstration to a more abstract form.
Going back to our running example, the following regular expressions all match the string α(d) defined above:

(1) (x_room a_open,bin (a_grab,sheet a_put,sheet,bin)* a_close,bin)*
(2) (x_room a_open,bin (((a_grab,sheet a_put,sheet,bin)?)*)* a_close,bin)*
(3) (x_room (a_open,bin)? ((a_grab,sheet a_put,sheet,bin)?)* a_close,bin)*

Note that there is an obvious parallel between regex operators and the program's control flow: Since Kleene star denotes repetition, it naturally corresponds to a looping construct in our DSL. Similarly, since the optional operator (i.e., (r)?)
denotes choice, it is naturally translated into an if statement in our DSL. Thus, given a candidate regex r for the demonstrations, our approach translates it into a sketch by utilizing syntax-directed translation rules to convert regex operators to the target program's control flow. Additionally, because the program should not manipulate any objects before it perceives them, our sketch generation procedure also inserts any necessary let bindings and scan operations.
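These translation rules can be sketched as a recursive walk over a regex AST. The tuple-based AST encoding, the `??` hole notation, and the rendered sketch syntax below are illustrative assumptions, and the inferred let bindings and scan operations are omitted:

```python
def to_sketch(rx, indent=0):
    """Translate a regex AST into sketch lines: Kleene star -> loop,
    optional -> if with an unknown condition, symbol -> command."""
    pad = "  " * indent
    kind = rx[0]
    if kind == "sym":
        return [pad + rx[1]]
    if kind == "seq":
        return [line for sub in rx[1:] for line in to_sketch(sub, indent)]
    if kind == "star":
        return [pad + "loop ?? {"] + to_sketch(rx[1], indent + 1) + [pad + "}"]
    if kind == "opt":
        return [pad + "if ?? {"] + to_sketch(rx[1], indent + 1) + [pad + "}"]
    raise ValueError(kind)

# Regex (1): (x_room a_open,bin (a_grab,sheet a_put,sheet,bin)* a_close,bin)*
rx = ("star", ("seq", ("sym", "goto(??)"), ("sym", "act(open, ??)"),
               ("star", ("seq", ("sym", "act(grab, ??)"),
                                ("sym", "act(put, ??, ??)"))),
               ("sym", "act(close, ??)")))
print("\n".join(to_sketch(rx)))
```

The two Kleene stars in the regex become the two nested loops of the printed sketch, while every hole `??` is left for the completion phase.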
Going back to our running example, regular expression (2) presented above is translated into the sketch shown in Figure 3b. Observe that the three nested loops in the sketch correspond to the three nested Kleene stars in the regex, and the optional nested regex gives rise to the if statement in the innermost loop. Finally, observe that the program sketch generated from regex (2) also contains an inferred let binding: because we assume the robot cannot act on an object before perceiving it, the bin object at line 4 must first be perceived by the robot.

Sketch completion. For each sketch generated in the first step, our method tries to find a completion that matches the user's demonstration. In practice, many of the sketches generated in the first step are infeasible. For example, consider regular expression (1) presented above, which matches the string abstraction of the demonstration. Intuitively, sketches generated from this regex will be infeasible because, without an if statement in the sketch, the resulting programs would end up grabbing both sheets in room r1, rather than only the single dirty sheet. Hence, the second phase of our technique considers multiple sketches and tries to find a completion of any sketch that is consistent with the given demonstrations.
As mentioned earlier, our sketch completion algorithm is based on top-down enumerative search but (a) utilizes an LLM to guide exploration and (b) uses a trace compatibility checking procedure to quickly detect dead-ends. In particular, starting from a sketch, the synthesis algorithm maintains a worklist of partial programs containing holes to be filled. When dequeuing a partial program from the worklist, we consider its probability according to the LLM, so more promising partial programs are prioritized compared to less likely ones. Going back to our example, consider a partial program where we test the color of a bedsheet without putting it in the bin. Because the concepts "laundry bin" and "cleanliness" are much more related than "color" and "laundry bin", our technique prioritizes a program that includes the conditional checkProp(dirty, ...) over one that is based on color. As mentioned earlier, this strategy not only allows faster synthesis but also makes it more likely that the generated program will match the user's demonstrations.
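The prioritization can be sketched with a standard priority queue. Here `llm_score` is a hypothetical stub, whereas the real system queries an LLM for the likelihood of each refinement conditioned on the partial program so far:

```python
import heapq

def llm_score(partial):
    """Hypothetical stand-in for an LLM likelihood query."""
    plausibility = {"checkProp(dirty, ...)": 0.9, "checkProp(color, ...)": 0.1}
    return plausibility.get(partial, 0.5)

def enqueue(worklist, partial):
    # heapq is a min-heap, so negate the score to dequeue best-first
    heapq.heappush(worklist, (-llm_score(partial), partial))

worklist = []
for cand in ["checkProp(color, ...)", "checkProp(dirty, ...)"]:
    enqueue(worklist, cand)
print(heapq.heappop(worklist)[1])  # -> checkProp(dirty, ...)
```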
In addition to utilizing an LLM, our sketch completion method also leverages the demonstration to reason about compatibility between the partial programs and the demo trace. As an example, consider the variant of the sketch from Figure 3b but without the conditional at line 7, and suppose that the algorithm has already instantiated the question marks at lines 1 and 6 with room and sheet, respectively. Because none of the actions in the sketch modify object locations, and because there are two sheets in each room, any completion of this partial program would end up performing the grab action at least four times, but there are only three grab actions in the trace. Hence, this partial program can be pruned from the search space.
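The counting argument above can be sketched as a simple lower-bound test; the trace encoding is a hypothetical simplification, and the full system derives such bounds from the regex abstraction rather than hard-coding them:

```python
def incompatible(min_occurrences, action, trace):
    """A partial program that must execute `action` at least
    `min_occurrences` times cannot match a trace with fewer occurrences."""
    return trace.count(action) < min_occurrences

trace = ["goto", "open", "grab", "put", "close",
         "goto", "open", "grab", "put", "grab", "put", "close"]
# Without the conditional, every sheet in every room is grabbed:
# 2 rooms x 2 sheets = 4 mandatory grabs, but the trace has only 3.
print(incompatible(4, "grab", trace))  # -> True: prune this partial program
```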
• We implement these ideas in a tool called prolex and evaluate its efficacy in the context of 120 benchmarks involving 40 unique household chores in three environments. prolex can complete the synthesis task within 2 minutes for 80% of the benchmarks, and, for 81% of the completed tasks, prolex is able to learn the ground truth program from just a single demonstration.

Motivating Example
In this section, we present a motivating example to illustrate our programmatic LfD approach. Imagine a hotel worker who wants to instruct a robot to collect dirty sheets from guest rooms and place them in a laundry bin in the room. The goal of LfD is to teach this task through demonstrations rather than explicitly programming the robot. We formalize user demonstrations in a form that is amenable to being captured using smart hand-held devices, similar to existing end-user robots like the iRobot Roomba [iRobot 2023] and Amazon Astro [Lee et al. 2023].

User Demonstration. For the above task, suppose that the hotel worker performs a demonstration consisting of the 12 actions shown in the left side of Figure 2a. The demonstration takes place in two rooms, r1 and r2; Figure 3 shows the state of these two rooms before the demonstration takes place. Each room contains a large number of objects, including a bed, a laundry bin, and a few sheets on the bed. In particular, there is a clean sheet (sheet2) and a dirty sheet (sheet1) on the bed in r1, and there are two dirty sheets (sheet3 and sheet4) on the bed in r2. Note that the complete representation of the rooms includes many more object types and properties, which are omitted from the figure due to space constraints. The first five steps of the demonstration sequence shown in Figure 2a (left) correspond to the actions performed in r1, and the remaining seven steps indicate the actions performed in r2. Specifically, goto(l) indicates going to the location l, and act(op, o) indicates performing a specific action op on objects o. Hence, in our example demonstration, the user first visits room r1, where they open bin1, grab and place sheet1 in that bin, and finally close the bin. Next, they go to the second room, r2, and repeat a similar sequence of actions with bin2 and sheets sheet3 and sheet4.

Desired Output. Since our goal is to perform programmatic LfD, we wish to
learn a programmatic robot execution policy from the provided demonstration. Figure 2d shows the desired policy that generalizes from the user's only demonstration. Intuitively, this policy encodes that the robot should go to each room (line 2), identify a bin and all beds present in that room, collect all dirty sheets from the top of each bed (lines 7-8), and place them in the bin (line 9). The program also contains perception primitives that enable the robot to become aware of its environment (lines 1, 3, 5, and 6).
In particular, the function scanObj(τ) allows the robot to identify all objects of type τ that are visible from its current location and reason about their properties and relations. Function scanLoc(τ) is similar but returns all locations of type τ. Synthesis of appropriate perception primitives is crucial for effective LfD, since the robot must be able to observe and reason about the state of objects in new and unseen environments.
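To make the semantics concrete, here is a hypothetical environment model for scanObj; the primitive's name follows the text, but the data layout and the visibility rule (same room = visible) are assumptions for illustration:

```python
# Toy fragment of an environment: each object has a name, a type, a room,
# and possibly properties such as dirtiness.
env = [
    {"name": "bin1",   "type": "bin",   "room": "r1"},
    {"name": "bed1",   "type": "bed",   "room": "r1"},
    {"name": "sheet1", "type": "sheet", "room": "r1", "dirty": True},
    {"name": "sheet2", "type": "sheet", "room": "r1", "dirty": False},
]

def scanObj(tau, location):
    """Return all objects of type tau visible from the current location."""
    return [o["name"] for o in env
            if o["type"] == tau and o["room"] == location]

print(scanObj("sheet", "r1"))  # both sheets, regardless of cleanliness
```

The perception primitive returns every sheet in the room; deciding which sheets to act on (e.g., only the dirty ones) is left to the policy's conditionals.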
Synthesis Challenges. In this context, generating the desired policy from the user's demonstration is challenging for several reasons: (1) the desired program has complex control flow, with three nested loops and an if statement with multiple predicates; (2) it requires performing appropriate perception actions that have no counterpart in the given demonstration; and (3) it requires reasoning about high-level concepts that are also not indicated in the demonstration, such as being dirty or being on top of another object. Additionally, observe that the synthesized program needs to refer to object types (e.g., bed) that are not involved in the demonstration. Hence, the synthesizer cannot restrict its attention to the objects in the demonstration, as the desired program may refer to any of the entities in the environment. This means that the difficulty of the synthesis task is inherently sensitive to the complexity of the environment.
Our Approach. Our algorithm first generates a set of sketches of the target program based on the demonstration. These sketches capture the control flow structure of the target program but contain many missing expressions ("holes"). The goal of the subsequent sketch completion step is to fill these holes in a way that scales to complex environments.
Sketch generation. To generate a sketch, our method first abstracts the user's demonstrations as a set of strings: Figure 2a (right) shows the string abstracted from the user's sole demonstration using an abstraction function α, where G and A denote goto and action events in the demonstration, and the subscripts indicate the types of their arguments. For example, even though the demonstration specifies that the user grabbed specific sheets (namely, s1, s3, and s4), the string abstraction omits such details and represents all three instances of this action using A_{grab, sheet}. Each character in the abstract string is highlighted using a different color to visually aid the reader. The idea of converting the demonstration to a more abstract form is a crucial first step towards generalization. Next, our approach utilizes existing techniques to synthesize a regex that matches the string encoding of the demonstrations: Figure 2b presents three regular expressions that all match the string abstraction of the given demonstration. Note that there is an obvious parallel between regex operators and the program's control flow: since Kleene star denotes repetition, it naturally corresponds to a looping construct. Similarly, since the optional operator (i.e., (·)?) denotes choice, it is naturally translated into an if statement. Thus, given a candidate regex r for the demonstrations, our approach translates it into a sketch in a syntax-directed way. For instance, the third regular expression from Figure 2b is translated to the sketch shown in Figure 2c, where the three nested loops and the conditional block are marked using the same colored dashed lines both in the regex and in the sketch. Additionally, because the program should not manipulate any objects before it perceives them, our sketch generation procedure also inserts any necessary perception primitives into the sketch. For instance, the sketch in Figure 2c contains an inferred let binding (labelled P1), since the object of type bin at line 4 cannot be acted upon before it is first perceived by the robot.

Proc. ACM Program. Lang., Vol. 8, No. POPL, Article 18. Publication date: January 2024. 18:6 N. Patton, K. Rahmani, M. Missula, J. Biswas, and I. Dillig

Sketch completion.
For each sketch generated in the first step, our method tries to find a completion that matches the user's demonstration. In practice, several of these sketches are unrealizable, meaning that no completion will match the user's demonstrations. For example, consider the first regex presented in Figure 2b, which matches the string abstraction of the demonstration. This regex does not include the optional operator (·)? used in the correct regex. Intuitively, sketches generated from this regex are infeasible because, without an if statement in the sketch, the resulting programs would end up grabbing both sheets in room r1 rather than only the single dirty sheet. Hence, the second phase of our technique considers multiple sketches and tries to find a completion of any sketch that is consistent with the given demonstrations.
Our sketch completion algorithm is based on top-down enumerative search but (a) utilizes a novel unrealizability checking procedure to quickly detect dead ends and (b) leverages an LLM to guide exploration. In particular, starting from a sketch, the synthesis algorithm maintains a worklist of partial programs containing holes to be filled. When dequeuing a partial program from the worklist, we consider its probability according to the LLM, so more promising partial programs are prioritized over less likely ones. Going back to our example, consider a partial program where we test the color of a bedsheet before putting it in the bin. Because the concepts "laundry bin" and "cleanliness" are much more closely related than "laundry bin" and "color", our technique prioritizes a program that includes the conditional checkProp(x, dirty) over one that is based on color. Our sketch completion method also leverages the demonstration to prove unrealizability of a given synthesis problem. As an example, consider the partial program shown in Figure 4. We can prove that no completion of this partial program is consistent with the user's demonstrations: because none of the actions in the partial program modify object locations, and because there are two sheets in each room, any completion of this partial program would end up performing the grab action at least four times, but there are only three grab actions in the trace. Hence, this partial program is a dead end, and our approach can detect unrealizability of such a synthesis sub-problem. To do so, it statically analyzes the partial program to infer upper and lower bounds on the number of loop executions. It then uses this information to construct a regex, shown at the bottom of Figure 4, that summarizes all possible traces that this partial program can generate. Since the demo trace from Figure 2a is not accepted by this regex, the partial program is provably unrealizable.
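To make the counting argument concrete, the check can be simulated with Python's re module. The single-character encoding below is our own illustrative assumption, not the paper's exact alphabet:

```python
import re

# Assumed single-character abstraction of the demonstration:
# G = goto room, O = open bin, A = grab sheet, P = place sheet, C = close bin.
# Only the one dirty sheet is grabbed in room 1.
demo = "GOAPC" + "GOAPAPC"

# Over-approximating regex for a partial program whose loop body grabs
# *every* sheet on the bed: two rooms, two sheets per room, so the
# grab/place pair must fire exactly four times.
over_approx = r"(?:GO(?:AP){2}C){2}"

# The demonstration contains only three grabs, so it is rejected,
# proving the partial program unrealizable for this demonstration.
unrealizable = re.fullmatch(over_approx, demo) is None
print(unrealizable)  # -> True
```

Here `re.fullmatch` plays the role of the membership test: a trace outside the over-approximating language rules out every completion of the partial program.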

Robot Execution Policies
In this section, we introduce a domain-specific language (DSL) for programming robots and provide a formal definition of the robot learning from demonstrations (LfD) problem.

Syntax
The syntax of our DSL is presented in Figure 5. A robot program (P) contains functions to perform various operations on a single object (actUnary) or a pair of objects (actBinary). The robot can move between locations using the goto function. The robot becomes aware of its location by scanning the environment using scanLoc, and it becomes aware of objects in its current location using the scanObj function. The result of running a scan operation is an ordered list of location or object instances (denoted by l) of the specified type τ. For example, scanObj(plate) yields all plates at the current location of the robot. Specific elements in the scan result can be bound to variables using a restricted let binding of the form let x := getNth(l, i). This expression introduces a new variable x and assigns the i-th element of list l to x. As standard, the DSL also contains typical conditional and looping control structures. Conditional expressions check properties of objects (checkProp), relationships between them (checkRel), as well as their Boolean compositions.
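As a rough illustration of how such a DSL might be represented in an implementation, the following Python dataclasses encode a fragment of the abstract syntax. All class and field names here are our own assumptions, not the paper's:

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical AST for a fragment of the robot DSL.
@dataclass
class Goto:            # goto(x): move to the location bound to x
    loc_var: str

@dataclass
class ActUnary:        # e.g. open(b), close(b)
    action: str
    obj_var: str

@dataclass
class ActBinary:       # e.g. place(s, b)
    action: str
    obj_var: str
    obj2_var: str

@dataclass
class GetNth:          # let v := getNth(l, i)
    var: str
    list_var: str
    index: int

@dataclass
class Foreach:         # foreach(v in l) { body }
    var: str
    list_var: str
    body: List["Stmt"]

@dataclass
class If:              # if (b) { body }; predicate left opaque here
    cond: str
    body: List["Stmt"]

Stmt = Union[Goto, ActUnary, ActBinary, GetNth, Foreach, If]

# A fragment of the running example's policy: visit every room.
policy = Foreach("r", "rooms", [Goto("r")])
print(type(policy).__name__)  # -> Foreach
```

A real implementation would also carry type annotations on variables, which the sketch completion phase relies on.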
The abstractions for robot actions and perception in our DSL are similar to widely accepted symbolic abstractions for classical planning [Aeronautiques et al. 1998; Fox and Long 2003] and, more recently, symbolic robot policies [Liang et al. 2022].
Note that the DSL presented in Figure 5 is parametrized over a set of domain-specific terminals, indicated by an asterisk. For example, location types are not fixed and can vary based on the target application domain: for robot execution policies targeting household chores, locations might be kitchen, living room, basement, etc. Similarly, object types, properties, relations, and actions are also domain-specific and can be customized for a given family of tasks.

Operational Semantics
In this section, we present the operational semantics of our robot DSL using the small-step reduction relation ⇒ shown in Figure 6. This relation formalizes how the robot interacts with its environment while executing the program. Specifically, the relation ⇒ is defined between tuples of the form (P, E, Γ, T), where P is a program, E is the robot's execution environment, Γ is a valuation (mapping variables to their values), and T is a program trace. In more detail, the environment E is a quadruple (L, O, ℓ, I) where L is a set of typed location identifiers; O is a mapping from (typed) object identifiers to their corresponding location; ℓ is the current location of the robot; and I is an interpretation for all the relation symbols. That is, for a relation R, I(R) yields the set of tuples of objects for which R evaluates to true. Given an object type τ_o and a location ℓ, we write E.objs(ℓ, τ_o) to denote the list of all objects that are at location ℓ and have type τ_o. Similarly, given a location type τ_l, the list of locations of this type is denoted by E.locs(τ_l). Finally, a trace is a sequence of actions performed by the robot. Robot actions are denoted using act(a, ō) and goto(ℓ), where a is an action that was performed on objects ō, and ℓ is a location that the robot visited. Note that robot execution policies are effectful programs: for example, the location of the robot or some properties of an object can change after executing an action.
With the above notation in place, we now explain the operational semantics from Figure 6 in more detail. The first rule, labeled (sequence), defines how the robot takes a step by executing the first statement in the program. The next rule, (skip), defines the semantics of executing a skip statement, which has no effect on the execution state. The rules (if-t) and (if-f) describe the flow of the program when a conditional statement if (b) { s } is executed. First, the Boolean expression b is evaluated to ⊥ or ⊤; depending on the result, the program either skips or executes s. Boolean expressions are evaluated using the relation ⇓ defined in Figure 7. This relation is parameterized by the execution environment E and valuation Γ.
The (act-unary) and (act-binary) rules specify the outcomes of executing a unary and a binary action, respectively, using the auxiliary relation → ⊆ E × E. Given an environment E, an action a, and the affected object instance(s), the relation → formalizes how the environment E is modified based on the semantics of action a. Since our DSL is parameterized over the set of actions, we do not discuss the → relation in detail in the main body of the paper and refer the interested reader to Appendix A.1 for a representative subset of the actions used in our evaluation.
The (goto) rule defines the effect of executing a goto(x) statement, where the environment is updated to reflect the robot's new location, and a new trace element goto(ℓ) is generated and appended to the existing trace, where ℓ is the location stored in variable x. Next, the rules (let-obj) and (let-loc) define the semantics of the let x := getNth(l, n) statement, which assigns to variable x the n-th element of the list l obtained via either scanObj or scanLoc. Specifically, scanObj(τ_o) yields all objects of type τ_o that are present at the robot's current location, and scanLoc(τ_l) yields all locations of type τ_l. The rules (foreach-obj) and (foreach-loc) describe the semantics of loops of the form foreach(x ∈ l) { s }, where l is the result of a scan operation. As expected, these rules iteratively bind x to each of the elements of l and execute the loop body s under this new valuation.
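To make the trace-producing semantics concrete, here is a toy big-step interpreter for a small fragment of the DSL. The tuple-based statement encoding and the environment layout are our own simplifications, not the paper's formalism:

```python
# Toy interpreter: env maps "loc_types" to each location's type and
# tracks "robot_at"; the returned trace records goto/action events,
# mirroring the trace semantics sketched above.
def run(stmts, env, valuation, trace):
    for s in stmts:
        kind = s[0]
        if kind == "goto":                  # goto(x): move and log
            loc = valuation[s[1]]
            env["robot_at"] = loc
            trace.append(("goto", loc))
        elif kind == "act":                 # unary action on one object
            trace.append((s[1], valuation[s[2]]))
        elif kind == "scan_loc":            # let v := scanLoc(type)
            valuation[s[1]] = [l for l, t in env["loc_types"].items()
                               if t == s[2]]
        elif kind == "foreach":             # foreach(v in l) { body }
            for item in valuation[s[2]]:
                valuation[s[1]] = item
                run(s[3], env, valuation, trace)
    return trace

env = {"robot_at": None,
       "loc_types": {"room1": "room", "room2": "room", "hall": "corridor"}}
prog = [("scan_loc", "rooms", "room"),
        ("foreach", "r", "rooms", [("goto", "r")])]
print(run(prog, env, {}, []))  # -> [('goto', 'room1'), ('goto', 'room2')]
```

Running the policy on an environment and collecting the trace is exactly the P(E) = T relation used to define consistency with a demonstration.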
Finally, we use the ⇒ relation to define the semantics of executing a policy P on environment E. Given E and robot execution policy P, we write P(E) = T iff (P, E, Nil, Nil) ⇒* (skip, _, _, T), where Nil denotes an empty list/mapping.

Problem Statement
In this section, we formally define the LfD problem that we address in this paper. Informally, given a set of demonstrations D, the LfD problem is to find a robot execution policy P* (in the DSL of Figure 5) such that P* is consistent with D. To make the notion of consistency more precise, we represent a demonstration d as a pair (E, T), where E is the initial environment and T is the trace of the user's demonstration in this environment.

Definition 3.1 (Consistency with demonstration). We say that a robot execution policy P is consistent with a demonstration d = (E, T), denoted P |= d, iff P(E) = T.
We also extend this notion of consistency to a set of demonstrations D, writing P |= D iff P |= d for every demonstration d in D. We can now formalize our problem statement as follows:

Definition 3.2 (Programmatic LfD). Given a set of demonstrations D, the programmatic LfD problem is to find a robot execution policy P* such that P* |= D.

Synthesis Algorithm
In this section, we present our synthesis technique for solving the programmatic LfD problem defined in the previous section.We start by giving an overview of the top-level algorithm and then describe each of its key components in more detail.

Top-Level Algorithm
Our top-level learning procedure is presented in Algorithm 1. This algorithm takes as input a set of demonstrations D and returns a policy P such that for every d ∈ D, we have P |= d. If there is no programmatic policy that is consistent with all demonstrations, the algorithm returns ⊥.
The synthesis procedure starts by constructing an abstraction of each demonstration d ∈ D as a string over the alphabet Σ = {G_τ, A_{a,τ}, A_{a,τ,τ′}}, where τ, τ′ range over location and object types (e.g., room, sheet) and a denotes a specific type of action (e.g., grab). This abstraction is performed at line 2 of Algorithm 1 using the function α, defined as follows:

α(goto(ℓ)) = G_{τ_ℓ}    α(act(a, o)) = A_{a, τ_o}    α(act(a, o, o′)) = A_{a, τ_o, τ_o′}

where τ_ℓ and τ_o denote the type of location ℓ and object o, respectively. In other words, when abstracting a trace as a string, the algorithm replaces specific object instances with their corresponding types. Intuitively, this abstraction captures the commonality between different actions in the trace, allowing generalization from a specific sequence of actions to a more general program structure.
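A plausible Python rendering of the abstraction function α is shown below. The event encoding and the type tables are illustrative assumptions drawn from the running example:

```python
# Illustrative type tables for the hotel example (our own assumption).
LOC_TYPE = {"r1": "room", "r2": "room"}
OBJ_TYPE = {"s1": "sheet", "s3": "sheet", "s4": "sheet",
            "b1": "bin", "b2": "bin"}

def alpha(trace):
    """Replace concrete locations/objects with their types."""
    out = []
    for event in trace:
        if event[0] == "goto":
            out.append(("G", LOC_TYPE[event[1]]))
        else:  # (action, obj, ...) -> keep action name, abstract arg types
            act, *objs = event
            out.append(("A", act, tuple(OBJ_TYPE[o] for o in objs)))
    return out

demo = [("goto", "r1"), ("open", "b1"), ("grab", "s1"),
        ("place", "s1", "b1"), ("close", "b1")]
print(alpha(demo)[2])  # -> ('A', 'grab', ('sheet',))
```

Note how grabbing s1, s3, or s4 all map to the same abstract symbol, which is what enables a single regex to cover repeated actions.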
Next, given the string abstraction S of the demonstrations D, the loop in lines 3-12 alternates between the following key steps:
• Regex synthesis: The GetNextRegex procedure at line 4 finds a regular expression r matching all strings in S. Intuitively, this regex captures the main control flow structure of the target program and can be used to generate a set of program sketches.
• Sketch generation: The inner loop in lines 6-10 translates a given regex to a set of program sketches. As shown in Figure 8, a sketch has almost the same syntax as programs in our DSL except that the arguments of most constructs are unknown, as indicated by question marks. In particular, note that (1) the types of objects and locations being scanned are unknown, (2) the predicates of if statements are yet to be determined, and (3) the specific objects and locations being acted on are also unknown (although their types are known).
• Sketch completion: Given a candidate program sketch Ω, line 9 of the algorithm invokes CompleteSketch to find a completion P of Ω that is consistent with the demonstrations. If CompleteSketch does not return failure (⊥), the synthesized policy is guaranteed to satisfy all demonstrations; hence, Synthesize returns P as a solution at line 10.
In the remainder of this section, we describe sketch generation and sketch completion in more detail. Because learning regexes from a set of positive string examples is a well-understood problem, we do not describe it in this paper; our implementation uses an off-the-shelf tool customized to our needs via some post-processing (see Section 5).

Sketch Inference
Given a regex r over the alphabet Σ = {G_τ, A_{a,τ}, A_{a,τ,τ′}}, the goal of sketch inference is to (lazily) generate a set of program sketches. The input to the sketch inference procedure is a regular expression of the following form:

r ::= c | r · r | (r)★ | (r)?    where c ∈ Σ

Given such a regex, sketch inference consists of two steps:
(1) Syntax-directed translation: In the first step, sketch inference converts the given regex to control-flow operations using syntax-directed translation. Intuitively, string concatenation is translated into sequential composition; Kleene star corresponds to loops; and optional regexes translate into conditionals.
(2) Perception inference: While the sketches generated in step (1) are syntactically valid, they may lack essential perception operations (i.e., scanObj and scanLoc). Hence, in the second step, our sketch inference procedure inserts these perception operations so that the resulting sketch is perception-complete, meaning that it contains at least the minimum number of required scan operations. However, since the target program may require additional scan operations, the second step of sketch inference yields a set of sketches that differ only with respect to the placement of these perception operations.
Figure 9 presents our syntax-directed translation rules for converting a regular expression into a syntactically valid sketch using judgments of the form r ▷ Ω, meaning that regex r is translated to sketch Ω. As expected, the characters G_τ, A_{a,τ}, and A_{a,τ,τ′} are translated to goto, actUnary, and actBinary constructs, respectively. The Kleene star operator is translated into a looping construct, which may iterate either over locations or over objects. Finally, regex concatenation is translated into sequential composition, and (r)? is translated into a conditional with an unknown predicate.
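The syntax-directed translation can be sketched in Python as follows. The tuple-based regex AST and the string rendering of sketches are our own assumptions; a real implementation would emit DSL terms rather than text:

```python
# Regex operators -> control flow: concatenation -> sequencing,
# Kleene star -> foreach, optional -> if.  '??' marks a hole.
def to_sketch(regex, depth=0):
    pad = "  " * depth
    kind = regex[0]
    if kind == "char":                       # abstract action symbol
        return [pad + regex[1]]
    if kind == "concat":
        return [line for r in regex[1:] for line in to_sketch(r, depth)]
    if kind == "star":
        return ([pad + "foreach (?? in ??) {"]
                + to_sketch(regex[1], depth + 1) + [pad + "}"])
    if kind == "opt":
        return ([pad + "if (??) {"]
                + to_sketch(regex[1], depth + 1) + [pad + "}"])
    raise ValueError(kind)

# (G (A_grab (A_place)?)*)*: loop over rooms, loop over sheets,
# optionally place each one.
regex = ("star", ("concat", ("char", "goto(??)"),
                  ("star", ("concat", ("char", "grab(??)"),
                            ("opt", ("char", "place(??, ??)"))))))
print("\n".join(to_sketch(regex)))
```

On this input the translation produces two nested foreach loops with an if statement inside, mirroring the structure of the running example's sketch.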
Recall that our DSL also allows let bindings that assign a new variable to the result of a perception operation. Since program traces (and, hence, the inferred regexes) do not contain these perception operations, the last two rules in Figure 9 allow inserting let bindings at arbitrary positions in the sketch. In particular, if r can be translated into a sketch Ω, then the last two rules of Figure 9 state that r can also be translated into a sketch of the form s; Ω, where s is a new let binding that assigns a fresh variable x to an entity obtained by scanning objects or locations.
In general, observe that a regex can give rise to a large number of program sketches, as we do not a priori know where to insert let bindings.To tackle this problem, our lazy sketch inference procedure first translates a regex into a sketch without using the last two rules in Figure 9 for inserting let bindings.In a second step, it infers where perception operations are needed and inserts let bindings according to the results of this analysis.
This second step of our sketch inference procedure is formalized using the notion of perception completeness. Intuitively, a sketch is perception complete if the program perceives (using scan operations) all objects that it manipulates before it manipulates them. If a sketch is not perception complete, it can never be realized into a valid program; hence, it is wasteful to consider such sketches. We formalize the notion of perception completeness using the following definition:

Definition 4.1 (Perception Completeness). Let ≺ denote a standard partial order between program points. A sketch Ω is said to be perception complete if the following conditions are satisfied: (1) for every statement s1 := actUnary(a, ?? : τ_o) in Ω, there exists a statement s2 of the form x := scanObj(τ_o) such that s2 ≺ s1.
Our sketch inference algorithm leverages this notion of perception completeness to lazily enumerate program sketches as follows. First, it translates a given regex into a set of sketches using the inference rules shown in Figure 9, but without the last two rules. It then infers a minimal set of applications of the last two rules needed to make each sketch perception complete and augments the resulting sketches with the inferred let bindings. Finally, because additional let bindings may be needed, it lazily inserts more let bindings (up to a bound) if the current sketch does not produce a valid completion.
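A much-simplified version of the perception-completeness check, restricted to a flat statement sequence (so that ≺ is just textual order), might look like this; the statement encoding is assumed:

```python
# A type acted upon must be scanned at some earlier program point.
def perception_complete(stmts):
    scanned = set()
    for s in stmts:
        if s[0] == "scanObj":
            scanned.add(s[1])          # type is now perceived
        elif s[0] == "act":
            if s[1] not in scanned:    # acted on before perceived
                return False
    return True

incomplete = [("act", "sheet")]
complete   = [("scanObj", "sheet"), ("act", "sheet")]
print(perception_complete(incomplete), perception_complete(complete))
# -> False True
```

The full definition additionally handles binary actions and nested scopes, where ≺ must respect loop and conditional structure.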

Sketch Completion
We now turn our attention to the sketch completion procedure, shown in Algorithm 2, for finding a sketch instantiation that satisfies the given demonstrations. Given a sketch Ω and demonstrations D, CompleteSketch either returns a completion P of Ω that is consistent with all demonstrations or ⊥ to indicate failure. Note that the sketch completion procedure is parameterized over a statistical model M for assigning probabilities to possible sketch completions.
CompleteSketch is a standard top-down search procedure that iteratively expands partial programs until a solution is found. However, our sketch completion procedure has two novel aspects: first, it assigns probabilities to partial programs using a large language model, and second, it uses a novel compatibility checking procedure for proving unrealizability of synthesis problems.
In more detail, the sketch completion procedure initializes the worklist to a singleton containing the input sketch Ω, with corresponding probability 1.0 (line 2). It then enters a loop (lines 3-14) where each iteration processes the highest-probability partial program P in the worklist. If the dequeued partial program P is complete, meaning that it has no holes (line 5), the algorithm checks whether all demonstrations are satisfied (line 6). If so, P is returned as a solution; otherwise, it is discarded. If P is incomplete, the algorithm performs a compatibility check at line 8 to avoid solving an unrealizable synthesis problem. Next, if P is compatible with the demonstrations, the algorithm considers one of the holes h in P and all well-typed grammar productions that can be used to fill that hole. In particular, given a hole of nonterminal type N, the procedure Fill yields a set of expressions e1, ..., en such that (1) N → ei is a production in the grammar, and (2) replacing h with ei can result in a well-typed program. Hence, for each such expression ei, we obtain a new partial program P′ at line 11 by replacing hole h in P with expression ei. However, since some completions are much more likely than others, our algorithm assigns probabilities to completions using the statistical model M. Hence, when dequeuing partial programs from the worklist, the algorithm prioritizes programs that are assigned a higher probability.
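The probability-ordered worklist can be realized with Python's heapq, as in this minimal sketch; the programs and probabilities below are invented for illustration:

```python
import heapq
import itertools

# heapq is a min-heap, so probabilities are negated; the counter breaks
# ties without ever comparing the (non-orderable) program payloads.
counter = itertools.count()
worklist = []

def enqueue(prob, program):
    heapq.heappush(worklist, (-prob, next(counter), program))

def dequeue():
    neg_prob, _, program = heapq.heappop(worklist)
    return -neg_prob, program

enqueue(1.0, "sketch")
enqueue(0.2, "checkProp(x, color)")
enqueue(0.7, "checkProp(x, dirty)")   # the LLM deems 'dirty' more likely

print(dequeue()[1])  # -> sketch
print(dequeue()[1])  # -> checkProp(x, dirty)
```

Because dequeuing always yields the highest-probability entry, the LLM-preferred completion (dirty) is explored before the color-based one.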

Proving Unrealizability
A key component of our sketch completion procedure is a novel technique for proving unrealizability of a programming-by-demonstration (PbD) problem. While there has been significant prior work on proving unrealizability in the context of programming-by-example (PbE) [Feng et al. 2018a; Hu et al. 2019; Kim et al. 2023], such techniques only consider input-output behavior rather than the entire execution trace. In contrast, our goal is to prove unrealizability of synthesis problems where the specification is a set of demonstrations (i.e., traces).
Given a partial program P (representing a hypothesis space), our key idea is to generate a regex r_P such that the language of r_P includes all possible traces of all programs in the hypothesis space.
Hence, if there exists some trace T ∈ D where α(T) is not accepted by r_P, this constitutes a proof that the synthesis problem (P, D) is unrealizable.
Our algorithm for checking compatibility between partial programs and traces is presented in Algorithm 3. At a high level, this algorithm iterates over all demonstrations (lines 2-5) and returns false if it can prove that P is incompatible with some demonstration (E, T) in D. To check compatibility with (E, T), the algorithm first partially evaluates P on the initial environment E to obtain a simplified program P*, as done in existing work [Feng et al. 2017]. The novel part of our technique lies in constructing a regex abstraction of the partial program P under a given environment E. Specifically, our compatibility checking procedure constructs a regex r that over-approximates the possible behaviors of P under initial environment E. In particular, the regex r is constructed at line 4 in such a way that if α(T) is not accepted by r, then no completion of P can be compatible with (E, T) (see Theorem 4.2).
Hence, the crux of the compatibility checking algorithm is the ProgToRegex procedure (formalized as the inference rules shown in Figures 10 and 11) for generating a regex that over-approximates the behavior of P under environment E. These rules utilize the notion of an abstract environment, which is a triple Ê := (CurLoc, Locs, Objs) where (1) CurLoc is a set containing all possible locations that the robot could currently be at; (2) Locs is a mapping from location types to the set of locations of that type; and (3) Objs is a mapping from each location to the set of objects of each type at that location (or ⊤ if unknown). Because statements in our DSL can modify the environment, this notion of abstract environment is used to conservatively capture the (possibly unknown) side effects of partial programs on the environment. The inference rules shown in Figures 10 and 11 formalize the ProgToRegex procedure using two types of judgments: (1) Scan rules (shown in Figure 10) are of the form Ê ⊢ scan(...) : Θ, indicating that the cardinality of the list returned by scan must be some element of Θ. For example, if Θ = {1, 4}, this means that the number of objects/locations returned by scan is either 1 or 4. On the other hand, if Θ includes the special ★ element, then the number of elements is unknown. (2) Partial program rules (shown in Figure 11) are of the form Ê ⊢ P : Ê′, r, meaning that, under initial abstract environment Ê, (a) the behavior of P is over-approximated by regex r, and (b) Ê′ is a new environment that captures all possible environment states after executing P on Ê.
Before we explain these rules in detail, we first describe the high-level idea, which is to encode (a) atomic actions using characters drawn from the alphabet {G_τ, A_{a,τ}, A_{a,τ,τ′}}, (b) if statements using optional regexes, and (c) loops using regexes of the form r^n, where n denotes the number of times the loop will execute (or r★ if the number of loop iterations is completely unknown). As mentioned in earlier sections (and as we demonstrate empirically in Section 6), static reasoning about the number of loop iterations improves the effectiveness of our unrealizability checking procedure. With this intuition in mind, we now explain the rules shown in Figures 10 and 11.
Scan rules. There are two sets of rules for scan operations: (loc-known) and (loc-unknown) for scanning locations, and (obj-known) and (obj-unknown) for scanning objects. For a scanLoc operation, if its argument is a known location type τ_l, we simply look up the number of locations of that type in the given abstract environment. If it is unknown, we take the union over all possible location types. The rules for scanObj are similar: the (obj-known) rule handles the case where the argument is a known object type τ_o. In this case, we consider all the locations that the robot could currently be at and take the union of the number of objects of type τ_o over all of those locations. In the (obj-unknown) rule, we additionally take the union over all possible object types, since the argument of the scanObj operation is unknown.
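The cardinality analysis for scanObj can be sketched as follows; the dictionary-based layout is our own encoding of the abstract environment triple (CurLoc, Locs, Objs), with '*' standing for the unknown ⊤ case:

```python
# Theta = set of possible cardinalities of the scan result.
def scan_obj_cardinalities(abs_env, obj_type=None):
    cur_locs = abs_env["CurLoc"]
    # Known type: just that type.  Unknown: union over all object types.
    types = [obj_type] if obj_type else sorted(abs_env["ObjTypes"])
    theta = set()
    for loc in cur_locs:
        for t in types:
            objs = abs_env["Objs"].get((loc, t), "*")   # '*' = unknown
            theta.add("*" if objs == "*" else len(objs))
    return theta

abs_env = {"CurLoc": {"r1", "r2"},
           "ObjTypes": {"sheet", "bin"},
           "Objs": {("r1", "sheet"): {"s1", "s2"},
                    ("r2", "sheet"): {"s3", "s4"},
                    ("r1", "bin"): {"b1"},
                    ("r2", "bin"): {"b2"}}}
print(sorted(scan_obj_cardinalities(abs_env, "sheet")))  # -> [2]
```

With a known type the result is a tight singleton here ({2} for sheets); with an unknown argument the union over types yields {1, 2}, reflecting the extra imprecision of the (obj-unknown) rule.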
Atomic actions. The rule labeled (atomic) in Figure 11 deals with goto, actUnary, and actBinary statements and serves two roles. First, it abstracts the performed action as a letter in our regex alphabet using the abstraction function α. Second, it produces a new abstract environment Ê′ by considering all possible effects of the action on the input environment via the UpdateAbsEnv function. Since the UpdateAbsEnv function is domain-specific and depends on the types of actions of interest, we do not describe it in detail but provide a set of representative examples in Appendix A.4.
Sequence. Sequential composition is abstracted using regex concatenation, and its final effect on the environment is captured by threading the environment through the two premises of the rule.
Conditionals. As expected, if statements are abstracted using the optional operator ((·)?). Furthermore, since we do not know whether the predicate evaluates to true, we take the join of the two abstract environments. Intuitively, the join of abstract environments Ê and Ê′, denoted by Ê ⊔ Ê′, is the smallest environment that over-approximates both Ê and Ê′. An abstract environment Ê is said to over-approximate an abstract environment Ê′, denoted Ê′ ⊑ Ê, if and only if:

Ê′.Locs ⊆ Ê.Locs ∧ ∀ℓ ∈ Ê′.Locs. ∀τ ∈ ObjTypes(Ê′). Ê′.Objs(ℓ, τ) ⊆ Ê.Objs(ℓ, τ)

Then, we can define the join operator on abstract environments as follows: Ê ⊔ Ê′ = Ê′′′ iff Ê ⊑ Ê′′′, Ê′ ⊑ Ê′′′, and Ê′′′ ⊑ Ê′′ for every Ê′′ with Ê ⊑ Ê′′ and Ê′ ⊑ Ê′′. Intuitively, Ê′′′ is the result of joining Ê and Ê′ if it is the smallest abstract environment that over-approximates both.
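Under a set-based encoding of abstract environments, the join is simply a pointwise union; the following sketch illustrates this for the two branches of a conditional (the field layout is an assumption):

```python
# Locs maps location types to location sets; Objs maps (location,
# object type) pairs to object sets.  The pointwise union is the
# smallest environment over-approximating both inputs.
def join(e1, e2):
    locs = {t: e1["Locs"].get(t, set()) | e2["Locs"].get(t, set())
            for t in set(e1["Locs"]) | set(e2["Locs"])}
    objs = {k: e1["Objs"].get(k, set()) | e2["Objs"].get(k, set())
            for k in set(e1["Objs"]) | set(e2["Objs"])}
    return {"Locs": locs, "Objs": objs}

# Then-branch removed sheet s2; else-branch (predicate false) left it.
e_then = {"Locs": {"room": {"r1"}}, "Objs": {("r1", "sheet"): {"s1"}}}
e_else = {"Locs": {"room": {"r1"}}, "Objs": {("r1", "sheet"): {"s1", "s2"}}}
joined = join(e_then, e_else)
print(sorted(joined["Objs"][("r1", "sheet")]))  # -> ['s1', 's2']
```

Taking the union in each component is exactly what makes the result over-approximate both branch environments under the ⊑ order above.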
Loops. The last rule in Figure 11 summarizes the analysis of loops. First, since the environment may be modified in the loop body, this rule computes an inductive abstract environment Ê′ for the loop (see Section 5 for further details). In particular, the premise Ê ⊑ Ê′ ensures that Ê′ over-approximates the initial environment, thereby establishing our base case. Second, the premise Ê′ ⊢ s : Ê′, r ensures that Ê′ is preserved across all iterations of the loop body. Furthermore, because Ê′ is an over-approximation of the environment that the loop body s operates in, the regular expression r also over-approximates the behavior of s. Finally, to over-approximate the behavior of the entire loop, we determine the possible number of loop executions using the rules from Figure 10 under the initial environment Ê. If the scanned list can contain n different elements, then the behavior of the loop is captured as r^n. However, since we may not be able to compute the exact number of objects returned by a scan operation, we consider all possible cardinalities n1, ..., nk of the resulting set. Hence, the behavior of the loop is captured by the disjunction of the regexes r^{n1}, ..., r^{nk}.
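The construction of the loop regex from a set of possible cardinalities Θ can be sketched as follows, using Python regex syntax with '*' standing for the unknown-cardinality ★ element:

```python
# Summarize foreach-loop behavior: body regex r with cardinalities
# Theta becomes the disjunction r^{n1} | ... | r^{nk}, where the
# special '*' element yields an unbounded Kleene star.
def loop_regex(body, theta):
    alts = []
    for n in sorted(theta, key=str):
        if n == "*":
            alts.append(f"(?:{body})*")
        else:
            alts.append(f"(?:{body}){{{n}}}")
    return "(?:" + "|".join(alts) + ")"

print(loop_regex("AP", {1, 2}))  # -> (?:(?:AP){1}|(?:AP){2})
```

For instance, a grab-and-place loop body over a room whose bed holds either one or two sheets is summarized as matching either one or two AP pairs, which is precisely the information the counting argument in Figure 4 exploits.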
The following theorem states that no completion of an unrealizable partial program is consistent with all demonstrations (see Appendix A.5 for the proof). An immediate corollary is the bounded completeness of our search algorithm, i.e., Algorithm 1 always finds a program consistent with the demonstrations if one exists within the bounds of the search space.

Theorem 4.2. Let P be a partial program and let D be a set of demonstrations. Then, for any complete program P′ that is a completion of P, we have: if the compatibility check of Algorithm 3 rejects (P, D), then P′ ̸|= D.

Using LLM for Sketch Completion
We conclude this section with a discussion of how PROLEX infers a probability distribution over partial programs by prompting a large language model. Our prompting approach is inspired by the success of LLMs in addressing "fill in the middle" (FIM) challenges in the NLP literature [Liu et al. 2023]. We reduce the sketch completion task to a FIM problem, as described below.
Given a partial program P with a hole h to fill, our approach encodes the context of h in P as a natural language prompt with unknown masks [Devlin et al. 2019]. The prompt includes the set of valid completions for each mask, chosen by the Fill(·) procedure in Algorithm 2, to ensure that the resulting program is well-typed. The LLM is then instructed to infer a probability distribution over the set of completions (represented by M in Algorithm 2). This procedure is repeated whenever a partial program is expanded into a set of partial programs, to bias the enumerative search towards the intended program.
To illustrate how Prolex prompts the LLM, Figure 12a (top) shows a partial program containing six unfilled holes, denoted ??_1 through ??_6. Figure 12b (top) shows the prompt used for completing hole ??_1, which essentially corresponds to a natural-language description of the program. To generate such a prompt, our approach translates control-flow constructs to natural language in a syntax-directed way and replaces some of the holes (??_1 and ??_2 in this example) with masks. Because the remaining holes ??_3 through ??_6 will be replaced with synthetic variable names like v_1, v_2, they are not meaningful to the LLM; our approach therefore uses the types of these holes rather than masks when translating the partial program to natural language. As we can see from the LLM output in Figure 12b, the model deems bed the highest-likelihood completion of ??_1, so the sketch completion algorithm prioritizes this completion over other alternatives such as chair or mug.
As another example, consider the process of filling hole ??_2, and suppose that the algorithm has already refined ??_2 to a conjunct of the form checkProp(??, v_4) ∧ checkRel(??, v_4, v_3), introducing two new holes, as shown at the bottom of Figure 12a. (Note that the demonstration is implicitly encoded as part of the partial program using type information for each hole.) When generating the prompt for this refinement, both new holes are filled with masks, and the LLM outputs dirty as the most likely completion for the first. When querying the remaining hole, the first has already been filled with dirty, so the prompt only contains a single mask, and the LLM outputs on-top-of as the most likely completion. As illustrated by these examples, the LLM-guided search strategy allows the sketch completion engine to quickly home in on the right concepts (such as bed, dirty, and on-top-of in this example) and therefore allows the search procedure to focus on the most promising sketch completions.

Implementation
We implemented the proposed approach in a tool called Prolex, written in Python. In this section, we discuss salient aspects of Prolex that are not covered in the technical sections.
Regex Learner. Our implementation leverages an open-source library to learn regular expressions from positive samples [Nordmann 2014]. However, since our sketch generation procedure does not allow arbitrary regex operators, our implementation post-processes the synthesized regexes by applying a set of rewrite rules. For instance, because our sketch generation procedure does not allow arbitrary disjunction, one of the rewrite rules, (r | ε) → r?, replaces a disjunction with the empty regex by the optional operator. The full list of our rewrite rules can be found in Appendix A.3.

Large Language Model. Our sketch completion module utilizes the BERT large language model [Devlin et al. 2019], with pre-trained weights obtained from the HuggingFace library [Wolf et al. 2020]. Since our algorithm generates masked language modeling (MLM) queries, we chose the bert-base-uncased model. This model is primarily fine-tuned for tasks that use the whole sentence, potentially with masked words, to make decisions. It is a lightweight model, and in our experiments, it consistently returns responses in less than 70 ms on average.

Parallel Sketch Completion. The inherent independence of program sketches naturally lends itself to parallelizing the search process. To take advantage of this, Prolex spawns a new process to execute sketch completion (Algorithm 2) for each generated sketch.
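The disjunction-elimination rewrite described under "Regex Learner" above can be illustrated on a tiny regex AST. The AST classes and the rewrite function below are our own sketch under assumed representations, not the tool's implementation:

```python
from dataclasses import dataclass

# A minimal regex AST: empty regex, literal, disjunction, and optional.
@dataclass(frozen=True)
class Regex: ...

@dataclass(frozen=True)
class Eps(Regex): ...

@dataclass(frozen=True)
class Lit(Regex):
    sym: str

@dataclass(frozen=True)
class Or(Regex):
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Opt(Regex):
    inner: Regex

def rewrite(r: Regex) -> Regex:
    """Apply the rule (r | eps) -> r? bottom-up, leaving other nodes intact."""
    if isinstance(r, Or):
        l, rr = rewrite(r.left), rewrite(r.right)
        if isinstance(rr, Eps):
            return Opt(l)        # (r | eps)  ->  r?
        if isinstance(l, Eps):
            return Opt(rr)       # (eps | r)  ->  r?
        return Or(l, rr)
    if isinstance(r, Opt):
        return Opt(rewrite(r.inner))
    return r

assert rewrite(Or(Lit("a"), Eps())) == Opt(Lit("a"))
```

The other rewrite rules in Appendix A.3 would be applied in the same bottom-up pass.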
Computing Loop Invariants. Recall that our procedure for proving unrealizability relies on an inductive abstract environment for translating loops to regular expressions. Our implementation conservatively models the effect of any statement with holes by assuming that the hole could modify the environment arbitrarily.

Evaluation
In this section, we present experiments designed to answer the following research questions: (RQ1) How effective is Prolex at learning policies from human demonstrations? (RQ2) What is the relative significance of each of the key components in our synthesis algorithm? (RQ3) How does Prolex compare against relevant baselines in terms of learning policies that match the user's intention?

Benchmarks and Experimental Set-up
Tasks. We evaluate Prolex on a set of 40 programmatic LfD problems involving long-horizon service tasks in typical household environments. We gathered these tasks from two sources, which we describe below. First, we use 25 tasks from the Behavior Project [Srivastava et al. 2022]. This project is an interactive platform designed for a virtual embodied AI agent operating within simulated household environments. The Behavior Project offers simulation videos of the agent carrying out a range of long-horizon tasks. These demonstrations are produced by human users who use a joystick to control the agent's movements and actions in the simulation environment. Out of the 100 tasks provided in the Behavior Project, we observed that 75 required precise low-level motion control of the robot's arms, such as detailed cleaning of a car's surfaces. As our DSL can trivially execute these tasks without any conditionals or loops, we excluded them from our evaluation. The demonstration videos for the remaining 25 tasks were transcribed using our action alphabet by an expert annotator (one of the authors) who is knowledgeable about the skill abstractions used by the robot. Next, to broaden the scope of our evaluation and include more realistic tasks, we surveyed students at our institution and collected 15 additional tasks to evaluate Prolex on. We did not impose any requirements regarding the users' technical expertise, as our aim was to gather a list of household tasks that a typical end-user would consider beneficial to have automated. A full list of the above 40 tasks and their ground-truth programs is provided in Appendix A.6.
Figure 13a presents an overview of the ground-truth programs for these tasks, ordered by program size. In more detail, the right y-axis represents program size (measured as the number of nodes in the abstract syntax tree) and the left y-axis represents the number of loops, conditionals, and perception primitives in the program. Tasks from the Behavior Project are labeled B1...B25, and surveyed tasks are labeled S1...S15.

Environments. All of our benchmarks are defined in three household environments, summarized in Figure 13b. Because the difficulty of the synthesis task depends crucially on the number of object types and objects in the environment, we classify the three environments as Easy, Medium, and Hard based on these numbers. As we can see from Figure 13b, these environments contain up to thousands of objects and over ten thousand properties.
Full benchmark set. Overall, we evaluate Prolex on a total of 120 benchmarks, with 40 unique tasks and 3 different environments. For each of the 40 tasks, we manually write the ground truth program in our DSL and obtain a demonstration by running the ground truth program. Figure 14 presents statistics regarding the length of these demonstrations. On average, demonstrations in the Easy, Medium, and Hard environments consist of 11, 13, and 24 actions, respectively. Some demonstrations in the Hard environment exceed 100 actions due to the large number of object instances that need to be handled.
Experimental set-up. Our experiments were conducted on a server with 64 available Intel Xeon Gold 5218 CPUs @ 2.30GHz and 264GB of available memory, running Ubuntu 22.04.2. We use a time limit of 120 seconds per task in all of our experiments.

Main Results for Prolex
Our main results are presented in Figure 15, with Figure 15a showing the percentage of completed tasks against synthesis time. We consider a task to be completed if Prolex is able to synthesize a policy within the time limit, and solved if the learned program also matches the user's intent. We determine whether a task is solved by comparing the learned program against the manually written ground truth program and checking if the two are semantically equivalent. Since the manually written policy is intended to work in all environments, tasks classified as "solved" generalize to unseen environments.
Running time. In Figure 15a, the three lines indicate the percentage of benchmarks (y-axis) completed within a given time limit (x-axis) for each of the three environments (Easy, Medium, Hard). Across all environments, Prolex completes 36% of the tasks within the first 5 seconds, 28% of the tasks within 6-30 seconds, and 16% of the tasks within 31-120 seconds. Overall, Prolex completes 80% of the tasks within the 2-minute time limit.
Figure 15 also provides a more detailed look at these statistics by showing the percentage of completed tasks with respect to task complexity. In particular, Figure 15b shows the rate of completion according to the size of the ground truth program. As expected, the learning problem becomes harder as the complexity of the target policy increases. However, even for the most complex programs, Prolex is still able to complete the learning task for 68% of the benchmarks.
Generalizability. As mentioned earlier, completing a task is not the same as "solving" it, since the synthesized policy may not match user intent despite being consistent with the demonstrations. We manually inspected all programs synthesized by Prolex and found that the synthesized program is equivalent to the ground truth in 81% of the completed cases. Interestingly, as shown in Figure 15c, Prolex's generalization power improves as environment complexity increases. Intuitively, the more complex the environment, the more objects there are with different properties, so it becomes harder to find multiple programs that "touch" exactly the same objects as the demonstration.
Search Statistics. As explained in Section 4, sketches in Prolex are enumerated in increasing order of complexity. In each iteration, a finite number of completions are considered for the current sketch. If none of these completions aligns with the demonstrations, the algorithm advances to the next sketch. This process continues until either a successful match is found or the time limit is reached. Figure 16 presents, for each task in all three environments, the average number of sketches encountered before finding the solution or timing out. For almost all tasks, Prolex considers fewer than 50 sketches before termination. One task (S5) is exceptionally challenging and requires 143 sketches before a solution is found.

Failure analysis. As discussed above, there are two reasons why Prolex may fail to solve a task: (1) it fails to find a policy consistent with the demonstrations within the time limit, or (2) the synthesized policy does not adequately capture the user intent.
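The enumerate-then-complete loop just described can be summarized schematically. The sketch below is a simplification of Algorithms 1 and 2 in the paper; all helper functions are stand-ins:

```python
# Schematic enumerative search: try sketches in increasing complexity
# order; for each, try a bounded number of completions; return the first
# program consistent with the demonstrations.

def synthesize(sketches, completions_of, consistent, budget_per_sketch=50):
    """`sketches` is ordered by increasing complexity; `completions_of`
    yields at most `budget_per_sketch` candidate programs per sketch;
    `consistent` checks a program against the demonstrations."""
    tried = 0
    for sketch in sketches:
        tried += 1
        for program in completions_of(sketch, budget_per_sketch):
            if consistent(program):
                return program, tried   # solution plus #sketches examined
    return None, tried                  # search space (or time) exhausted
```

In the real system, the per-sketch completion step is where LLM guidance and unrealizability pruning apply, and each sketch is completed in its own process.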
We have manually inspected both classes of failure cases and report on our findings. The main cause of Prolex's timeouts is perception operations. Many of our environments contain a large number of object types, all of which can be arguments of scan operations. Our approach tries to overcome this issue by using an LLM to guide the search, but in some cases, the LLM proposes the wrong object type to scan for. This causes the synthesizer to go down a rabbit hole, particularly when the proposed object type has many properties associated with it. We believe that more advanced LLMs, capable of reasoning about finer-grained relationships between the environment and the context of the task, could mitigate this issue.
We also inspected the cases where Prolex finds a robot execution policy consistent with the demonstrations, but the synthesized policy does not generalize to different environments (i.e., it completes the task but fails to solve it). There are two main reasons for this, both due to the inadequacy of the demonstrations with respect to the desired task. Specifically, if there is only one instance of a particular object type in the environment, the synthesizer may not return a program with a foreach loop over that object type, even though the ground truth program contains such a loop. Likewise, Prolex is capable of inferring conditional blocks only if the demonstrations are branch complete. This means that, in the demonstration environment, some instances of a manipulated object type must satisfy the intended property, while others must not. If all (or none) of the object instances of that type are acted upon, the synthesizer cannot learn the conditional block for such manipulations. As noted earlier, this generalizability issue becomes less of a problem in more complex environments with many instances of an object type.

RQ1 Summary. Given a 2-minute time limit, Prolex is able to find a policy consistent with the demonstrations for 80% of the benchmarks. Furthermore, 81% of the synthesized programs correspond to the ground truth, meaning that they can generalize to unseen environments.

Ablation Studies
As mentioned throughout the paper, there are three key components underlying our approach, namely (1) learning control flow structures (i.e., sketches), (2) using LLMs to guide sketch completion, and (3) a new technique for proving unrealizability. To better understand the relative importance of each component, we present the results of an ablation study where we disable each component or combination of components. Specifically, for our ablation study, we consider the following variants of Prolex:
• Prolex-NoSketch: This is a variant of Prolex that does not generate program sketches using regex learning.
• Prolex-NoLLM: This is a variant of Prolex that does not utilize LLMs for sketch completion.
• Prolex-NoPrune: This is a variant of Prolex that does not utilize the unrealizability checking procedure (Algorithm 2) for pruning the search space during sketch completion.
• Prolex-SketchOnly: This is a variant of Prolex that infers control flow sketches (through regex learning) but utilizes neither LLM guidance nor unrealizability checking during sketch completion.
• Prolex-NoLoopBound: This is a variant of Prolex that summarizes loops using the Kleene star operator during the unrealizability checking procedure. That is, it does not reason about the number of loop iterations; it is otherwise identical to the full Prolex tool.
Figure 17a shows the results of our ablation study in the form of a cumulative distribution function (CDF). The x-axis represents the cumulative running time, while the y-axis shows the percentage of benchmarks solved across all environments. The results indicate a significant gap between the number of tasks solved by Prolex and its variants defined above. In particular, Prolex is able to solve 8% more tasks than the NoLLM variant, 19% more than NoPrune, 24% more than SketchOnly, and 8% more than NoLoopBound. However, the results for the NoSketch variant are particularly poor, with none of the tasks solved.

RQ2 Summary. All of the key components of our proposed synthesis algorithm contribute to the practicality of our learning approach. The most important component is regex-based sketch generation, without which none of the tasks can be solved. The unrealizability checking procedure helps solve an additional 18% of the tasks, and LLM guidance increases the success rate by another 8%.

Comparison with Alternative Approaches
In this section, we report on our experience comparing Prolex against alternative approaches. While there is no existing off-the-shelf LfD approach that targets our problem domain (see Section 6.1), we compare Prolex against the following two baselines:
• CVC5: We formulate our learning problem as an instance of syntax-guided synthesis and use a leading SyGuS solver, CVC5 [Barbosa et al. 2022], as a programmatic policy synthesizer.
• GPT-Synth: We use an LLM as a neural program synthesizer in our domain. To this end, we consider a baseline called "GPT-Synth" that synthesizes programs in our DSL from demonstrations.
Case Study with CVC5. Our programmatic LfD task can be reduced to an instance of the syntax-guided synthesis (SyGuS) problem [Alur et al. 2015a], which is the standard formulation for synthesis problems. To compare Prolex against state-of-the-art SyGuS solvers, we encoded our tasks as SyGuS instances and leveraged an off-the-shelf solver, namely CVC5 [Barbosa et al. 2022], the winner of the most recent SyGuS competition.
To perform this comparison, we defined our DSL using the syntactic constraints in SyGuS, and we incorporated semantic constraints based on the initial and final environment states in the demonstrations. Note that SyGuS solvers are unable to perform synthesis from demonstrations, as demonstrations correspond to intermediate program states, which are not expressible in the SyGuS formulation. Hence, when comparing against CVC5, we only use the initial and final environments and consider a task to be completed if the solver returns a policy that produces the desired environment. Furthermore, we only compare the performance of Prolex and CVC5 on the "Easy" environment, as encoding the environments in SyGuS requires significant manual labor.
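As a rough illustration of this style of encoding (and only an illustration: the sort names, grammar placeholder, and state predicates below are our own inventions, not the actual encoding used in the experiments), one can assemble a SyGuS-style query that constrains only the initial and final environments:

```python
# Sketch: build a schematic SyGuS problem whose only semantic constraint
# relates the initial environment to the final one, mirroring the fact
# that intermediate demonstration states cannot be expressed in SyGuS.

def build_sygus_query(init_env: str, final_env: str, grammar: str) -> str:
    """Assemble a SyGuS-style query as a string. `grammar` is a
    placeholder for the DSL's syntactic constraints."""
    return "\n".join([
        "(set-logic ALL)",
        f"(synth-fun policy ((env Env)) Env {grammar})",
        f"(constraint (= (policy {init_env}) {final_env}))",
        "(check-synth)",
    ])

query = build_sygus_query("init-env", "goal-env", "((Start Env (...)))")
```

The absence of any constraint on intermediate states is exactly why a solver can return a policy that reaches the goal environment by a different action sequence than the demonstration.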
When we tried to use CVC5 to perform synthesis from scratch (i.e., without a sketch), it was not able to complete any task within a reasonable time limit. Hence, for this evaluation, we manually provided CVC5 with the ground truth sketch for each task. The results of this comparison are presented in Figure 17b. As in Figure 17a, this figure plots the cumulative distribution of the percentage of synthesized programs against solver time. Overall, CVC5 solves only 25% of the tasks, compared to 88% solved by Prolex.
Case Study with GPT-Synth. For this experiment, we use the GPT-3.5 LLM to generate programmatic policies from demonstrations. This study aims to evaluate the effectiveness of LLMs as end-to-end program synthesizers; we refer to this baseline as GPT-Synth.
The prompts used in this study were developed using the "template" pattern described by White et al. [2023]. This pattern has been shown to be highly effective in situations where the generated output must conform to a specific format that may not be part of the LLM's training data. In particular, for each benchmark task, we created a prompt that includes a description of our DSL, a set of example demonstrations and environment states, and the correct programs for the desired task. For a novel task, we provide the language model with the prompt, the demonstrations and environment states for the novel task, and ask it to generate a corresponding program.
To use GPT-3.5 as an end-to-end synthesizer, we adopt the following methodology, as done in prior work [Chen et al. 2021b]. If the first program returned by the LLM is consistent with the demonstration, GPT-Synth returns that program as the solution. Otherwise, it asks the model to produce another program, for up to 10 iterations. Unfortunately, we found that GPT-Synth is unable to solve any of the benchmarks when we provide both the demonstrations and the environment. This behavior seems to be caused by the large number of entities and relations in the environment; prior work has reported similar results in other domains (e.g., planning [Mahowald et al. 2023]) where LLMs were used to perform tasks in non-trivial environments.
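The retry methodology above amounts to a short generate-and-check loop. The sketch below is our own rendering of it, with `query_llm` and `consistent` as stand-in functions rather than any real API:

```python
# Sketch of the GPT-Synth methodology: accept the first LLM-proposed
# program that is consistent with the demonstration; otherwise re-query
# (with prior attempts in the prompt), up to a fixed iteration budget.

def gpt_synth(query_llm, consistent, max_iters=10):
    """`query_llm` maps the list of failed attempts to a new candidate
    program; `consistent` checks a candidate against the demonstration."""
    history = []
    for _ in range(max_iters):
        program = query_llm(history)
        if consistent(program):
            return program
        history.append(program)   # ask the model for another program
    return None                   # budget exhausted without a solution
```

Note that this loop can only validate candidates against the demonstration; it has no mechanism, unlike Prolex's unrealizability check, for ruling out whole families of candidates at once.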
To gain more intuition about how the language model scales with environment size, we report on our experience with using GPT-Synth on five representative tasks involving toy environments. We construct these toy environments by incorporating only the objects (and their properties) required for the task, plus some additional objects and properties. Figure 18a shows how the success rate of GPT-Synth scales with respect to environment size. Here, the x-axis shows the number of additional objects (and their properties) in the environment, and the y-axis shows the success rate. As we can see from Figure 18a, GPT-Synth works well if it is given only the relevant objects (which is not a realistic usage scenario), but, as environment size increases, its success rate drops dramatically. In fact, when the environment contains only 10 additional objects, a tiny fraction of our "Easy" environment, the success rate of GPT-Synth already drops to zero.
The reader may wonder whether the environment is actually necessary for GPT-Synth to learn the correct program. To answer this question, we perform an additional experiment where we provide GPT-Synth with only the demonstration, but not the environment. The results of this evaluation are presented in Figure 18b, where we classify tasks into two categories, "Forall" and "Conditional". The former class of tasks does not involve branching, whereas the latter does. Here, "Completed" shows the percentage of tasks for which GPT-Synth finds a program consistent with the demonstration, and "Solved" shows the percentage of tasks for which GPT-Synth returns a program that also matches the ground truth. As we can see, GPT-Synth returns a program consistent with the demonstration for 45% of all tasks, but it is only able to identify the ground truth program in 25% of the cases. Furthermore, as one might expect, GPT-Synth is much more effective on the simpler "Forall" category of tasks, which involve acting on all instances of a particular type. In contrast, the "Conditional" category is much more challenging without access to the environment, and the success rate of GPT-Synth is only 11% for this category. Intuitively, without knowing which objects have what properties, GPT-Synth has little chance of knowing that there should be a conditional and what its predicate should be. We came across a few cases where GPT-Synth is able to "hallucinate" the right predicates after several rounds of interaction; but, in general, guessing the user's intent without knowing the environment is at best a matter of sheer luck.

RQ3 Summary. Prolex performs significantly better than the CVC5 and GPT-Synth baselines. Even when given the ground truth sketches, CVC5 is only able to return a program consistent with the final environment in 25% of the cases. The GPT-based synthesizer cannot solve any tasks when provided with both the demonstration and the full environment,
but it is able to solve 25% of the tasks when it is given only the demonstration.

Related Work

Robot Learning from Demonstrations. Our approach builds upon a substantial body of literature on the use of Learning from Demonstration (LfD) techniques to learn robot execution policies [Argall et al. 2008, 2009; Sosa-Ceron et al. 2022]. This literature can be broadly categorized into two approaches: (1) learning neural models to represent robot behaviors [Ho and Ermon 2016; Kober et al. 2013; Ly and Akhloufi 2021; Rusu et al. 2017; Sünderhauf et al. 2018; Taylor and Stone 2009; Xiao et al. 2021; Ziebart et al. 2008], and (2) synthesizing programmatic representations of execution policies [French et al. 2019; Holtz et al. 2021, 2020a; Niekum et al. 2015]. The most well-established techniques for learning neural models from demonstrations include behavior cloning within the framework of imitation learning [Ho and Ermon 2016; Ly and Akhloufi 2021] and (deep) reinforcement learning (RL) methods [Kober et al. 2013; Sünderhauf et al. 2018; Ziebart et al. 2008]. Empirical studies have demonstrated the efficacy of these neural policies in perception tasks and their ability to perform well in unknown or ill-defined environments. However, such neural models lack robust interpretability and generalization capabilities; as a testament to this, there exist no neural LfD algorithms to date capable of leveraging the user demonstrations in the Behavior benchmarks [Srivastava et al. 2022]. The field of transfer learning [Pan and Yang 2010; Rusu et al. 2017; Taylor and Stone 2009] aims to resolve generalization problems to some degree and also to enhance data efficiency. A setting related to our work is applying reinforcement learning (RL) by specifying the task via a formal specification of the goal conditions; in fact, the Behavior benchmark set [Srivastava et al.
2022] reports the results of such an approach. However, even in the simplest 12 activities, the RL algorithms are unable to complete the tasks even when initiated close to the goal states. Further, even when the actions are abstracted into symbols in Behavior-1K [Li et al. 2023], RL approaches demonstrate very poor performance. These results mainly highlight the complexity of the tasks that we tackle in this paper.
Recently, there has also been growing interest in developing techniques to enhance the transparency and reliability of RL systems through formal explanations [Glanois et al. 2021; Krajna et al. 2022; Li et al. 2019; Topin and Veloso 2019]. These techniques aim to explain different aspects of the learned models, such as inputs and transitions, by finding interpretable representations of neural policies, such as Abstracted Policy Graphs [Topin and Veloso 2019] or structures in a high-level DSL [Verma et al. 2018]. More recently, there has also been interest in utilizing program synthesis methods [Holtz et al. 2020b; Xin et al. 2023] to learn robot execution policies from demonstrations as an alternative to neural model learning [Holtz et al. 2021, 2020a]. These approaches provide improved interpretability, generalizability [Holtz et al. 2018], and data efficiency. Prolex falls into the same class of techniques as these approaches but broadens their applicability in several ways: First, it can learn policies for long-horizon tasks; second, it can synthesize programs with complex control flow, such as loops with nested conditionals and loops; and, third, it can handle environments with a large number of objects and properties.
Lastly, recent work proposes a methodology for automated learning of robotic programs for long-horizon human-robot interaction policies from multi-modal inputs, including demonstrations and natural language [Porfirio et al. 2019, 2023]. The contributions of this work are largely complementary to ours: their main focus is an interface that allows end-users to draw the navigation path of a robot on a 2D map of the environment and infers symbolic traces from this raw path representation. Similar to our regex-learning-based sketch inference algorithm, the synthesizer in this work utilizes an automata learning approach to generalize user-provided traces into Mealy automata, based on the approach in [Neider 2014]. However, the control structures in their learned programs only admit restricted loops with no nesting. Additionally, this synthesizer does not produce loops over objects and locations in the environment and does not address challenges related to perception, i.e., inference of objects not present in the demonstrations.

N. Patton, K. Rahmani, M. Missula, J. Biswas, and I. Dillig. Proc. ACM Program. Lang., Vol. 8, No. POPL, Article 18. Publication date: January 2024.
Among these, our method is most closely related to synthesizers that enable programming by demonstration (PbD) [Chasins et al. 2018; Lau et al. 2003]. Existing PbD techniques generalize programs either from sequences of user actions [Chasins et al. 2018; Dong et al. 2022] or from sequences of program states [Lau et al. 2003]. Our approach is similar to the former, and in particular to WebRobot [Dong et al. 2022], which can synthesize challenging programs with multiple nested loops from sequences of user actions. However, WebRobot cannot synthesize programs with conditional blocks, which are essential for successfully performing our tasks. Moreover, WebRobot targets web process automation tasks and does not address robotics-related challenges, such as perception and environment complexity, that play a big role in this work.

Synthesis of Control Structures.
A main focus of this paper is the problem of inferring control structures from demonstration traces [Bar-David and Taubenfeld 2003; Barthe et al. 2013; Biermann et al. 1975]. This is a recognized and challenging problem studied across various synthesis methodologies, including approaches for code synthesis from black-box reference implementations [Heule et al. 2015; Jha et al. 2010b; Ji et al. 2023] and human-in-the-loop approaches for program learning [Ferdowsifard et al. 2021; Newcomb and Bodík 2019; Pu et al. 2022]. For instance, LooPy [Ferdowsifard et al. 2021] is a recent human-in-the-loop method that relies on the programmer to act as an oracle and identify properties of consecutive iterations within the body of the target loop. Unlike these approaches, Prolex only requires a small number of demonstrations with no additional hints from the end-user.
There are relatively few synthesis algorithms that can infer nested loops with branching, and they typically rely on domain-specific simplifying assumptions to address these challenges. For instance, Rousillon [Chasins et al. 2018] is a recent synthesizer that deals with loops and is specifically designed for extracting tabular data from web pages. While Rousillon supports nested loops, it can only generate side-effect-free programs intended for information retrieval. FrAngel [Shi et al. 2019] is a component-based synthesizer that also handles nested control structures but requires users to provide numerous examples, including base and corner cases. Lastly, there is a line of work on loop unrolling and rerolling for low-level hardware and software optimization [Ge et al. 2022; Rocha et al. 2022; Sisco et al. 2023], using techniques such as term rewriting [Nandi et al. 2020]. We believe that our proposed sketch generation and unrealizability proving techniques could be generally useful in any setting that (a) requires synthesizing complex control flow structures and (b) gives the algorithm access to execution traces.
Reactive Program Synthesis. This paper is also related to a long line of work on reactive synthesis, where the typical goal is to synthesize finite state machines (FSMs) from temporal specifications [Baumeister et al. 2020; Choi 2021; Finkbeiner et al. 2019; He et al. 2017; Krogmeier et al. 2020]. Traditionally, reactive synthesis has been viewed as a game between two players, the controlled system and its environment, and solving the reactive synthesis problem boils down to
finding a winning strategy for the controlled system. The reactive synthesis problem is computationally intractable for general classes of specifications, such as monadic second-order logic or full linear temporal logic [Pnueli and Rosner 1989], but work by Bloem et al. [2012] has shown that this problem can be made tractable by restricting the logic to a subclass known as GR(1) specifications. Another successful approach to enhancing reactive synthesizers is bounded synthesis, where the number of states of the synthesized implementation is bounded by a constant [Finkbeiner and Klein 2018; Finkbeiner and Schewe 2013]. Generally speaking, existing reactive synthesis methods differ from our work in three major ways: First, they take as input temporal logic specifications rather than demonstrations. Second, they are based on deductive rather than inductive synthesis. Third, the focus of reactive synthesis is on synthesizing reactive systems represented as finite state machines that take some input from the environment and respond with an output (e.g., an action) for a single time step. Thus, while reactive synthesis is well-suited to low-level motor controllers (e.g., robot social navigation), it is a poor fit for long-horizon tasks, where programs need to reason about actions that may take many time steps to complete and where the program must relate properties of the initial state to the chosen sequence of long action sequences (e.g., when a shelf has limited access, deciding to put away the groceries that go at the back of the shelf first, before those in the front).
Recent work has expanded the scope of reactive synthesis in several ways. For example, recent work by Choi [2021] proposes to combine reactive synthesis with SyGuS (syntax-guided synthesis) to synthesize a broader class of programs that both interact with the environment (i.e., are reactive) and can also perform data processing. This technique, however, takes as input temporal stream logic (TSL) specifications modulo some background first-order theory. Furthermore, the algorithmic focus of that work is very different in that they show how to combine classical reactive synthesis with SyGuS, whereas our focus is on inductive synthesis for learning generalizable long-horizon policies from a small number of demonstrations.
Another recent related work is [Das et al. 2023], which introduces a functional reactive synthesis algorithm for learning programs that match a sequence of observed grid frames and corresponding user actions. This work is similar to ours in that it also learns programs from observed traces rather than temporal specifications. They perform synthesis by combining standard functional synthesis techniques with an automata synthesis approach to discover time-varying latent state in the program. However, the focus of that work is to discover causal mechanisms in Atari-style, time-varying grid worlds. As such, their Autumn DSL is used for specifying how the next state should be computed upon the occurrence of relevant events, whereas our DSL is used to express a sequence of robot actions over a time horizon. As a result, the underlying synthesis techniques are also different: for example, they use automata learning to discover latent state, whereas we employ automata learning for the entirely different purpose of learning complex control structures.
Finally, reactive functional programs have also been integrated with probabilistic programming features to design simulators for human-robot interaction, enabling the sampling of test scenarios [Chung and Cakmak 2022]. These simulators generate complex event streams of human actions based on distributions learned from demonstrations. The synthesis strategy employed in this work involves providing a sketch of the target program as input and using probabilistic inference techniques, such as MCMC, to complete the unknown parameters.
Enhancing Synthesis using ML Models. Machine learning has proven to be highly effective for improving the speed and accuracy of synthesis [Cambronero et al. 2023; Kalyan et al. 2018; Pailoor et al. 2021; Rahmani et al. 2021; Verbruggen et al. 2021; Ye et al. 2020]. For example, neural generators trained on partial programs (i.e., sketches) have been shown to accurately predict the full body of a method from just a few API calls or data types [Chen et al. 2020; Murali et al. 2017; Nye et al. 2019]. In addition, LLMs have been utilized to guide program search [Jain et al. 2022]. For example, the GPT-3 language model has been applied to mine program components and their distributions for multi-modal program synthesis tasks [Rahmani et al. 2021]. Our work is similar in approach and leverages an LLM to improve program synthesis. However, to the best of our knowledge, PROLEX is the first approach to leverage the LLM's prior knowledge of the semantic relations between real-world entities and actions to guide the search towards reasonable completions.
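As an illustration of this style of LLM guidance, the toy sketch below ranks candidate hole completions with a scoring function and explores assignments best-first. All names here (`score_completion`, `best_first_fill`) are hypothetical, and the hand-written stub score merely stands in for the semantic prior an LLM would supply; it is not PROLEX's actual prompting mechanism.

```python
import heapq, itertools

def score_completion(context, candidate):
    # Stub: prefer semantically related pairs. In a real system, an LLM
    # would assign these plausibility scores.
    related = {("bed", "dirty"), ("bed", "on-top-of")}
    return 1.0 if (context, candidate) in related else 0.1

def best_first_fill(sketch_holes, candidates, context):
    # Explore hole assignments in decreasing order of cumulative score.
    counter = itertools.count()          # tie-breaker so dicts never compare
    frontier = [(-0.0, next(counter), {})]
    while frontier:
        neg_score, _, assign = heapq.heappop(frontier)
        if len(assign) == len(sketch_holes):
            return assign                # first complete assignment is best
        hole = sketch_holes[len(assign)]
        for cand in candidates[hole]:
            s = score_completion(context, cand)
            heapq.heappush(frontier,
                           (neg_score - s, next(counter), {**assign, hole: cand}))
    return None
```

With the stub scores above, the search immediately homes in on the semantically plausible completions (dirty, on-top-of) rather than enumerating all alternatives.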
A related field of research, neurosymbolic programming, seeks to combine advances in end-to-end machine learning techniques with program synthesis by leveraging compositional programming abstractions as a means of reusing learned modules across various tasks [Bowers et al. 2023; Chaudhuri et al. 2021; Chen et al. 2021a; Huang et al. 2020; Inala et al. 2020; Mao et al. 2019; Sun et al. 2022; Verma et al. 2019; Witt et al. 2023; Zhan et al. 2021]. Because our current approach is based on a symbolic environment representation, it does not require a neurosymbolic DSL.
Program Sketching. Program sketches have been introduced as a syntactic framework to guide the generation of candidate programs during a search process. This approach was initially presented in [Solar-Lezama 2008] and has since been widely used [Bornholt et al. 2016; Dong et al. 2022; Solar-Lezama 2008, 2009; Solar-Lezama et al. 2006; Wang et al. 2019b; Yaghmazadeh et al. 2017]. While some approaches utilize program sketches that are crafted by the user, others automatically generate a sketch based on natural language [Chen et al. 2020; Yaghmazadeh et al. 2017] or a reference implementation [Wang et al. 2019b]. Our method also decomposes the synthesis task into separate sketch generation and sketch completion steps, but it utilizes regex learning to find a sketch that is likely to be a consistent generalization of the user demonstrations.
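To make the regex-learning idea concrete, the following minimal sketch abstracts each demonstrated action to a single letter and checks whether a candidate regex generalizes all demonstration strings. The action alphabet and function names are illustrative assumptions, not the paper's actual abstraction function.

```python
import re

# Illustrative abstraction: each action kind maps to one letter.
ABSTRACT = {"goto": "g", "pick": "p", "place": "q"}

def abstract_trace(demo):
    # A demonstration is a list of action names; abstract it to a string.
    return "".join(ABSTRACT[a] for a in demo)

def sketch_consistent(regex, demos):
    # A candidate sketch (regex) is consistent if it matches the abstracted
    # string of every demonstration.
    return all(re.fullmatch(regex, abstract_trace(d)) for d in demos)
```

A sketch learner would then search for a small regex, such as `(gpgq)+`, whose looping structure generalizes every abstracted demonstration rather than merely memorizing one trace.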
Unrealizability of Program Synthesis. Many prior techniques enhance program synthesis by establishing that a synthesis sub-problem is unrealizable [Feng et al. 2018b; Hu et al. 2019; Lee et al. 2016; Mechtaev et al. 2018; Tiwari et al. 2015; Vechev et al. 2010; Wang et al. 2018a]. Existing methods typically rely on domain-specific static analysis and logical reasoning to establish unrealizability for the task of Programming by Example (PbE). For instance, various approaches have reduced this problem to an SMT instance and leveraged external solvers to find a proof of unrealizability [Farzan et al. 2022; Feng et al. 2017, 2018b; Hu et al. 2019; Kim et al. 2021; Polikarpova et al. 2016]. While the general problem of unrealizability is undecidable [Madhusudan et al. 2018], some recent approaches have used abstract interpretation techniques to establish unrealizability more effectively [Chen et al. 2020; Lee et al. 2016; Wang et al. 2018a]. Recently, Kim et al. [2023] proposed a Hoare-style reasoning system to formally define, establish, and explain the unrealizability of a problem, aiming to unify existing methods in this domain. Our synthesis technique also utilizes program abstractions to establish the unrealizability of a search path and prune partial programs. However, to the best of our knowledge, our approach is the first to check compatibility between user demonstrations and partial programs in order to establish unrealizability in the PbD setting. A different notion of trace compatibility has been proposed for synthesis-based transpilation [Mariano et al. 2022]; however, that work differs from ours in several ways. First, they define compatibility between traces of two different programs, while ours is defined between a program and a user demonstration. Second, their technique for checking compatibility between traces is very different from ours and relies on a collecting semantics [Cousot and Cousot 1977] of the programming language.

Conclusions and Future Work
We proposed a new programmatic LfD approach, based on program synthesis, for learning robot execution policies for long-horizon tasks in complex environments. Our approach first generates a program sketch capturing control flow structure by computing a string abstraction of the given demonstrations and inferring a regular expression that matches those strings. In the second, sketch completion phase, our algorithm performs LLM-guided top-down search and utilizes a novel procedure for proving unrealizability of partial programs. The latter algorithm for proving unrealizability can be easily adapted to other PbD settings: the key idea is to generate, via static analysis, a regular expression that captures all traces of a partial program and then check whether the string representation of the demonstrations belongs to this language.
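This unrealizability check can be sketched as follows. The rendering below is a simplification under assumed data structures (a toy list-based AST with "?" for holes, rather than the PROLEX DSL): each hole is over-approximated by a wildcard, so if the abstracted demonstration falls outside the resulting regular language, no completion of the partial program can be consistent with it.

```python
import re

def prog_to_regex(partial_prog):
    # partial_prog: list of abstract action letters, "?" for a hole, or
    # ("loop", body) for a loop -- an illustrative AST, not the paper's DSL.
    parts = []
    for node in partial_prog:
        if node == "?":
            parts.append("[a-z]*")                         # hole: anything
        elif isinstance(node, tuple) and node[0] == "loop":
            parts.append("(" + prog_to_regex(node[1]) + ")*")
        else:
            parts.append(re.escape(node))                  # concrete action
    return "".join(parts)

def may_be_realizable(partial_prog, demo_string):
    # If the abstracted demonstration is outside the over-approximating
    # language, the partial program is unrealizable and can be pruned.
    return re.fullmatch(prog_to_regex(partial_prog), demo_string) is not None
```

Because the regex over-approximates every completion's traces, a failed membership test is a sound proof of unrealizability, while a successful test merely allows the search to continue.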
We have evaluated our implementation, PROLEX, on 120 benchmarks and show that PROLEX is able to synthesize complex policies with several (nested) loops and conditionals and that it scales to large environments containing thousands of objects and dozens of distinct object types. Overall, given a 120 second time limit, PROLEX is able to find a program consistent with the demonstrations for 80% of the benchmarks. Furthermore, for 81% of the completed tasks, PROLEX can learn the ground truth program from a single demonstration. To put these numbers in context, we also compare PROLEX against two baselines, a state-of-the-art SyGuS solver and a neural LLM-based synthesizer, and show that PROLEX significantly outperforms both of them.
In future work, we are interested in deploying this technique on real robots in physical environments. To this end, we plan to integrate a semantic-aware perception frontend like Kimera [Rosinol et al. 2021] to extract the symbolic state of the world as a semantic scene graph, as such a representation would be directly compatible with the PROLEX DSL. We are also interested in building a web-based graphical interface to our robots for gathering user demonstrations in deployments, building on existing robot deployment management systems like RoboFleet [Sikand et al. 2021].

A Supplementary Material
This section contains the supplementary definitions referenced throughout the paper.

A.1 Auxiliary Relation →
The auxiliary relation →, which defines the effect of actions on an environment, is given in Figure 19. It is similar to the relation used in Figure 7; however, the partial variable store (Γ) replaces the complete variable store.
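For intuition, a minimal concrete-environment transition function in the spirit of the → relation might look as follows. The `Env` fields and action names are illustrative assumptions modeled on the household domain; the special "carried" location mirrors the one used for held objects in Figure 19.

```python
from dataclasses import dataclass, replace

CARRIED = "carried"  # special location for objects held by the robot

@dataclass(frozen=True)
class Env:
    locs: frozenset   # L: set of locations
    robot_loc: str    # the robot's current location
    obj_loc: tuple    # interpretation fragment: (object, location) pairs

def apply_action(env, action, *args):
    # Apply one atomic action, returning the updated environment.
    objs = dict(env.obj_loc)
    if action == "goto":
        (dest,) = args
        assert dest in env.locs
        return replace(env, robot_loc=dest)
    if action == "pick":
        (obj,) = args
        assert objs[obj] == env.robot_loc   # robot must be co-located
        objs[obj] = CARRIED
        return replace(env, obj_loc=tuple(objs.items()))
    if action == "place":
        (obj,) = args
        assert objs[obj] == CARRIED         # robot must be holding it
        objs[obj] = env.robot_loc
        return replace(env, obj_loc=tuple(objs.items()))
    raise ValueError(f"unknown action {action}")
```

Each domain-specific rule in Figure 19 corresponds to one branch here: preconditions appear as assertions and postconditions as the returned environment update.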

A.4 Definition of UpdateAbsEnv
In Figure 21 we provide domain-specific rules describing how atomic actions may affect an abstract environment. We assume the abstract transition relation between Ê and Ê′ is defined similarly to the auxiliary relation in Figure 19; i.e., abstract environments also contain an interpretation of relations and properties. An atomic action updates the abstract environment in the same way as a concrete environment; however, since variables may refer to a set of potential objects or locations, UpdateAbsEnv computes an over-approximation reflecting the effects of the action taken on all potential objects. In the rules, we assume Ê[] represents [...]

Here we provide the proof of Theorem 4.2 from Section 4. We begin by restating the theorem.
Let ρ be an arbitrary partial program and let D be the given set of demonstrations. We must prove that, for any complete program P that is a completion of ρ, the following holds:

  P is consistent with D  ⟹  Compatible(ρ, D)   (1)

Based on the definition of the Compatible function given in Algorithm 3, we reduce statement (1), by factoring out the universal quantifier on demonstrations, to compatibility with each individual demonstration, where the compatibility of ρ with a single demo d in environment E is defined as follows:

  Compatible(ρ, E, d) := α(d) ∈ ProgToRegex(PartialEval(ρ, E), Ê)

where Ê = α(E), i.e., the abstraction of the environment E as defined below. As mentioned in Section 4, abstract environments admit a set of possible current locations for the robot and drop the interpretation of relations, since they are not needed for establishing unrealizability:

  α(E) := ({ℓ}, E.locs, E.objs)

Hence, we can reduce Theorem 4.2 to the following, where P(E) denotes the trace produced by running P on E:

  ∀E. ∀ρ. ∀P ∈ Comp(ρ). α(P(E)) ∈ ProgToRegex(PartialEval(ρ, E), α(E))   (4)

We now construct a proof of (4) by first proving two lemmas. The first lemma states a proposition similar to (4), but without considering the effects of partial evaluation. The second lemma clarifies the relationship between completions of a partial program ρ and completions of PartialEval(ρ, E) for a particular environment E. Together, the two lemmas straightforwardly yield a proof of (4).
Lemma A.1. For an arbitrary environment E, a partial program ρ, and a completion P ∈ Comp(ρ), the abstract trace produced by running P on E is in the language of the over-approximating regular expression of ρ. Formally,

  ∀E. ∀ρ. ∀P ∈ Comp(ρ). α(P(E)) ∈ ProgToRegex(ρ, α(E))

Proof. We prove this lemma by induction on the derivation tree of r = ProgToRegex(ρ, α(E)), using the rules presented in Figure 11.
Atomic Rule. In this case ρ is assumed to be an atomic action (i.e., actUnary, actBinary, or goto); hence, ρ is already a complete program with no holes and the only completion of ρ is itself, i.e., P = ρ. Furthermore, based on the rule Atomic, the derived regex in this case is r = α(ρ):

K. Rahmani, M. Missula, J. Biswas, and I. Dillig
Program Synthesis for Robot Learning from Demonstrations, Vol. 1, No. 1. Publication date: April 2023.

(a) Trace of demonstrated actions (left) and the abstracted string using function α (right). (b) Regexes learned from the string abstracted from the demonstration. The correct regex (bottom) is highlighted to show looping and conditional structures.

Fig. 5. DSL syntax, where x denotes variables and n is a natural number. Rules marked with * are domain-specific.

Fig. 6. Operational semantics. Relation ⇓ is defined in Figure 7, and the definition of relation → can be found in Appendix A.1.

Fig. 10. Over-approximation of robot perception used in the ProgToRegex function.

Fig. 12. Partial programs during synthesis and the generated LLM prompts used to choose the next completion (bottom part of Figure 12a). When generating the prompt for ??1, both holes ??1 and ??2 are filled with masks, and the LLM outputs dirty as the most likely completion for ??1. When querying the remaining hole (??2), ??1 has already been filled with dirty, so the prompt contains only a single mask, and the LLM outputs on-top-of as the most likely completion. As illustrated by these examples, the LLM-guided search strategy allows the sketch completion engine to quickly home in on the right concepts (such as bed, dirty, and on-top-of in this example) and therefore allows the search procedure to focus on the most promising sketch completions.

Fig. 13. Overview of the benchmarks.

[...] be filled with any value, and it performs standard least fixed point computation under these conservative semantics.

A.2 Partial Evaluation
Rules for partial evaluation of programs on an environment are given in Figure 20. Γ is a partial store mapping a subset of variables to concrete values. Γ, E ⊢ ρ → ρ′ denotes the result of partial evaluation under Γ in environment E. The function PartialEval(ρ, E) (referenced in Algorithm 3) returns ρ* if and only if Nil, E ⊢ ρ → ρ*. The Boolean expression reduction relation ⇓ used in partial programs is [...]
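The PartialEval step described above can be illustrated with a toy partial evaluator over an assumed tuple-based AST (not the paper's DSL): guards that are determined by the initial environment reduce to a single branch, while undetermined conditionals are kept symbolic.

```python
def partial_eval(stmt, env):
    # stmt is a tuple-based AST: ("seq", [stmts]), ("if", guard, then, else),
    # or an atomic action such as ("act", name). env maps known guard names
    # to Boolean values; guards absent from env are unknown.
    kind = stmt[0]
    if kind == "seq":
        return ("seq", [partial_eval(s, env) for s in stmt[1]])
    if kind == "if":
        _, guard, then_b, else_b = stmt
        val = env.get(guard)            # None means the guard is unknown
        if val is True:
            return partial_eval(then_b, env)
        if val is False:
            return partial_eval(else_b, env)
        # Unknown guard: keep the conditional, but simplify both branches.
        return ("if", guard, partial_eval(then_b, env), partial_eval(else_b, env))
    return stmt                          # atomic action: unchanged
```

Residualizing unknown conditionals in this way is what lets the downstream unrealizability check over-approximate all traces the partially evaluated program could still produce.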
Fig. 19. Auxiliary relation → for updating environments following robot actions. These rules are specific to each domain and model the dynamics of the robot's environment. A subset of the rules implemented for PROLEX, modeling a typical household environment, is shown above. Recall that an environment is defined as E := (L, O, ℓ, I). A special location is used for objects that are being carried by the robot.

Table 2. Details of the Ground Truth Programs and the Main Synthesis Results.