Data Extraction via Semantic Regular Expression Synthesis

Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.


INTRODUCTION
Regular expressions (or regexes) are a convenient and versatile mechanism for extracting information from textual data. Because of their wide applicability, many programming languages provide built-in support for regular expressions, allowing developers to perform textual pattern matching. Further, because regular expressions have numerous uses in user-facing applications like spreadsheets, recent years have seen an explosion in the number of new techniques for learning regular expressions from examples and/or natural language [Chen et al. 2020; Lee et al. 2016].
Despite their general practicality, regexes are mainly applicable in settings where the desired data extraction task is purely syntactic in nature. For example, regexes are very well-suited to tasks like describing phone numbers and dates because such concepts can be described in terms of a specific syntactic format (e.g., +D-DDD-DDD-DDDD or DD/DD/DDDD). However, many data extraction tasks of practical relevance are not so easy to describe using a purely syntactic pattern. As a simple example, consider the task of extracting business emails from a text file. Any email address must follow a certain syntactic format, but this task also involves a semantic component in that it requires determining whether some text in the email describes a business entity. As another example, consider the problem of extracting zip codes that fall within a certain range. In addition to checking whether a string syntactically matches a zip code pattern (DDDDD or DDDDD-DDDD), it requires interpreting part of the string as a number and then performing a semantic range check, which is difficult to do using regexes.
Based on this observation, this paper proposes the concept of a semantic regex as a mechanism for combining the strengths of syntactic pattern matching with semantic reasoning. Our proposed semantic regexes generalize standard regular expressions in that they provide a semantic pattern matching construct which accepts strings that (a) belong to a category τ (e.g., business, location, person) and (b) satisfy a predicate φ when interpreted as an instance of type τ. For example, this construct can be used to match strings that (a) correspond to a City (type τ), and (b) further satisfy some additional criterion, such as being in the United States or in the state of California (predicate φ). Under the hood, semantic pattern matching employs large language models like GPT-3 [Brown et al. 2020; Chowdhery et al. 2022] to test membership in some category τ but further allows refining the query result using a logical predicate φ. In this sense, one can view our semantic regexes as deciding membership in a refinement type and then combining the matching strings using standard regex operators.
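To make the zip code example concrete, the following minimal Python sketch separates the syntactic check (a plain regex) from the semantic check (a numeric range test on the parsed value). The function name matches_zip_in_range and the chosen range are illustrative assumptions, not part of Smore.

```python
import re

# Syntactic part: five digits, optionally followed by a four-digit extension.
ZIP_PATTERN = re.compile(r"\d{5}(?:-\d{4})?")


def matches_zip_in_range(s: str, low: int, high: int) -> bool:
    """Hypothetical helper: accept s only if it looks like a zip code
    and its five-digit prefix, read as a number, lies in [low, high]."""
    if ZIP_PATTERN.fullmatch(s) is None:
        return False  # fails the purely syntactic check
    # Semantic part: interpret the prefix as an integer and range-check it.
    return low <= int(s[:5]) <= high


if __name__ == "__main__":
    for candidate in ["78712", "78712-1234", "10001", "7871"]:
        print(candidate, matches_zip_in_range(candidate, 78700, 78799))
```

The second step is exactly what a standard regex cannot express compactly, which is the gap that the predicate φ is meant to fill.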
Beyond proposing the notion of semantic regexes, another key contribution of this paper is a new synthesis algorithm for learning semantic regexes from positive and negative examples. The learning problem in this context is more challenging than traditional regex synthesis because semantic regexes are much more expressive than standard regexes. As a result, the hypothesis space in this setting is very large, which has two important consequences:
• First, the semantic regex learning problem cannot be solved using a purely search-based approach due to the sheer size of the search space. In fact, the search space is theoretically not even bounded because our semantic regex language does not restrict the types τ to a pre-defined vocabulary.
• Second, due to the extremely large hypothesis space, there are typically many semantic regexes consistent with a small number of examples. Hence, to find the intended semantic regex, our learning algorithm must have a strong inductive bias towards user intent.
The synthesis technique proposed in this paper surmounts these challenges using a novel combination of three key ideas:
(1) Neural sketch generation: Our learning algorithm uses a large language model (GPT-3) to generate a sketch of the desired semantic regex. Our key observation is that LLMs are well-suited to this task because they are effective at identifying semantic commonalities between the positive examples and inferring appropriate types to be used within the semantic pattern matching constructs.
(2) Compositional synthesis: Our learning algorithm decomposes the synthesis task into multiple simpler sub-problems. Because the holes (i.e., unknowns) in the generated sketches are typed, the synthesis technique lends itself to a compositional solution, where we can synthesize each hole largely (though not entirely) independently.
(3) Type-directed search: The presence of type information in the sketches makes it possible to fill each hole in a type-directed way. Specifically, we utilize a type system with subtype polymorphism to infer the space of valid completions of a hole.
Figure 1 shows the workflow of our proposed learning approach, which first utilizes the provided examples to generate a semantic regex sketch using GPT-3. In the next step, our approach searches for completions of the sketch by (a) decomposing the overall problem into several subproblems and (b) using type-directed synthesis to solve each subproblem. If the sketch has a valid completion, the resulting semantic regex is returned to the user. Otherwise, our approach analyzes the root cause of failure and uses this information to query the language model for a more accurate sketch.
We have implemented the proposed technique in a tool called Smore and evaluated it on information extraction tasks involving several different datasets. Our evaluation shows that these data extraction tasks can be expressed using our proposed semantic regexes and that our learning algorithm is effective at automating them. In particular, our approach achieves an average F1 score of 0.87 on the test data, while prior data extraction techniques achieve a maximum F1 score of 0.65.
To summarize, this paper makes the following contributions:
• We propose semantic regular expressions to combine the flexibility of syntactic pattern matching with semantic queries involving types and logical predicates.
• We describe a new learning technique for synthesizing semantic regexes from positive and negative examples. Our approach combines the power of large language models with type-directed synthesis for effective automation of data extraction tasks.
• We evaluate our tool, Smore, on representative data extraction tasks and show that semantic regexes are useful for these tasks and that our learning approach outperforms other data extraction techniques in terms of average F1 score.

OVERVIEW
In this section, we illustrate our technique using the motivating example shown in Figure 2, which contains information about artworks exhibited at a museum. Given this dataset, suppose that a user wants to extract all European artists who were born before the 20th century and whose name contains Thomas. This data extraction task is challenging because it requires both syntactic and semantic reasoning:
• Syntax: In order to retrieve the desired information from this dataset, we first need to perform pattern matching over the syntax of the "Artist Bio" column. In particular, because this column contains information of the form "Name, Country, Birth Year - Death Year", we first need to syntactically parse the input string into its four constituent fields and check whether the first field (corresponding to the artist name) contains "Thomas".
• Semantics: After performing syntactic pattern matching, we then need to perform semantic reasoning about the contents of each row to understand whether (a) the first field describes a name, (b) the artist's nationality is European and (c) they were born before the 20th century.

Semantic Regexes
Our proposed semantic regex concept is a natural fit for the data extraction task illustrated in this example. Semantic regexes combine the convenience of regexes for syntactic pattern matching with the power of semantic reasoning about data types. In addition to supporting the standard regex operators (concatenation, disjunction, Kleene star), semantic regexes provide the following semantic pattern matching construct, written using a refinement-type-like notation: {v : τ | φ}. This construct matches any string that is semantically of type τ and that further satisfies the (optional) logical qualifier φ. For instance, going back to our example, recall that we need to pattern match strings that correspond to a European country. This can be expressed using the semantic regex {v : Country | v ∈ Europe}, which, for example, matches the strings "France", "Britain" and "North Netherlands", but fails on the strings "United States", "Korea", etc. Similarly, we can express the desired constraint on the artists' birth year using the semantic regex {v : Year | v ≤ 1899}, which matches strings that (a) correspond to a year and (b) whose value is less than or equal to 1899. Putting all of this together, our desired data extraction task can be accomplished using a single overall semantic regex that matches all strings of the form "X, Y, Z-W" where X is a name containing Thomas, Y is a European country, Z is a year before 1900, and W is any year.

Synthesizing Semantic Regexes
While semantic regexes provide a useful mechanism for information extraction, they can nonetheless be non-trivial for end-users to construct. Motivated by this problem, another key contribution of this paper is a new technique for synthesizing semantic regexes from a small number of positive and negative examples. We now illustrate how our technique can be used to automate the data extraction task for our running example. Suppose that the user describes the target data extraction task using the following positive and negative examples:

Positive Examples                              Negative Examples
John Thomas Young Gilroy, Britain, 1898-1985   Alma Thomas, United States, 1891-1978
Thomas Hudson, Britain, 1701-1779              Sandro Botticelli, Italy, 1470-1561
Thomas Couture, France, 1815-1879              Thomas Nölle, Germany, 1948-2020

Here, the positive examples correspond to the artist biographies that should be extracted, while the negative examples are those that should be ignored. In particular, the first negative example does not conform to the "European country" restriction; the second negative example does not contain "Thomas" in the artist's name; and the third one fails the criterion "born before the 20th century". We will now describe how our approach synthesizes the target semantic regex given only these examples.
At the heart of our learning approach lies the notion of a typed sketch, which captures the general syntactic structure of the target semantic regex. In addition, the holes (i.e., unknowns) in the sketch are annotated with types capturing commonalities in the positive examples. Returning to our running example, our synthesis approach generates an initial candidate sketch by querying a large language model (GPT-3) with user-provided positive examples. Suppose that GPT-3 returns a sketch whose unknown parts are written as typed holes {□ : τ}. Here, the symbol □ denotes an unknown expression, and the annotation τ indicates that any string matched by the completed hole should have a type that is a subtype of τ.
Starting with the GPT-3-synthesized sketch, our method decomposes the synthesis problem into multiple sub-problems, one for each hole in the sketch, and performs a type-directed search to complete each hole. For this example, our synthesis method infers a set of positive examples for each hole. Note that it is not possible to propagate negative examples to individual holes: for the overall regex to reject a negative example, it suffices for the synthesized regex for one hole to reject its corresponding substring, but we do not know a priori which one. In particular, for this example, it would not be accurate to treat "Alma Thomas", "Sandro Botticelli", and "Thomas Nölle" as negative examples for the first hole.
Given this decomposition, our approach tries to synthesize a regex r_i for each hole {□ : τ_i} such that (a) the type of r_i is a subtype of τ_i and (b) r_i matches all of its corresponding positive examples. For this example, our synthesis algorithm can immediately deduce that the sketch is incorrect since no subtype of Year can match the corresponding positive examples for the third hole.
To repair the sketch, our learning algorithm localizes parts of the sketch for which synthesis failed (in this case, Year) and synthesizes a different sketch for the failing part. In the next iteration, suppose that we consider a correct sketch. Our synthesis algorithm tries to independently find a completion of each hole that has the appropriate type and satisfies the corresponding decomposed positive examples. As before, the positive examples are used to prune the search space: for example, since the second hole must match the strings "Britain" and "France", the synthesizer can rule out completions such as {v : Country | v ∈ Asia} and {v : Country | v ∈ Asia ∧ ...}. Similarly, type information in the sketch is critical, enabling the synthesizer to avoid enumerating useless sub-programs. For instance, when synthesizing the last hole in the sketch, the synthesizer would not enumerate programs such as {v : Month | ...} ∪ {v : Date | ...}, since this regex can match strings that are not of type Year. It would, however, consider regexes of the form {v : Year | v ≤ ...}, as the strings matched by such a regex have a type that is a subtype of Year. After independently synthesizing each hole, the algorithm checks whether the resulting regex r rejects all negative examples and, if so, returns r as a solution. Otherwise, it generates a different regex by looking for a different completion for at least one of the holes.

SEMANTIC REGULAR EXPRESSIONS
In this section, we describe the syntax and semantics of our proposed semantic regular expression language. At a high level, semantic regexes combine standard regular expression operators with pre-trained neural networks that identify semantic types and provide knowledge about the world.
DSL Syntax. The syntax of our semantic string matching language is presented in Figure 3. A semantic regex r takes as input a string s and returns a boolean indicating whether there is a match. Semantic regexes include all the standard regular expression constructs, including constant strings c, character classes like letters and numbers (denoted cc), concatenation (·), complement (¬), union (∪), intersection (∩), and Kleene star (*). Additionally, the notation r{k1} denotes repetition of r exactly k1 times and r{k1, k2} denotes r repeated between k1 and k2 times. As is standard, r? indicates an optional occurrence of r, and r+ denotes one or more occurrences of r.
Fig. 3. Semantic string matching language. c is a constant string, cc is a character class (e.g., letters). τ_b is a built-in base type, and τ_s is an arbitrary base type in our type system. Also, integer constants range over Z, real constants over R, and attribute names over Attributes, where the set Attributes is type-dependent.
In addition to these standard regex constructs, Figure 3 includes two semantic pattern matching constructs, denoted as {v : f(τ_s)} and {v : f(τ_b) | φ}, where f is an (optional) built-in function, τ_b is a built-in type (Integer, Month, etc.) and τ_s is an arbitrary (user-defined) type. Note that the DSL does not place any restrictions on τ_s, so the user can provide any arbitrary string to define their own type. However, we only allow a logical qualifier φ to be used for built-in types.
In the most basic form, the construct {v : τ} matches strings that are semantically of type τ, where τ can either be a built-in or user-defined type. For example, {v : Place} matches any string that corresponds to a geographical location. The optional function f used in this construct allows refining the query result by performing additional semantics-preserving string processing. For example, {v : toUpper(Place)} matches any string that corresponds to a location name in upper case letters (e.g., "NEW YORK"). More generally, {v : f(τ)} matches a string s if s is equal to f(s′) where s′ is a string of type τ. As another example, {v : abbreviate[.](Place)} matches the strings "N.Y.", "S.F." etc. because the function abbreviate[c] abbreviates a string through initialism, using the character c as a separator.
When performing semantic pattern matching using built-in types τ_b, one can additionally use a logical qualifier φ. In particular, {v : τ_b | φ} matches those strings that are of type τ_b and additionally satisfy predicate φ. To check whether a string s satisfies φ, s is first parsed as an instance o of type τ_b and then checked for conformance against φ. Note that these semantics justify why logical qualifiers are only allowed with built-in types: because we need to parse the string as an instance of τ_b, there must be some built-in mechanism for deserializing the string, which only makes sense for pre-defined types. As an example, the semantic regex {v : Float | v < 0.1} matches strings that can be interpreted as a floating point number whose value is less than 0.1 (e.g., 0.0051). As another example, {v : toUpper(City) | v ∈ Europe} matches strings, such as "ROME", that (a) correspond to European cities and (b) are in upper case letters.
DSL Semantics. Figure 4 presents the formal semantics of our DSL for semantic string matching, where ⟦r⟧ denotes the set of all strings that r matches. (Semantics of functions are provided in the appendix.) Observe that the semantics of the DSL is parametrized by a helper function called SemanticType, which is implemented by a pre-trained neural network and which is used to check whether the type of a string s is τ. Hence, the construct {v : f(τ_s)} matches all strings s such that (a) s = f(s′) for some string s′, and (b) SemanticType(s′) = τ_s. Similarly, {v : f(τ_b) | φ} matches all strings s such that (a) s = f(s′) for some string s′, (b) s′ is an instance of built-in type τ_b, and (c) when s′ is parsed into an object o of type τ_b, o satisfies predicate φ.
Proc. ACM Program. Lang., Vol. 1, No. CONF, Article 1. Publication date: January 2018.

Example 3.1. The semantic regex {v : Date | v.month = 5} matches all strings that represent dates in May. In particular, any string matching a Date is first parsed into a datetime object and its month field is checked for being equal to 5. Examples of strings matched by this regex include "May 2023" and "2023-05-01".
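The parse-then-check semantics of Example 3.1 can be illustrated with a minimal Python sketch. Here a fixed list of date layouts stands in for the neural SemanticType oracle, and the names DATE_FORMATS, parse_date, match_refined and in_may are hypothetical; the sketch only shows the two-step evaluation of a qualifier over a built-in type.

```python
from datetime import datetime
from typing import Callable, Optional

# Assumed, illustrative list of accepted layouts; a real system would need
# a far more permissive notion of what counts as a Date.
DATE_FORMATS = ["%Y-%m-%d", "%B %Y", "%d/%m/%Y"]


def parse_date(s: str) -> Optional[datetime]:
    """Try to deserialize s as a date; return None if no layout applies."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    return None


def match_refined(s: str,
                  parse: Callable[[str], Optional[datetime]],
                  predicate: Callable[[datetime], bool]) -> bool:
    """Step 1: parse s into an object of the built-in type.
    Step 2: check the logical qualifier on the parsed object."""
    obj = parse(s)
    return obj is not None and predicate(obj)


def in_may(d: datetime) -> bool:
    # Plays the role of the qualifier v.month = 5.
    return d.month == 5


if __name__ == "__main__":
    for text in ["May 2023", "2023-05-01", "June 2023", "hello"]:
        print(text, match_refined(text, parse_date, in_may))
```

The need for a parser in step 1 is also why qualifiers are restricted to built-in types: a user-defined type has no deserializer to produce the object the predicate inspects.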

OVERVIEW OF THE TYPE SYSTEM
While our semantic regex DSL is not explicitly typed, our approach utilizes a type system to facilitate effective synthesis. In this section, we give an overview of the type system.

Type Syntax
The syntax of our type system is shown in Figure 5, where Any corresponds to the top element in the type system and CharSeq indicates any string without semantic meaning, such as "1a2b3c", ",.3d,." etc. The type Semantic(τ_s) indicates strings that can be interpreted as an instance of τ_s (e.g., Date).
In addition, the type Optional(τ) includes both ε (the empty string) as well as any string of type τ. Semantic types τ_s include both built-in types τ_b (e.g., Integer, Float, Date) as well as user-defined types τ_u. Hence, the type syntax is not fixed a priori and is parametrized over any user-defined types that occur in the program.

Subtyping
Our type system supports subtype polymorphism because there is a natural subtyping relation between many entities of interest. We formalize the subtyping relation in Figure 6 using the standard judgment ⊢ τ1 <: τ2, indicating that τ1 is a subtype of τ2. In Figure 6, the first three rules are straightforward and establish Any as the top element of the type system. The two rules for Optional are also standard: Optional-Width states that any type τ is a subtype of Optional(τ), and the other rule lifts the subtyping relation to optional types. Finally, the remaining rule handles subtyping between user-defined types. If the set of objects represented by τ1 is a subset of those represented by τ2, we have τ1 <: τ2. In practice, we perform this check by querying a semantic ontology (specifically, DBPedia [Bizer et al. 2009] in our implementation).
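As an illustration of the subtyping rules described above, the following Python sketch encodes only what the text states: Any is the top element, τ <: Optional(τ), subtyping lifts through Optional, and semantic types are compared through an ontology. The PARENT table is a toy stand-in for an ontology such as DBPedia, the tuple encoding of types is an assumption of this sketch, and transitivity is assumed rather than taken from Figure 6.

```python
# Types are encoded as "Any", "CharSeq", ("Semantic", name) or ("Optional", t).
# Toy ontology (assumed, illustrative only): child type -> parent type.
PARENT = {"City": "Place", "Country": "Place", "Company": "Organization"}


def base_subtype(a: str, b: str) -> bool:
    """Walk up the toy ontology from a and check whether b is reached."""
    while a != b and a in PARENT:
        a = PARENT[a]
    return a == b


def is_subtype(t1, t2) -> bool:
    if t2 == "Any":          # Any is the top element
        return True
    if t1 == t2:             # reflexivity
        return True
    if isinstance(t2, tuple) and t2[0] == "Optional":
        if isinstance(t1, tuple) and t1[0] == "Optional":
            return is_subtype(t1[1], t2[1])   # lift subtyping to optional types
        return is_subtype(t1, t2[1])          # Optional-Width plus transitivity
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == "Semantic" and t2[0] == "Semantic"):
        return base_subtype(t1[1], t2[1])     # ontology lookup
    return False


if __name__ == "__main__":
    city, place = ("Semantic", "City"), ("Semantic", "Place")
    print(is_subtype(city, place))
    print(is_subtype(city, ("Optional", place)))
    print(is_subtype(place, city))
```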

Typing Rules
We present the typing rules for assigning types to DSL terms in Figure 7. These rules derive judgments of the form ⊢ r : τ indicating that term r has type τ. Note that Figure 7 only shows a representative subset of the typing judgments; the full set is presented in the Appendix under supplementary materials.

[...] our implementation) and assign CharSeq if the oracle does not return a semantic type. Character classes only have semantic meaning for numbers, so we assign the Semantic(Number) type if the character is a number, and CharSeq otherwise.
Semantic matching. The MatchSem rules present the typing rules for the semantic matching construct. The type of the expression is identical to the type specified as part of the program syntax.
Union and intersection. The typing rules for union and intersection are presented in the Union and And rules, respectively. These rules utilize the ∨ and ∧ operators, which are defined in Figure 8. At a high level, the join and meet of two types are determined as the least upper bound (⊔) and the greatest lower bound (⊓), respectively, in the corresponding type lattice. However, there is a special case for the CharSeq type: Intuitively, taking the intersection of a semantic type τ and CharSeq further refines the objects of type τ by placing an additional syntactic restriction; hence, Semantic(τ) ∧ CharSeq is defined as Semantic(τ). In contrast, the join of Semantic(τ) and CharSeq is the top element Any, as expected.
Not and concatenation. The Not and Concat rules are two cases where specific types cannot be inferred. Even though the types of their arguments are known, the resulting type cannot be determined, resulting in an output type of Any.
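The join and meet operators used by the Union and And rules can be sketched in Python over a deliberately simplified lattice containing only Any, CharSeq and Semantic types. The toy ontology PARENT, the tuple encoding, and the choice to return None when two semantic types have no common subtype are assumptions of this sketch; Optional types and the exact definitions of Figure 8 are not modeled.

```python
# Types are encoded as "Any", "CharSeq" or ("Semantic", name).
# Toy ontology (assumed, illustrative only): child type -> parent type.
PARENT = {"City": "Place", "Country": "Place"}


def ancestors(name: str) -> list:
    """Return name followed by all of its ancestors in the toy ontology."""
    chain = [name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def join(t1, t2):
    """Least upper bound, used when typing a union."""
    if t1 == t2:
        return t1
    if "Any" in (t1, t2) or "CharSeq" in (t1, t2):
        return "Any"  # e.g. Semantic(t) joined with CharSeq is Any
    upper = set(ancestors(t2[1]))
    for a in ancestors(t1[1]):
        if a in upper:
            return ("Semantic", a)  # closest common ancestor
    return "Any"


def meet(t1, t2):
    """Greatest lower bound, used when typing an intersection."""
    if t1 == t2:
        return t1
    if t1 == "Any":
        return t2
    if t2 == "Any":
        return t1
    if t1 == "CharSeq":
        return t2  # special case: Semantic(t) meet CharSeq is Semantic(t)
    if t2 == "CharSeq":
        return t1
    if t2[1] in ancestors(t1[1]):
        return t1
    if t1[1] in ancestors(t2[1]):
        return t2
    return None  # assumption: no common subtype in this toy lattice


if __name__ == "__main__":
    city, country = ("Semantic", "City"), ("Semantic", "Country")
    print(join(city, country), join(city, "CharSeq"), meet(city, "CharSeq"))
```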

LEARNING SEMANTIC REGEXES FROM EXAMPLES
In this section, we describe our synthesis algorithm for solving the semantic string matching problem from examples. Our method involves two main steps: generating a typed sketch from the positive examples and completing the sketch using an enumerative search-based synthesizer. If sketch completion fails, our method refines the sketch and performs synthesis using the new sketch.
In the rest of this section, we first provide some preliminary information, then present our top-level learning algorithm, and then describe each of its key components.

Sketch Language
Our learning algorithm crucially relies on the notion of a typed sketch whose syntax is shown in Figure 9. At a high level, the sketch language extends our semantic regex DSL by allowing a "typed hole" (denoted {□ : τ}) which represents an arbitrary expression of type τ. Given a sketch S, we use the notation ⟦S⟧ to denote the set of all semantic regexes that can be obtained by completing holes in S by valid expressions of the corresponding type. Figure 9 also defines sketch semantics in terms of the space of all programs they represent.
Fig. 9. Sketch syntax and its semantics. Here f refers to any construct in the DSL defined in Figure 3.
[Figure 10: pseudocode of the top-level Synthesize procedure, with an outer loop "while HasMoreSketch(E+)" and an inner loop "while HasDecomp(S, E+)" that obtains Ψ ← GetNextDecomp(S, E+).]
Example 5.1. Consider the sketch {□ : Organization} · ".com", which represents the space of semantic regexes that match strings consisting of an organization name followed by the string constant ".com". Possible completions of this sketch include, but are not limited to, the following semantic regexes: (1) {v : Company} · ".com", (2) {v : Institution} · ".com", and (3) ({v : Institution} ∪ {v : Company}) · ".com".

Top-level algorithm
Our top-level algorithm is outlined in Figure 10. Given a set of positive examples E+ and a set of negative examples E−, Synthesize returns a semantic regex that accepts all positive examples and rejects all negative examples. At a high level, the algorithm repeatedly generates a new sketch using a large language model, then attempts to find a valid instantiation of that sketch, and continues this process until it finds a regex that is consistent with all user-provided examples. Intuitively, each candidate sketch serves as a possible generalization of the positive examples, and the goal of the synthesizer is to determine whether that sketch is a suitable generalization.
In more detail, the Synthesize procedure first calls GetNextSketch, which queries GPT-3 to produce a sketch S that is likely to satisfy the positive examples. Then, for a given sketch S, GetNextDecomp infers a decomposition Ψ, which is a mapping from each hole in S to a set of positive examples for that hole. Then, for a given decomposition Ψ, the algorithm calls SynthesizeFromDecomp to perform compositional synthesis based on the inferred specification Ψ.
If the call to SynthesizeFromDecomp returns a non-empty mapping M, which maps each hole in S to a concrete regex r, the algorithm has found a solution that is consistent with the specification, and it returns the synthesized regex obtained by replacing the holes in S with the corresponding solutions in M. Otherwise, if the call to SynthesizeFromDecomp yields ⊥, there are two possibilities: Either the decomposition Ψ is incorrect (recall from Section 2 that there is ambiguity in how to assign positive examples to holes), or the sketch S itself is incorrect. In the former case, the algorithm considers a different decomposition, which maps at least one of the holes in the sketch to a different set of examples. If the algorithm exhausts all possible decompositions, this means that the sketch must be incorrect and the algorithm repairs the current sketch by performing fault localization and querying GPT-3 to produce a different generalization of the positive examples. This process continues until the algorithm finds a regex that is globally consistent with all (positive and negative) examples or runs out of possible sketches. In the following discussion, we explain each of the three components (decomposition, type-directed synthesis, and sketch repair) in more detail.
Fig. 11. Procedure for GetNextDecomp(S, E+). OverApprox(S) returns a concrete regex that overapproximates S. Merge returns ⊥ if one of its arguments is ⊥; otherwise it disjointly unions all its arguments.
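The control flow of the top-level procedure can be sketched in Python as two nested loops over sketches and decompositions. Everything below is a toy under stated assumptions: sketches are plain regex templates with a single placeholder named hole, toy_sketches replaces the GPT-3 query (and folds sketch repair into "try the next sketch"), and toy_complete tries a fixed list of ordinary regexes instead of type-directed synthesis.

```python
import re


def synthesize(pos, neg, sketches, decompositions, synthesize_from_decomp):
    """Return the first regex consistent with all examples, or None."""
    for sketch in sketches(pos):                  # next candidate sketch
        for psi in decompositions(sketch, pos):   # next decomposition
            result = synthesize_from_decomp(sketch, psi, neg)
            if result is not None:
                return result
    return None                                   # ran out of sketches


def toy_sketches(pos):
    # Hypothetical stand-in for the language model.
    yield r"{hole}-\d+"    # wrong generalization, rejected below
    yield r"{hole}\.com"


def toy_decomps(sketch, pos):
    # Assign to the single hole the substring it must cover in each example.
    pattern = re.compile(sketch.format(hole="(.*)"))
    examples = []
    for e in pos:
        m = pattern.fullmatch(e)
        if m is None:
            return         # infeasible sketch: no decomposition exists
        examples.append(m.group(1))
    yield examples


def toy_complete(sketch, psi, neg):
    for candidate in [r"[a-z0-9]+", r"[a-z]+"]:
        if all(re.fullmatch(candidate, s) for s in psi):
            regex = sketch.format(hole=candidate)
            if not any(re.fullmatch(regex, n) for n in neg):
                return regex
    return None


if __name__ == "__main__":
    print(synthesize(["acme.com", "initech.com"], ["acme2.com", "acme.org"],
                     toy_sketches, toy_decomps, toy_complete))
```

Note that negative examples are only consulted once a full candidate regex exists, mirroring the observation that they cannot be propagated to individual holes.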

Decomposing the Specification
To perform compositional synthesis, our learning algorithm decomposes the global specification into a set of specifications, one for each hole in the sketch. In this section, we describe the GetNextDecomp procedure for specification decomposition using the inference rules in Figure 11, which derive judgments relating a sketch S, a set of positive examples E+, and a decomposition Ψ. The meaning of such a judgment is that, given positive examples E+, Ψ is a possible decomposition that maps each hole in the sketch to its corresponding positive examples. As mentioned earlier, the decomposition is, in general, not unique, so there can be multiple decompositions Ψ_1, ..., Ψ_n for a given sketch S.
We now explain the decomposition rules from Figure 11 in more detail. The first rule, labeled Sketch-Match, considers a program sketch S with top-level operator f (e.g., concatenation or intersection) and sub-sketches S_1, ..., S_n. To infer a specification for each hole in S, we first generate a regex r★ that over-approximates S (via the call to OverApprox). Intuitively, OverApprox generates a regex r★ such that for any r ∈ ⟦S⟧, r★ accepts every string that is accepted by r. Because our over-approximation approach is exactly the same as used in prior work [Chen et al. 2020; Lee et al. 2016], we do not formally present it, but the basic idea is to replace each hole that appears under an even (resp. odd) number of negation symbols by the regex .* (resp. ∅). This method guarantees that the resulting regex r★ will accept every string that is accepted by any instantiation of S. Furthermore, note that r★ is a standard regex without any semantic pattern matching constructs, as all holes have been replaced by either the universal regex or the empty set.
Next, once we generate the over-approximation r★, we infer positive examples for each sub-sketch S_1, ..., S_n used in S. To do so, for each positive example e, we use a standard regex matching tool to find a parse of e into the format f(e_1, ..., e_n) with corresponding sub-strings e_i for each sub-sketch S_i. After propagating each example e_i to the nested sketch S_i and recursively applying the inference rules, we obtain the decomposed specifications Ψ_1, ..., Ψ_n for each of the sub-sketches in S. These mappings are finally combined via the call to the Merge function, defined as follows:
Merge(Ψ_1, ..., Ψ_n) = ⊥ if Ψ_i = ⊥ for some i, and Ψ_1 ⊎ ... ⊎ Ψ_n otherwise,
where the notation ⊎ indicates disjoint union. The next rule, labeled Sketch-NoMatch, corresponds to an infeasible sketch or decomposition. Because every string accepted by some r ∈ ⟦S⟧ must also be accepted by the over-approximation r★, the algorithm yields ⊥ to indicate a failure when r★ fails to match at least one of the positive examples.
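The decomposition step can be illustrated with a small Python sketch restricted to sketches that are flat concatenations of constants and holes, with no negation. Each hole is over-approximated by a named capture group matching .*, and the group contents become that hole's positive examples. The Hole class, the hole names h1 to h4, and the two example sketches are hypothetical; only one parse per example is produced here, whereas the procedure in the text may consider several.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class Hole:
    name: str  # hypothetical identifier for a typed hole


Sketch = List[Union[str, Hole]]  # concatenation of constants and holes


def over_approx(sketch: Sketch) -> str:
    """Replace every hole by a capture group matching any string."""
    return "".join(
        f"(?P<{p.name}>.*)" if isinstance(p, Hole) else re.escape(p)
        for p in sketch
    )


def decompose(sketch: Sketch, positives: List[str]) -> Optional[Dict[str, List[str]]]:
    """Map each hole to its substrings, or return None (playing the role
    of the failure value) if the over-approximation rejects an example."""
    pattern = re.compile(over_approx(sketch))
    psi: Dict[str, List[str]] = {p.name: [] for p in sketch if isinstance(p, Hole)}
    for e in positives:
        m = pattern.fullmatch(e)
        if m is None:
            return None
        for name, sub in m.groupdict().items():
            psi[name].append(sub)
    return psi


POSITIVES = [
    "John Thomas Young Gilroy, Britain, 1898-1985",
    "Thomas Hudson, Britain, 1701-1779",
    "Thomas Couture, France, 1815-1879",
]

if __name__ == "__main__":
    three = [Hole("h1"), ", ", Hole("h2"), ", ", Hole("h3")]
    four = three + ["-", Hole("h4")]
    print(decompose(three, POSITIVES)["h3"])
    print(decompose(four, POSITIVES))
```

With the three-hole sketch, the last hole receives strings such as "1898-1985", which is the kind of evidence that lets a hole typed Year be declared infeasible; the four-hole sketch splits those strings into two years.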
The remaining rules correspond to the base cases of the recursive decomposition algorithm. The rules prefixed with Concrete consider the case where the sketch is a concrete regex r without a hole. Specifically, we check the feasibility of r by testing whether it matches all of the positive examples. If so, the sketch is feasible, and the algorithm returns the empty mapping ∅. Otherwise (the Concrete-Infeasible case), the algorithm returns ⊥ to indicate failure.
The final two rules correspond to base cases for a hole and utilize the fact that sketches are typed. In particular, given a hole of type τ, if there exists a positive example e ∈ E+ whose type is not τ, this indicates a conflict and the algorithm returns ⊥ in the Hole-Infeasible rule. Otherwise, in the Hole-Feasible rule, the constructed specification maps this hole to the input positive examples E+.
Example 5.2. Consider the positive examples from Section 2 and the following sketch: The over-approximation for this sketch is the following regex:
We conclude this subsection by stating the theorem about the soundness of decomposition:
Theorem 1. Consider the synthesis problem with positive examples E+. Let S be a candidate sketch and let M be a completion of S mapping each hole h_i in S to a semantic regex r_i. If M satisfies all positive examples E+, then there exists some Ψ ∈ GetNextDecomp(S, E+) such that every r_i satisfies Ψ[h_i].

Compositional Type-Directed Synthesis
Next, we explain our compositional learning technique for synthesizing a semantic regex for a given sketch and decomposed specification. This algorithm, called SynthesizeFromDecomp, is shown in Figure 12. Given a sketch S, specification Ψ, and negative examples E−, the recursive SynthesizeFromDecomp procedure lazily generates possible sketch completions until it finds a regex that is globally consistent with the top-level specification.
To perform synthesis for a given specification, the algorithm starts by choosing one of the holes ℎ in the sketch (line 2) and synthesizes a completion  for that hole only by calling GetNextCompletion at line 4.Then, the loop in lines 6-10 tries to find a completion for the remaining holes.In particular, in each iteration of the nested loop, the algorithm recursively calls SynthesizeFromDecomp to fill all remaining holes, assuming that ℎ is replaced by  .If synthesis fails (i.e.,  ≡ ⊥ at line 8), the algorithm moves on to a different completion of ℎ. return ⊥; The final missing piece for our sketch instantiation algorithm is the GetNextCompletion procedure shown in Figure 13 which performs synthesis for a single hole.At a high level, this algorithm performs top-down enumerative search and uses a combination of types [Frankle et al. 2016;Osera and Zdancewic 2015;Polikarpova et al. 2016] and observational equivalence [Morris 1968] to prune the search space.As standard in top-down search, this algorithm utilizes the notion of partial programs [Feng et al. 2018[Feng et al. , 2017]], which can be thought of as an abstract-syntax tree where some of the nodes are labeled with non-terminals to be expanded later.
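The backtracking structure of SynthesizeFromDecomp described above can be sketched in a few lines of Python. This is an illustrative approximation rather than the procedure of Figure 12: the candidate lists, the fixed sketch shape `h1-h2`, and all function names are assumptions made for the example.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

Completion = Dict[str, str]


def synthesize_from_decomp(
    holes: List[str],
    completions_for: Callable[[str], Iterable[str]],
    is_consistent: Callable[[Completion], bool],
    chosen: Optional[Completion] = None,
) -> Optional[Completion]:
    """Fill holes one at a time; backtrack when the remaining holes cannot be filled."""
    chosen = dict(chosen or {})
    if not holes:
        # All holes are filled: run the global check (including negatives).
        return chosen if is_consistent(chosen) else None
    hole, rest = holes[0], holes[1:]
    for candidate in completions_for(hole):  # consumed lazily
        result = synthesize_from_decomp(
            rest, completions_for, is_consistent, {**chosen, hole: candidate})
        if result is not None:
            return result
    return None  # corresponds to returning ⊥


# Hypothetical per-hole candidates, standing in for GetNextCompletion.
CANDIDATES = {"h1": [r"\d+", r"[1-9]\d*"], "h2": [r"[a-z]+"]}
POSITIVES, NEGATIVES = ["12-ab"], ["012-ab"]


def consistent(c: Completion) -> bool:
    regex = f"{c['h1']}-{c['h2']}"
    return (all(re.fullmatch(regex, s) for s in POSITIVES)
            and not any(re.fullmatch(regex, s) for s in NEGATIVES))


result = synthesize_from_decomp(
    ["h1", "h2"], lambda h: iter(CANDIDATES[h]), consistent)
print(result)
```

In this toy run, the first candidate for `h1` leads to a regex that accepts the negative example, so the search moves on to the next candidate for that hole.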
In more detail, the hole synthesis algorithm utilizes a worklist W, which is initialized to a partial program with a single node (lines 2-3). Each node in the partial program is annotated with a grammar symbol (in this case, the start symbol of the grammar) and its corresponding type (in this case, the type of the hole ℎ). Then, in each iteration of the loop in lines 4-16, the algorithm dequeues one of the partial programs in the worklist and processes it. If the partial program is complete (meaning that all nodes are labeled with terminal symbols), the algorithm performs the following checks:
(1) Type consistency: If the type of the program is not the type of the hole, the program clearly does not have the intended type and is rejected (line 7).
(2) Consistency with examples: If the program does not satisfy all positive examples E+, it does not satisfy the specification and is also rejected at line 7.
(3) Observational equivalence: If the program rejects the exact same set of strings as a program the algorithm has previously encountered, it is redundant to consider it, as it is observationally equivalent to another solution that has been rejected. Hence, the algorithm only yields the program as a solution if it is observationally different from every previously encountered solution (lines 8-9).
On the other hand, if the current partial program is incomplete (meaning it has at least one "open" node labeled with a non-terminal), the algorithm chooses one of the open nodes and expands it using the available productions in the grammar (line 11). In particular, given an open node labeled with a non-terminal, the Expand procedure considers each production for that non-terminal and adds new nodes, each annotated with a grammar symbol and its corresponding (inferred) type. However, because a resulting expansion may not necessarily be feasible, the algorithm performs two additional checks before adding the expansion to the worklist at line 16:
• Type-directed feasibility check: For each complete subprogram of the expansion, the algorithm checks if the actual type of the subprogram is a subtype of its annotated goal type (line 12). If this type feasibility check fails for any node, then the expansion is pruned from the search space, and none of its own expansions are considered.
• Feasibility check using over-approximation: Additionally, the algorithm constructs an over-approximating regular expression that accepts every string accepted by any completion of the expansion, using the same OverApprox procedure from Section 5.3. If this over-approximation fails to match one of the positive examples, the expansion is infeasible and is therefore pruned away at lines 14-15.
Otherwise, the expansion is added to the worklist, and the search process continues until a solution is found.
Theorem 2. Let R be the set of solutions returned by GetNextCompletion(τ_ℎ, E+, E★). We have:
• Soundness: Every r ∈ R is a solution to the hole synthesis problem, meaning that (1) r has type τ_ℎ and (2) r satisfies the examples E+.
• Completeness: If r ∉ R, then r is either not a solution or is observationally equivalent to some r′ ∈ R with respect to the strings E★.
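To make the worklist search concrete, the following Python sketch enumerates completions for a single hole over a toy grammar. It is a simplified stand-in for GetNextCompletion, not the procedure of Figure 13: the grammar, the size bound `max_size`, and all names are assumptions, and the type-based checks are omitted because this toy grammar is untyped. It does show the two example-driven ideas discussed above: pruning partial programs whose over-approximation rejects a positive example, and discarding complete programs that behave identically on E★.

```python
import re
from collections import deque
from typing import Iterator, List, Tuple

# Toy grammar (an assumption for illustration): R -> \d+ | [a-z]+ | - | R R
TERMINALS = [r"\d+", r"[a-z]+", "-"]
OPEN = "R"  # marks an open (non-terminal) node


def expand(partial: Tuple[str, ...]) -> Iterator[Tuple[str, ...]]:
    """Expand the left-most open node with every production."""
    i = partial.index(OPEN)
    for t in TERMINALS:
        yield partial[:i] + (t,) + partial[i + 1:]
    yield partial[:i] + (OPEN, OPEN) + partial[i + 1:]


def over_approx(partial: Tuple[str, ...]) -> str:
    """Relax every open node to .* so that all completions are covered."""
    return "".join(".*" if x == OPEN else x for x in partial)


def get_completions(positives: List[str], e_star: List[str],
                    max_size: int = 3) -> Iterator[str]:
    worklist = deque([(OPEN,)])
    seen_signatures = set()
    while worklist:
        p = worklist.popleft()
        if OPEN not in p:  # complete program
            regex = "".join(p)
            if not all(re.fullmatch(regex, s) for s in positives):
                continue
            # Observational equivalence: compare behaviour on E★ only.
            sig = frozenset(s for s in e_star if re.fullmatch(regex, s))
            if sig not in seen_signatures:
                seen_signatures.add(sig)
                yield regex
            continue
        for q in expand(p):
            if len(q) > max_size:
                continue  # bound the search space
            # Prune q if even its over-approximation rejects a positive example.
            if all(re.fullmatch(over_approx(q), s) for s in positives):
                worklist.append(q)


print(list(get_completions(["12-ab", "7-xyz"], ["12-34", "ab-cd"])))
```

The same complete program can be reached through more than one derivation in this toy grammar; the signature check ensures that it is yielded only once.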

Sketch Generation
In the final part of this section, we describe our technique for generating typed sketches from examples. In particular, we employ few-shot prompting and build our sketch generator on top of GPT-3 [Brown et al. 2020].
By showing LLMs a few examples of a task and then giving them a test example, we can have them perform that task on the test example via in-context learning, without retraining or fine-tuning the model's parameters. The user only needs to provide a few examples and invoke the model's next-word prediction capabilities (repeatedly taking the most likely next token under the model). To give a concrete example, consider the task of transforming numbers in strings to text, a task that GPT-3 has not specifically been trained on. Figure 14 shows a typical usage scenario of GPT-3 when performing such a task: here, line 1 provides the task description, lines 2-4 provide a few examples, line 5 is the query, and the output of the model is highlighted in red.
[Figure 15: The GetNextSketch procedure. Output: a new sketch that has not been generated so far.]
[Figure 16: Prompt used for sketch generation. Recoverable content:
    Positive examples:
    - (David J. Alexander), Marc Henri Sempere and Jocelyn Bulow
    - (Connie Wong), Sai Wong
    - (Amin Abughosh) and Joseph Abugosh and Abeer Elafifi
    Sketch:
    - \({??: Person}\) ((&|and|,) {??: Person})+
    …… (7 more examples)
    Summarize the structure of the following positive examples in the form of a regular expression sketch.
    Use {??: <semantic type>} to represent the unknown part of the sketch.
    Positive examples:
    - …
    Sketch:
    - ]
5.5.2 Querying LLM for Sketches. To obtain typed sketches, our approach prompts GPT-3 with suitable queries. As shown in Figure 15, the GetNextSketch procedure takes as input positive examples E+ and an optional infeasible sketch, which is used in later iterations of the algorithm for sketch repair. Initially, the algorithm starts by querying GPT-3 for a sketch using the GetSketch procedure, as illustrated in Figure 16. The prompt to GPT-3 contains a task description, a manually curated set of representative examples (in the form of a query and its desired output), and, finally, the prompt itself (lines 12-17 in Figure 16). The GetSketch procedure then attempts to parse the model's output into a typed sketch; however, there is no guarantee that the GPT-3 output will belong to our sketch grammar. Hence, if parsing fails, the GetSketch procedure keeps prompting GPT-3 for a new sketch until the model's output is parseable.
In future invocations of GetNextSketch, this procedure may be invoked with an infeasible sketch that needs to be repaired. Lines 8-11 of Figure 15 deal with this sketch repair aspect of the algorithm. Specifically, given the infeasible sketch and positive examples E+, LocateError produces a repair specification, which consists of a so-called meta-sketch S and a specification Ψ. A meta-sketch is like a sketch except that it contains untyped "meta-holes" that need to be instantiated with a typed sketch. The specification Ψ maps each meta-hole in S to a set of positive examples. Such a meta-sketch is instantiated into a regular sketch by querying GPT-3 via the GetSketch procedure for each of the meta-holes in S and its corresponding examples.
Finally, we turn our attention to the LocateError procedure, which is presented as inference rules in Figure 17. These rules derive judgments stating that (S, Ψ) is a repair specification for an infeasible sketch and examples E+. The fault localization rules in Figure 17 largely resemble the GetNextDecomp rules for decomposition in that they use over-approximations. We explain these rules in more detail below.
Sketch-Single-Fail. This rule applies to a sketch composed of several sub-sketches where (1) there is at least one positive example that is not matched by the over-approximation of the sketch (premise on the first line) and (2) only one of the sub-sketches is faulty. To determine whether condition (2) holds, this rule replaces the entire sub-sketch with a single hole and then checks whether the over-approximation of the resulting sketch can accept all positive examples. If so, it …
Suppose the synthesizer concludes that this sketch is infeasible because the string "1898-1985" cannot be identified as a year, and sends it as a failed sketch to the sketch generator. To repair this sketch, we follow the Sketch-Nested-Fail rule to recursively traverse each part of the sketch until we locate the faulty hole, {□ : Year}. We then gather the positive examples that should be matched by this hole, which are "1898-1985", "1701-1779" and "1815-1879", and replace the faulty typed hole with a new hole with no type (rule Hole-Fail). With the generated repair specification, we query GPT-3 to generate a new sketch for the faulty hole, and it returns a new sketch {□ : Year} • " − " • {□ : Year}.
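The check behind condition (2) of Sketch-Single-Fail can be approximated with ordinary regexes. The sketch below is an assumption-laden illustration, not the inference rules of Figure 17: a sketch is modelled as a flat list of regex fragments (each standing for the over-approximation of one sub-sketch), and the function name, fragments, and example strings are all hypothetical.

```python
import re
from typing import List


def locate_single_fault(parts: List[str], positives: List[str]) -> List[int]:
    """Indices i such that relaxing part i to a hole (.*) accepts all positives."""

    def accepts_all(ps: List[str]) -> bool:
        regex = "".join(ps)
        return all(re.fullmatch(regex, s) for s in positives)

    if accepts_all(parts):
        return []  # the sketch is not infeasible, so there is nothing to repair
    return [i for i in range(len(parts))
            if accepts_all(parts[:i] + [".*"] + parts[i + 1:])]


# Hypothetical sketch: a name, a space, and a single four-digit year in
# parentheses. The positives contain year ranges, so the last part is faulty.
parts = [r"[A-Za-z]+", " ", r"\(\d{4}\)"]
positives = ["Monet (1840-1926)", "Goya (1746-1828)"]
print(locate_single_fault(parts, positives))
```

Because a hole is relaxed to `.*`, this test is deliberately coarse; with less specific neighbouring fragments, more than one index may pass it.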

IMPLEMENTATION
We have implemented our synthesis algorithm in a new tool called Smore, written in Python. In this section, we provide implementation details about different components of Smore.
Implementation of the semantic matching construct. Our tool heavily relies on the use of GPT-3 to identify the semantic meanings of strings. Our few-shot prompt (following the discussion in Section 5.5) to accomplish this is shown in Figure 18. The input begins with a task description that asks the model to identify all possible substrings of a particular semantic type, and we instruct the model to return "none" if it does not find any. Following the task description, we provide 8 examples, each of which shows the structure of a query: the first line provides the string of interest, and the second line specifies the semantic type of interest. Furthermore, we provide sample outputs for each example in the expected output format.
Implementation of checking observational equivalence. In the GetNextCompletion procedure (Figure 13), we use the set E★ to prune out programs that are observationally equivalent to previously synthesized programs. In Figure 12, E★ corresponds to all substrings of the negative examples E−, but this set might contain too many strings in practice, leading to considerable overhead in the observational equivalence check. To address this issue, we only obtain the substrings of the negative examples that are relevant to the specific hole under consideration. Specifically, we identify the relevant substrings of the negative examples using the over-approximation of the sketch. If a negative example is already rejected by the over-approximation of the sketch, then every instantiation of the sketch also rejects it, and the example is therefore irrelevant. For those negative examples that are matched by the over-approximation, we identify substrings that might be matched by each hole of the sketch and use those to check observational equivalence. This strategy provides the full benefits of checking observational equivalence but significantly reduces overhead in some cases. We illustrate this discussion through the following example:
Example 6.1. Consider a synthesis task with the following positive and negative examples: Suppose that the generated sketch is {□ : Integer}[+]{□ : Integer}. Using the over-approximation (.*)[+](.*), we can already reject the negative example "7-12", so that negative example is not relevant for selecting different instantiations of the sketch. To find the rest of the relevant strings, notice that the over-approximation decomposes the first negative example by sending "1" to the first hole and "18" to the second hole (as the negative example for each of the holes). Following the same procedure, we obtain "1" and "2" as the relevant substrings for the first hole. Now, considering the two synthesized programs {v : Integer | v > 4} and {v : Integer | v > 5} for the first hole, we can safely conclude that these two programs are observationally equivalent with respect to the negative examples since both programs reject the same set of negative examples (specifically, examples "1+18" and "2+6").
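The equivalence check in Example 6.1 can be reproduced with a small Python sketch. The two integer predicates (greater than 4 and greater than 5) and the relevant substrings "1" and "2" come from the example; the third predicate, the variable name `v`, and the function names are hypothetical additions for illustration.

```python
from typing import Callable, FrozenSet, List, Tuple


def signature(pred: Callable[[int], bool], relevant: List[str]) -> FrozenSet[str]:
    """The subset of relevant substrings that a predicate accepts."""
    return frozenset(s for s in relevant if pred(int(s)))


def prune_equivalent(candidates: List[Tuple[str, Callable[[int], bool]]],
                     relevant: List[str]) -> List[str]:
    """Keep the first candidate of each observational equivalence class."""
    seen = set()
    kept = []
    for name, pred in candidates:
        sig = signature(pred, relevant)
        if sig not in seen:
            seen.add(sig)
            kept.append(name)
    return kept


relevant = ["1", "2"]  # relevant substrings for the first hole
candidates = [
    ("v > 4", lambda v: v > 4),
    ("v > 5", lambda v: v > 5),
    ("v > 1", lambda v: v > 1),  # hypothetical extra candidate
]
print(prune_equivalent(candidates, relevant))
```

Two predicates share a signature exactly when they accept, and hence also reject, the same relevant substrings, so only one representative per signature needs to be explored.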
[Table 1 (fragment): Exhibition; Dimension of item between 10 and 50 inches; Item that is associated with at least three categories.]
Ranking heuristic. Because there are often multiple semantic regexes that are consistent with the provided examples, it is important to use a ranking heuristic to choose between possible solutions. To this end, our method prioritizes sketches that maximize the number of type annotations, and it prefers decompositions that minimize the number of holes that are assigned empty strings as positive examples. Finally, when choosing between multiple regexes for a given hole, our algorithm prefers those with smaller ASTs, first ranked by height and then by the number of nodes.
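The AST-based part of the ranking heuristic (height first, then number of nodes) amounts to a lexicographic sort key. The following Python sketch shows one way to express it, assuming a hypothetical AST encoding as nested tuples of the form (label, child, ...); this encoding and the node labels are not taken from Smore.

```python
from typing import Tuple

# An AST node is a tuple (label, child, child, ...); leaves have no children.
Ast = tuple


def height(ast: Ast) -> int:
    return 1 + max((height(c) for c in ast[1:]), default=0)


def size(ast: Ast) -> int:
    return 1 + sum(size(c) for c in ast[1:])


def rank_key(ast: Ast) -> Tuple[int, int]:
    """Smaller is better: compare by height, break ties by node count."""
    return (height(ast), size(ast))


leaf = ("num",)
pair = ("concat", ("num",), ("num",))
triple = ("concat", ("num",), ("num",), ("num",))
nested = ("or", ("num",), ("concat", ("num",), ("num",)))

ranked = sorted([nested, triple, pair, leaf], key=rank_key)
print([rank_key(a) for a in ranked])
```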
Hyperparameters. The Smore system has a hyperparameter that controls the maximum depth of the synthesized programs for each hole, which is set to 4 by default. For GPT-3 hyperparameters, we set the temperature to 0 (corresponding to greedy inference) and the maximum length to 256.

EVALUATION
In this section, we describe the results of our experimental evaluation, which is designed to answer the following research questions:
• RQ1. How does our proposed data extraction approach compare against existing approaches?
• RQ2. How does our synthesis algorithm compare to relevant baselines?
• RQ3. How important are the different components of our synthesis algorithm for successfully solving these benchmarks?
• RQ4. Do semantic regexes help humans more effectively solve data extraction tasks compared to standard regexes?
Benchmarks. To answer these questions, we evaluate Smore on 50 data extraction tasks involving 10 different datasets, which cover a wide range of domains like sales, science, and art. These datasets contain many different string formats and involve a large variety of entities. Out of 50 tasks, 34 of the tasks require at least one built-in semantic type and 33 of the tasks require at least one custom semantic type. We consider an average of 5 data extraction tasks for each dataset and manually label a subset of the strings in each dataset as positive or negative for each task. Specifically, we use 6 of the manually labeled examples for training and the rest for testing. Table 1 describes some example tasks for each domain.
Experimental Setup. All of our experiments are conducted on a machine with an Apple M2 Max CPU and 32GB of physical memory, running the macOS 13.2.1 operating system. We run GPT-3 through the OpenAI API. For each task, we set the timeout to 60 seconds (excluding the time to query OpenAI).

Comparison with Other Automated Data Extraction Techniques
There are several techniques that can be used to automate data extraction tasks. To answer our first research question, we compare Smore against the following alternative data extraction approaches: FlashGPT, that can query GPT-3 in addition to performing syntactic transformations and pattern matching. For our third baseline, we also compare against FlashGPT by giving it positive and negative examples and then using it to synthesize a program in their DSL.
Main results. Our main results are summarized in Table 2. We evaluate each tool in terms of precision, recall, and F1 score on the test set as well as synthesis time and number of benchmarks solved. The P, R, and F1 columns report the precision, recall, and F1 score on the test set. Smore achieves the highest precision, recall, and F1 score among all the data extraction approaches. In particular, Smore outperforms the second-best approach, namely ChatGPT-Exec, by 22% in terms of F1 score. While ChatGPT-Exec and FlashGPT have fairly high recall, they have low precision. ChatGPT-Regex-Synth has similar precision to ChatGPT-Exec but has very low recall on the test set. Finally, FlashGPT and Smore are close in terms of recall, but Smore significantly outperforms FlashGPT in terms of precision (for benchmarks that both tools can synthesize within the time limit).
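For reference, the three metrics reported in Table 2 can be computed as in the following Python sketch. The function name and the toy sets are hypothetical; the only design decision worth noting is that F1 is reported as undefined (None) when precision or recall is 0 or cannot be computed.

```python
from typing import Optional, Set, Tuple


def precision_recall_f1(
    predicted: Set[str], gold: Set[str]
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Precision, recall, and F1 of extracted strings against labeled positives."""
    tp = len(predicted & gold)  # true positives
    precision = tp / len(predicted) if predicted else None
    recall = tp / len(gold) if gold else None
    if not precision or not recall:
        # F1 is left undefined when either metric is 0 or unavailable.
        return precision, recall, None
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: 3 strings extracted, 2 of them among the 4 labeled positives.
p, r, f1 = precision_recall_f1({"a", "b", "c"}, {"a", "b", "d", "e"})
print(p, r, f1)
```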
Next, the column labeled "# Finished" in Table 2 shows the number of tasks that each tool is able to solve. For Smore and FlashGPT, solving a benchmark means they were able to find a program consistent with the positive and negative examples within the 60-second time limit. Solving a benchmark for ChatGPT-Regex-Synth means finding a regex consistent with the examples within 10 iterations. Since ChatGPT-Exec does not perform synthesis, this column is not applicable to it. Among all the synthesis-based approaches, Smore terminates for 48 out of 50 tasks, which is around twice as many as ChatGPT-Regex-Synth and around 3 times as many as FlashGPT. Finally, the column labeled "Synth time" shows the synthesis time in seconds for FlashGPT and Smore. Since we exclude the time to query OpenAI from synthesis time (this only takes at most a few seconds), this column is not applicable to ChatGPT-Regex-Synth. As we can see from this column, the synthesis time of Smore is around 5 seconds, so it takes slightly longer than FlashGPT (which takes around 3 seconds) for the 14 tasks that both of the tools can solve. However, Smore is able to synthesize a program for three times as many tasks as FlashGPT.
Failure analysis for the baselines. To provide some insight into the shortcomings of existing approaches, we briefly discuss the failure cases of the baselines. As expected, ChatGPT-Regex-Synth struggles with tasks that are hard to represent as regular expressions, such as matching all businesses that are in California. Although FlashGPT combines neural and symbolic constructs, its neural component processes positive and negative examples rather than semantic types. In other words, the neural constructs directly query GPT with positive and negative examples rather than querying whether a string matches a certain type. As a result, it frequently generates trivial programs that directly invoke GPT with the training examples as input. Hence, it ultimately ends up sharing the same limitations as ChatGPT-Exec.
Failure analysis for Smore. We examined instances where Smore is unable to complete the synthesis task within the allotted time and found that it encounters difficulties in tasks that demand a higher level of granularity from semantic pattern matching. For example, consider a task that involves finding restaurant names containing a person's name. For the positive example "Alice Chinese Bistro", the entity matcher may fail to recognize "Alice" as a person's name, causing Smore to fail to synthesize a program consistent with all examples.

Comparison with Other Semantic Regex Synthesis Techniques
To answer our second research question, we compare the neural-guided synthesis algorithm of Smore against the following two purely-neural or purely-symbolic baselines:
• ChatGPT-Synth [OpenAI 2022]: To evaluate whether a purely neural synthesizer can solve these benchmarks, we use ChatGPT to create a synthesizer for semantic regexes. Specifically, our ChatGPT-Synth baseline queries ChatGPT to synthesize a semantic regex that matches all positive examples and rejects all negative examples. If the generated semantic regex is inconsistent with the examples, we query it again for a different one. We repeat this process up to 10 times, as done with our ChatGPT-Regex-Synth baseline in the previous subsection.
• Smore-NoSketch: To evaluate semantic regex synthesis without neural sketch generation, we create a variant of Smore that does not start with a sketch (i.e., it uses {□ : Any} as the sketch).
The results of this comparison are presented in Table 3. As we can see from the "# Finished" column, ChatGPT-Synth can synthesize a semantic regex consistent with the examples for only 6 of the 50 benchmarks within 10 iterations. On the other hand, Smore-NoSketch times out on the majority of benchmarks and only finds a consistent regex for 12 of the 50 benchmarks. Furthermore, for semantic regexes that both Smore-NoSketch and Smore can synthesize, Smore is significantly faster. When evaluated on the test data, among all tasks that can be solved by both Smore and ChatGPT-Synth, Smore achieves an F1 score of 0.94 versus 0.71, and, among tasks that can be solved by both Smore and Smore-NoSketch, Smore achieves an F1 score of 0.88 versus 0.84.

Ablation Study
In this section, we describe two ablation studies to assess the relative impact of different components of Smore: one evaluates the impact of the synthesis techniques proposed in Sections 5.2-5.4, and the other one evaluates the impact of generating sketches rather than concrete regexes.
Ablations of components of the synthesis techniques. To evaluate the effectiveness of the proposed synthesis techniques, we consider the following ablations:
• Smore-NoDecomp: A variant of Smore that does not perform compositional sketch completion. In particular, this variant does not infer positive examples for each hole.
• Smore-NoTypedHole: A variant of Smore that does not use typed sketches. That is, each hole in the sketch is annotated with type Any.
• Smore-NoLocateError: A variant of Smore that does not perform error localization for sketch repair. Instead, it queries GPT-3 for a new sketch through sampling.
• Smore-NoTypeSystem: A variant of Smore that does not perform type-directed synthesis.
The results of this ablation study are presented in Figure 19, which shows the number of benchmarks completed (x-axis) within the given time limit (y-axis). As we can see from the gap between the five lines, Smore is significantly faster than all other variants and achieves a speedup of 14× compared to the second-fastest variant, Smore-NoTypedHole. Hence, this ablation study shows that all algorithmic components proposed in this paper are important for speeding up the synthesis.
Ablations of sketch generation. To understand the significance of generating sketches as opposed to concrete semantic regexes, we introduce a new baseline named ChatGPT-Synth-Repair. This baseline extends the ChatGPT-Synth baseline from Section 7.2 with program repair. Specifically, it first generates a concrete program using ChatGPT (using a similar prompt as the ChatGPT-Synth baseline). If the generated program does not satisfy all the positive and negative examples provided, it then performs the error localization and repair strategies presented in Section 5.5.
The results of this ablation study are presented in Table 4. For clarity, we also include the results of ChatGPT-Synth and Smore from Section 7.2 to show the difference in evaluation results. This ablation leads to the following two observations:
• ChatGPT-Synth-Repair solves 6 more benchmarks compared to ChatGPT-Synth. This shows that our sketch repair technique can also be generalized to concrete program repair.
• Although ChatGPT-Synth-Repair exhibits superior performance over ChatGPT-Synth, it is not comparable to the performance of Smore, which leverages sketches for synthesis. This underscores the pivotal role sketches play in enhancing the tool's efficacy. Upon analysis, we find that ChatGPT-Synth-Repair is able to accurately locate the error when it does not generate the desired program. However, ChatGPT struggles to generate a new program that precisely separates positive from negative examples. In contrast, with Smore, since we produce sketches, ChatGPT only needs to generate segments of the program it is confident about, delegating the uncertain parts or those demanding intricate reasoning to the program synthesizer.

User Study
We conducted a user study to assess the efficacy of semantic regexes in aiding humans with data extraction tasks compared to standard regexes. We recruited 13 participants, consisting of 3 CS undergraduate students, 6 CS graduate students, and 4 professional software engineers who regularly use regexes in their work. We asked each participant to complete 4 data extraction tasks by writing a regex. The participants were given 5 minutes for each task and asked to write standard regexes for two randomly chosen tasks (out of the 4 total tasks) and semantic regexes for the other two. The four tasks used in the study are simplified versions of the benchmarks used in our evaluation; we intentionally simplified the tasks so that they are doable within 5 minutes.

Setup.
To conduct this user study, we developed a command-line interface for Smore. For each task, the interface initially displays the prompt for the task (including 3 positive and negative examples) and then asks the user to input their answer. The tool randomly determines whether the answer should be a standard or semantic regex and only accepts user answers in the correct format. Upon entering a regex, the interface evaluates it against the test set and informs the user of their regex's performance, allowing unlimited attempts to enter a new regex within the 5-minute time limit. The details of the user study protocol are provided in the supplementary material.
Results. We evaluate the quality of the regexes in terms of their F1 score on the test set. For each task, Figure 20 presents F1 scores for (a) manually-written standard regexes ("Manual-Regex"), (b) manually-written semantic regexes ("Manual-SemRegex"), and (c) semantic regexes generated automatically by Smore (the "Smore" column). Because some of the manually-written regexes have a precision or recall of 0, their F1 score is undefined. In Figure 20, we therefore only show the average F1 score across regexes for which the F1 score is defined.
As we can see from Figure 20, manually-written semantic regexes achieve a better overall F1 score (0.78) compared to standard regexes, for which the F1 score is 0.54. We ran a two-way ANOVA to find the most significant factor affecting the F1 score. In particular, we model the F1 score as the dependent variable and the type of tool and task as independent variables. The ANOVA shows that the "task" variable has a high p-value of 0.57, which indicates it does not have a significant impact on the F1 score. On the other hand, the "type of tool" variable has a low p-value of 0.003, suggesting that the type of tool used has a significant impact on user performance. This result indicates that participants are more effective at performing these types of data extraction tasks using semantic regexes than with standard regexes. Another interesting aspect of Figure 20 is that the semantic regexes learned by Smore seem to be even more effective than manually-written semantic regexes. In particular, for these four tasks, Smore learns regexes that achieve an overall F1 score of 0.92 compared to the F1 score (0.78) of manually-written semantic regexes. This result suggests that our proposed learning technique has the potential to improve productivity even for expert users who are generally comfortable with writing regexes.

RELATED WORK
In this section, we survey related work on program synthesis and data extraction.

Learning regexes from examples.
There is a large body of prior research on learning regular expressions from positive and negative examples [Alquezar and Sanfeliu 1994; Angluin 1987; Firoiu et al. 1998; Gold 1978; Parekh and Honavar 1996, 2001; Rivest and Schapire 1989]. Our work builds on existing works that prune partial programs by evaluating the examples with respect to over- and under-approximations [Chen et al. 2020; Lee et al. 2016; Ye et al. 2021]. In this work, we not only use the over-approximations for pruning but also for decomposing the synthesis tasks.
Information Extraction from Semi-Structured Data. Past work has investigated similar extraction tasks, particularly for extracting lists from web sources [Chen et al. 2021a; Lin et al. 2020; Pasupat and Liang 2014; Raza and Gulwani 2020], answering questions based on tables [Pasupat and Liang 2015], and general information extraction from tabular data [Le and Gulwani 2014; Wu et al. 2018]. Recent work has specifically employed LLMs to extract information from tables [Cheng et al. 2023] or raw text [Dunn et al. 2022]. Despite the prevalence of neural-based approaches that emphasize data semantics, our work uniquely targets the integration of both semantic and symbolic aspects of the data structure.
Neurosymbolic DSLs. Recent work has considered so-called neurosymbolic DSLs with both standard language constructs and neural components [Andreas et al. 2016a,b; Bastani et al. 2022; Chen et al. 2021a; Cheng et al. 2023; Gaunt et al. 2017; Huang et al. 2020b; Jiang et al. 2021; Shah et al. 2020; Valkov et al. 2018; Verbruggen et al. 2021]. Among these, most relevant to our approach are FlashGPT [Verbruggen et al. 2021] and Binder [Cheng et al. 2023]. FlashGPT augments the DSL used in FlashFill [Gulwani 2011] with semantic transformation operators that can be used to reason about the semantic properties of the input. However, FlashGPT relies on in-context examples and does not utilize explicit semantic types, which hinders its ability to reason about combined semantic and symbolic properties. On the other hand, Binder [Cheng et al. 2023] proposes a new program structure that extends programming languages, such as SQL, with a function that allows querying large language models (in particular, Codex). However, the constructs proposed in Binder focus mainly on SQL-related tasks and do not transfer well to the string-matching domain.
Program Synthesis Using LLMs. The growing interest in leveraging LLMs for program synthesis [Austin et al. 2021; Chen et al. 2021b; Cheng et al. 2023; Nijkamp et al. 2023; Zhou et al. 2023] stems from general-purpose models like ChatGPT and Codex demonstrating code generation capabilities from various specifications, including natural language and input-output examples. However, these models often generate code that violates syntactic and semantic rules due to their limited understanding of program syntax and semantics. To address this, several approaches [Jain et al. 2022; Poesia et al. 2022; Rahmani et al. 2021] integrate LLMs with symbolic methods like program analysis to improve code quality. In our work, we use LLMs to generate sketches and introduce a sketch repair technique to handle cases where the LLM fails to generate accurate sketches.
Compositional program synthesis. Various approaches have been proposed for compositional program synthesis [Bansal et al. 2023; Feser et al. 2015; Huang et al. 2020a 2015]. Among these works, both λ² [Feser et al. 2015] and FlashMeta [Polozov and Gulwani 2015] perform compositional PBE by inferring input-output examples for sub-programs using the inverse semantics. In another example, Raza et al. [Raza et al. 2015] rely on the natural language description to decompose the synthesis problems into smaller sub-problems. Furthermore, Zhang et al. [Zhang et al. 2021] decompose the synthesis task into simpler sub-problems in the domain of UDF-to-SQL translation using a dataflow graph. Our work differs from prior research by presenting a new decomposition strategy on a typed sketch in the context of synthesizing string-matching programs. While our decomposition approach helps reject incorrect programs using inferred positive examples, the full result must still be tested against the negative examples to ensure correctness.
Semantic Checks for String Matching. There has been prior work in combining string matching with semantic matching [Greenberg et al. 2022; Kozen 1997]; for example, Kleene algebra with tests (KATs) [Kozen 1997] combines Kleene and Boolean algebra. While our semantic matching construct can be conceptually viewed as a semantic guard for string matching, one key difference is that the predicate (i.e., the "test") part of the language in KATs is restricted to Boolean algebra, whereas our vocabulary of predicates is much richer, including function invocations and machine learning models. Furthermore, the intended application domains are quite different: our proposed semantic regexes are intended for textual data extraction, whereas KATs have traditionally been used in the context of verification.

CONCLUSION
We have presented Smore, a new synthesis-powered system for data extraction. The key idea behind Smore is the concept of semantic regexes, which augment the syntactic pattern matching capabilities of regexes with a semantic pattern matching construct of the form $\{v : \tau \mid \phi\}$, which matches strings that have entity type $\tau$ and that satisfy logical predicate $\phi$ when interpreted as an instance of $\tau$. As shown in our user study from Section 7.4, semantic regexes allow users to more easily perform data extraction tasks that are hard to do using standard regular expressions.
In addition to proposing semantic regexes, we have also described a learning algorithm that can synthesize semantic regexes from examples. Our synthesis algorithm is neural-guided and uses an LLM to generate a typed sketch where unknown parts of the regex have useful type annotations that are used to guide the search. Our synthesis algorithm is compositional and uses type-directed reasoning to find a completion of each hole in the sketch. Our evaluation shows that our proposed approach outperforms alternative data extraction techniques in terms of precision, recall, and $F_1$ score. Our evaluation also shows the advantages of combining neural-guided sketch generation with type-directed compositional synthesis in terms of synthesis time.
A PROOFS
Lemma 3. Let $r$ be a semantic regex of type $\tau$ and $s$ be an arbitrary string such that $\mathrm{SemanticType}(s) \neq \mathrm{CharSeq}$. If $r(s)$ evaluates to True, then $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau$.
Proof. We prove this lemma by structural induction on $r$.
Base Case 2: $r = c$, where $c$ is a character class. Here, we only focus on the case where $c$ is <Num>. Following the typing rule CC-Num, we know $\tau = \mathrm{Semantic}(\mathrm{Number})$. Since <Num> matches a number of length 1, we know $s$ has the semantic type Number. Therefore, we have $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau$.

Inductive case:
We show that all programs constructed using the programs in the inductive hypothesis also satisfy this lemma by considering all the possible top-level constructs in the grammar.
• $r = \neg r_1$. Using the typing rule Not, we know $r$ has the type Any. Since any string has a semantic type $\tau_s$ such that $\tau_s <: \mathrm{Any}$, we conclude $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau$.
• $r = r_1^*$. Using the typing rule Star-2 in Figure 22, we derive the type of $r$ to be Any. Since any string has type $\tau_s$ such that $\tau_s <: \mathrm{Any}$, we conclude $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau$.
• $r = r_1?$. Using the typing rule Optional, we know $r$ has the type $\mathrm{Optional}(\tau_1)$, where $\tau_1$ is the type of $r_1$. Using the inductive hypothesis, we know that $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau_1$.
• $r = r_1 \cup r_2$. From the typing rule Or, we know $r$ has the type $\tau_1 \vee \tau_2$, where $\tau_1$ is the type of $r_1$ and $\tau_2$ is the type of $r_2$. Using the semantics of the Or operator, we know that $s$ can be matched by either $r_1$ or $r_2$. If $s$ is matched by $r_1$, then using the inductive hypothesis, we know $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau_1$; if $s$ is matched by $r_2$, then using the inductive hypothesis, we know $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau_2$. Following the definition of type union, we can conclude that $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau_1 \vee \tau_2$.
• $r = r_1 \cap r_2$. From the typing rule And, we know $r$ has the type $\tau_1 \wedge \tau_2$, where $\tau_1$ is the type of $r_1$ and $\tau_2$ is the type of $r_2$. Using the semantics of the And operator, we know that $s$ can be matched by both $r_1$ and $r_2$. Using the inductive hypothesis, we then know $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau_1$ and $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau_2$. Following the definition of type intersection, we can conclude that $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau_1 \wedge \tau_2$.
• $r = r_1 \cdot r_2$. From the typing rule Concat, we know $r$ has the type Any. Since any string has a semantic type $\tau_s$ such that $\tau_s <: \mathrm{Any}$, we conclude that $\mathrm{Semantic}(\mathrm{SemanticType}(s)) <: \tau$.

□
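The typing rules invoked in the proof of Lemma 3 can be summarized by a small illustrative sketch. The tuple encoding of regexes and types and the function name `type_of` are assumptions made for this illustration and are not the paper's implementation. Only the cases named in the proof are covered, and the star case follows the Star-2 rule alone.

```python
# Hypothetical encoding (an assumption, not Smore's data structures):
#   regexes: ("num",), ("not", r), ("star", r), ("opt", r),
#            ("or", r1, r2), ("and", r1, r2), ("concat", r1, r2)
#   types:   "Any", ("Semantic", name), ("Optional", t),
#            ("Or", t1, t2), ("And", t1, t2)


def type_of(regex):
    """Assign a type to a regex following the rules cited in the proof."""
    tag = regex[0]
    if tag == "num":                       # rule CC-Num
        return ("Semantic", "Number")
    if tag in ("not", "star", "concat"):   # rules Not, Star-2, Concat
        return "Any"
    if tag == "opt":                       # rule Optional
        return ("Optional", type_of(regex[1]))
    if tag == "or":                        # rule Or: union of operand types
        return ("Or", type_of(regex[1]), type_of(regex[2]))
    if tag == "and":                       # rule And: intersection of operand types
        return ("And", type_of(regex[1]), type_of(regex[2]))
    raise ValueError(f"unsupported construct: {tag}")


example_type = type_of(("or", ("num",), ("star", ("num",))))
```

The sketch only mirrors how each top-level construct determines the type of the whole regex; the subtyping judgment itself is not modeled here.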
Lemma 4. Let $r$ be a semantic regex of type $\tau$ and $s$ be an arbitrary string with no semantic meaning (i.e., $\mathrm{SemanticType}(s) = \mathrm{CharSeq}$). If $r(s)$ evaluates to True, then $\mathrm{SemanticType}(s) <: \tau$.
Base Case 1: $S = \{\square : \tau\}$. Let $r$ be a completion of $S$ that satisfies $E^+$. Since $S$ is a single hole, we apply either the Hole-Feasible or the Hole-inFeasible rule for decomposition.
• If $\exists s \in E^+.\ \mathrm{SemanticType}(s) \not<: \tau$, then using Lemma 5, we know that there does not exist an $r$ that accepts $E^+$, which contradicts the assumption.
Base Case 2: $S = r$, where $r$ is a concrete regex. Since there is no hole in this sketch, and we know $r$ satisfies $E^+$, following the rule Concrete-Feasible, we obtain an empty decomposition, so this case is vacuously true.
Inductive case: We show that this theorem holds for all sketches that are constructed using the sketches in the inductive hypothesis. Let $r$ be a completion of $S = f(S_1, \cdots, S_n)$ that satisfies $E^+$. Also let $S^\star = \mathrm{OverApprox}(S)$. We prove this part of the theorem by dividing it into two cases:
• If $\mathrm{Match}(S^\star, E^+) \not\equiv \emptyset$, then following the definition of the over-approximation and assuming we have the precise inverse semantics of construct $f$, there must exist examples $E^+_i$ for each $S_i$, from which we obtain a set of decompositions $\Psi_i$ for each $S_i$.
- Assume that for each sketch $S_i$, we have a completion $r_i$ such that $r_i$ matches $E^+_i$. Then, following the inductive hypothesis, there exists a decomposition $\Psi_i$ such that every $r_i^j$ in $r_i$, where $r_i^j$ represents the regex for the $j$th hole, satisfies $\Psi_i[h_j]$. Once we obtain such a decomposition, we compose the decomposition $\Psi$ of $S$ by merging the decompositions $\Psi_i$ for each $S_i$ using Merge. Since $S$ is not a hole, all holes in $S$ are included in $\Psi$. Therefore, following the inductive hypothesis, the decomposition $\Psi$ we obtained has a completion $r$, where the part of $r$ that corresponds to $h_j$ satisfies $\Psi[h_j]$.
- Assume that there exists an $S_i$ with no completion $r_i$ such that $r_i$ matches any one of the possible $E^+_i$. Following this assumption, as well as the assumption that $E^+_i$ is derived from the precise inverse semantics of $f$, we know that we can never compose a program $r$ using the completions of $S_1, \ldots, S_n$ such that $r$ matches the full positive examples $E^+$. We can therefore conclude that $S$ also does not have a completion $r$ such that $r$ matches $E^+$, which contradicts the premise of the theorem.
• If $\mathrm{Match}(S^\star, E^+) \equiv \emptyset$, then since $S^\star$ is an over-approximation, there does not exist a concrete regex $r'$ in the language such that $r'$ matches all the positive examples in $E^+$, which also contradicts the premise of the theorem.

□
Lemma 7. Let $\tau_h$, $E^+$, $E^\star$ be the inputs to GetNextCompletion, and let $p$ be a partial program such that there exists some complete program $p'$ with the most precise type $\tau_{p'}$ that can be derived from $p$ such that $\tau_{p'} <: \tau_h$. If $p'$ can accept all the positive examples $E^+$, then GetNextCompletion will add $p$ to the worklist $W$.
Proof. We prove this lemma by induction on the number of terminals $m$ in the AST of program $p$.
Base Case: $m = 0$. The only such program $p_0$ with 0 terminals is a partial program with one hole that is annotated with the goal output type $\tau_h$. This program is added to $W$ on line 3 of Figure 13.
Inductive Hypothesis: Assume this lemma holds for all programs whose ASTs have at most $m$ terminals, where $m \geq 0$.
Inductive Case: Suppose $p_{m+1}$ has $m + 1$ terminals. Then there is some program $p_{m'}$ with $m' \leq m$ terminals and some production $\rho$ such that expanding $p_{m'}$ with $\rho$ produces $p_{m+1}$.
Since $p'$ can be derived from $p_{m+1}$, it can also be derived from $p_{m'}$. By the inductive hypothesis, $p_{m'}$ is added to $W$. Then at some point $p_{m'}$ will be dequeued from $W$ on line 5 of Figure 13. The Expand procedure on line 8 will identify $\rho$ as a possible production, and will expand $p_{m'}$ to $p_{m+1}$. Since $\tau_{p'} <: \tau_h$, then assuming the soundness of the type propagation rules, we know that for all nodes $n$ in $p'$, $\mathrm{TypeOf}(p'(n)) <: \mathrm{GoalType}(n)$. Since there exists a completion from $p_{m+1}$ to $p'$, we know $\forall n \in \mathrm{Nodes}(p_{m+1}).\ \mathrm{IsComplete}(p_{m+1}(n)) \wedge \vdash \mathrm{TypeOf}(p_{m+1}(n)) <: \mathrm{GoalType}(n)$, and therefore $p_{m+1}$ passes the check on line 12.
In addition, we also check if the over-approximation of $p_{m+1}$ can match all of $E^+$. Since $p_{m+1}$ will be instantiated to a program that accepts $E^+$, and the over-approximation of a partial program accepts $E^+$ if any instantiation of the partial program accepts $E^+$, $p_{m+1}$ passes the check on line 15 and is eventually added to the worklist $W$ on line 16. □
Theorem 8. Let $R$ be the set of solutions returned by GetNextCompletion($\tau_h$, $E^+$, $E^\star$). We have:
• Soundness: Every $p \in R$ is a solution to the hole synthesis problem, meaning (1) $p$ has type $\tau_h$ and (2) $p$ satisfies examples $E^+$.
• Completeness: If $p \notin R$, then $p$ is either not a solution or is observationally equivalent to some $p' \in R$ for strings in $E^\star$.
Proof. We first prove the soundness of the algorithm. To return a concrete program $p$, we check if $p$ has a type that subtypes $\tau_h$ on line 7 of Figure 13, which proves point (1). Furthermore, we also check if $p$ matches all the positive examples on line 7 of Figure 13, which proves point (2).
We now prove the completeness of the algorithm. By Lemma 7, any partial program $p$ that might expand to a solution (i.e., a program that (1) has type $\tau_h$ and (2) satisfies examples $E^+$) is added to the worklist $W$. Note that the termination criterion for GetNextCompletion is that the worklist $W$ is exhausted. Thus, $p$ will be dequeued on line 5 at some time during the synthesis procedure. On line 7, we first check to ensure that $p$ is indeed a correct solution, and then on line 8, we also check if $p$ is observationally equivalent to some $p' \in R$ with respect to the set $E^\star$. Since line 9 is the only place we return $p$, we obtain all the solutions that are (1) correct and (2) not observationally equivalent to other programs in the set when GetNextCompletion terminates. □
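The worklist search with observational-equivalence pruning that this proof reasons about can be sketched in a toy setting. The code below is an illustration under strong simplifying assumptions and is not the GetNextCompletion procedure of Figure 13: programs are plain Python regexes built by concatenating tokens, there is no type check or over-approximation pruning, and the names `get_completions`, `tokens`, and `probes` are hypothetical.

```python
import re
from collections import deque
from typing import List, Sequence


def get_completions(tokens: Sequence[str], positives: Sequence[str],
                    probes: Sequence[str], max_len: int = 2) -> List[str]:
    """Enumerate token sequences breadth-first and keep one solution per
    behavior on the probe strings (observational equivalence)."""
    worklist = deque([()])      # start from the empty partial program
    seen_signatures = set()     # behaviors already represented by a solution
    solutions = []
    while worklist:
        partial = worklist.popleft()
        if partial:
            pattern = re.compile("".join(partial))
            # Keep only candidates that accept every positive example.
            if all(pattern.fullmatch(s) for s in positives):
                signature = tuple(bool(pattern.fullmatch(s)) for s in probes)
                if signature not in seen_signatures:
                    seen_signatures.add(signature)
                    solutions.append(pattern.pattern)
        # Expand: append each token, up to the length bound.
        if len(partial) < max_len:
            for tok in tokens:
                worklist.append(partial + (tok,))
    return solutions


found = get_completions([r"\d", r"\d+", r"[a-z]"], ["12", "34"], ["5", "123", "a1"])
```

In this run, the candidate `\d+\d` is a correct solution but behaves like `\d\d+` on every probe string, so it is dropped, which mirrors the completeness statement: a missing program is either not a solution or indistinguishable from a returned one on the probe set.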

B ADDITIONAL SEMANTICS OF THE DSL
We provide the semantics of the string transformation part of the DSL in Figure 21.

D USER STUDY PROCEDURE
In this section, we describe our user-study protocol in more detail.
User study sessions. Our user study was completed in 13 sessions, one for each participant. The participants used the same laptop, with the tool installed, across all sessions.
Participant introduction. We started each user study session by first giving a general description of the task and the goal of the task. In particular, we asked them to complete 4 tasks using either standard regexes or semantic regexes. The specification for each task is a set of positive examples and a set of negative examples, and the goal is to write a generalizable program that differentiates positive examples from negative examples. In order to minimize the effect of knowledge transfer, we randomly determined whether a participant was first given a task using standard regexes or using semantic regexes.
Task selection. We randomly selected 4 tasks from all the tasks we have in the benchmark. To ensure the tasks can be finished within 5 minutes, we slightly simplified the tasks. The description of the tasks and the provided sample positive and negative examples are presented in Table 5.
Training. We started the training procedure by walking the user through a regex "cheatsheet" that contains the syntax and semantics for both standard regexes and semantic regexes, as well as some sample programs in each representation. After users were comfortable with both types of regexes, we demonstrated the workflow of a task by walking the user through the following training task:

Fig. 2. Dataset about pieces of art exhibited in a museum.

Fig. 6. Subtyping relations. $\gamma(\tau)$ is the concretization function denoting the set of objects represented by $\tau$.

Fig. 10. Top-level synthesis algorithm: procedure Synthesize($E^+$, $E^-$). Input: a set of positive examples $E^+$ and negative examples $E^-$. Output: a program that is consistent with the examples. Here, $S[\sigma]$ means replacing each hole $h \in S$ with $\sigma[h]$.

Fig. 14. Sample input for a few-shot string transformation to GPT-3; its output is highlighted in red.

Fig. 15. Sketch generation procedure GetNextSketch($E^+$, $S$). Input: a set of positive examples $E^+$ and an optional infeasible sketch $S$. Output: a new sketch that has not been generated so far. GetSketch($E^+$) prompts the neural model for a new sketch, as illustrated in Figure 16.

Fig. 16. GPT-3 input structure for generating a sketch for the semantic string matching task.

Example 5.3. Consider the positive examples from Section 2 and the following sketch: {□ : Name} · ", " · {□ : Country} · ", " · {□ : Year}

Fig. 18. GPT-3 input structure for identifying substrings of specific semantics. [New string] is a placeholder for the string we are querying about, and [Semantic Type] is the semantics we are asking the model to identify.

Fig. 21. Semantics of string transformation part of the DSL.

The following rules (until Trans) show the subtyping relation involving built-in semantic types. For example, according to these rules, Year, Month, and Day are all subtypes of the more generic Date type. The Trans rule states the transitivity of the subtyping relation and the Semantic rule lifts the subtyping relation to

Using our decomposition technique, we infer the following positive examples for each hole: {□ : Name}, {□ : Country}, {□ : Year}$_1$, {□ : Year}$_2$

Otherwise, it checks if the current solution (which is obtained by instantiating $S$ with $\sigma \cup [h \mapsto r]$) rejects all negative examples, and if so, returns this solution.

Procedure SynthesizeFromDecomp($S$, $\Psi$, $E^-$). Input: a sketch $S$, a specification $\Psi$, and a set of negative examples $E^-$. Output: a sketch completion consistent with all examples.

Table 1. Description of the sample tasks used in the evaluation.

Table 2. Evaluation results for Smore and data extraction baselines. P means precision and R means recall.

• ChatGPT-Regex-Synth [OpenAI 2022]: One way to automate data extraction is to synthesize standard regexes from positive and negative examples. To evaluate this approach, we use ChatGPT to synthesize standard regexes. If the synthesized regex rejects the positive examples or accepts the negative examples, we ask ChatGPT to synthesize a different regex for up to ten iterations.
• ChatGPT-Exec [OpenAI 2022]: Another way to automate data extraction is to directly use ChatGPT. To evaluate this approach, we provide ChatGPT with positive and negative examples and then query it about strings in the test set. Hence, this approach does not require synthesizing a program; instead, it invokes ChatGPT on every test example.
• FlashGPT [Verbruggen et al. 2021]: Recent work has proposed an extension of FlashFill, called

Table 3. Evaluation results for our tool and synthesis baselines. P means precision and R means recall.

Table 3 also shows that Smore outperforms both of these synthesizers in terms of $F_1$-score

Table 4. Evaluation results for our tool and the no-sketch variant. P means precision and R means recall.