Efficient Matching of Regular Expressions with Lookaround Assertions

Regular expressions have been extended with lookaround assertions, which are subdivided into lookahead and lookbehind assertions. These constructs are used to refine when a match for a pattern occurs in the input text based on the surrounding context. Current implementation techniques for lookaround involve backtracking search, which can give rise to running time that is super-linear in the length of input text. In this paper, we first consider a formal mathematical semantics for lookaround, which complements the commonly used operational understanding of lookaround in terms of a backtracking implementation. Our formal semantics allows us to establish several equational properties for simplifying lookaround assertions. Additionally, we propose a new algorithm for matching regular expressions with lookaround that has time complexity O(m · n), where m is the size of the regular expression and n is the length of the input text. The algorithm works by evaluating lookaround assertions in a bottom-up manner. Our algorithm makes use of a new notion of nondeterministic finite automata (NFAs), which we call oracle-NFAs. These automata are augmented with epsilon-transitions that are guarded by oracle queries that provide the truth values of lookaround assertions at every position in the text. We provide an implementation of our algorithm that incorporates three performance optimizations for reducing the work performed and memory used. We present an experimental comparison against PCRE and Java’s regex library, which are state-of-the-art regex engines that support lookaround assertions. Our experimental results show that, in contrast to PCRE and Java, our implementation does not suffer from super-linear running time and is several times faster.


INTRODUCTION
Since their introduction in the 1950s, regular expressions [Kleene 1956] and finite-state automata [Rabin and Scott 1959] have found applications in numerous domains to describe patterns over sequences. They have been used for the lexical analysis of programs [Johnson et al. 1968] during compilation, the search of words and patterns in text editors [Thompson 1968], and bibliographic search [Aho and Corasick 1975]. Regular patterns are also used in network security [Yu et al. 2006] to search for intrusion signatures in network traffic, in bioinformatics [Roy and Aluru 2016] for only need to perform a memory lookup in the DFA transition table for each input symbol. The problem with DFA-based implementations is that the size of the DFA can be exponential in the size of the regex in the worst case. NFAs, on the other hand, can be exponentially more succinct than DFAs. Moreover, every regular expression that uses only the classical regular combinators can be translated into an NFA whose state space is linear in the size of the expression. The problem with NFA-based algorithms, compared to DFA-based algorithms, is that they may need to perform Θ(m) computation steps for each input symbol.
The current state of affairs regarding the support of lookaround in regex engines is rather disappointing. Existing automata-based regex engines (grep, RE2, Hyperscan) do not support lookaround at all. While it is known that the membership problem for regular expressions with lookaround can be solved using finite-state automata (see, for example, [Morihata 2012] and [Miyazaki and Minamide 2019]), these automata are very large due to the succinctness of lookaround. A DFA of doubly exponential size is needed in the worst case (and therefore an NFA of exponential size). These observations suggest that simple solutions based on automata that encode the entire pattern will suffer from high complexity (with respect to the size of the pattern). Berglund et al. [2021] consider the construction of alternating finite automata (AFA) from regular expressions with lookahead assertions. In this construction, the number of states of the AFA is linear in the size of the regular expression. A consequence of this, which is not explicitly discussed in [Berglund et al. 2021], is that membership can be decided in O(m · n) time with a right-to-left pass over the input string that simulates the AFA execution "in reverse". This approach does not handle lookbehind assertions. Moreover, while the simulation of AFA execution can decide membership, it is not applicable to match extraction. This is because the states for lookaheads are not distinguished from the states for the "main" part of the regex, and therefore a disambiguation policy cannot be expressed. Several backtracking-based regex engines support general lookahead, but only very restricted forms of lookbehind. For example, the widely used PCRE library only supports bounded lookbehind, which can only refer to a bounded amount of "past" text. As we will see in Section 6, the use of lookaround in existing backtracking-based engines can easily trigger catastrophic backtracking. This means that there is currently no efficient implementation of lookaround in the context of regular pattern matching. This is the main problem that we tackle in this paper. Our approach is fully general in that we allow the arbitrary nesting of unrestricted lookahead and lookbehind assertions.
The key idea of our approach is that it is possible to decompose the overall computation for pattern matching with lookaround. First, we simplify the problem by making the assumption that all lookaround assertions can be resolved by oracles that know the truth values of both lookaheads and lookbehinds at each position in the text. Under this assumption, we show that it is possible to augment NFA-based algorithms with a special kind of ε-transition that is guarded by a query to an oracle. The second step of our approach is to replace the oracles by algorithms that compute all necessary truth values. This is done in a bottom-up manner. Consider a sub-pattern r′ of the overall pattern r whose top-level operator is a lookaround and which contains no other occurrence of lookaround. We compute whether r′ matches the input text at each position. After this computation is performed, the sub-pattern r′ can be replaced by an oracle query that uses the pre-computed truth values. By following this bottom-up "compute and replace" process, we ensure that the answers for each oracle query have already been computed and are therefore available to the algorithm.
We note that our algorithm is not streaming in the general case because it performs right-to-left passes over the input text to deal with lookahead assertions efficiently. If the regular expression contains no lookahead assertions, then the algorithm can deal with lookbehind in a streaming manner (single left-to-right pass over the input text).
Main contributions. We make the following contributions in this paper:
(1) We present a formal semantics for lookaround using a satisfaction relation that relates a string w, a location (interval) [i, j] within w, and a regular pattern r. We show that this is equivalent to an algebraic semantics that generalizes the classical language interpretation of regular expressions (without lookaround). This mathematical semantics complements existing definitions of the lookaround constructs that are operational, i.e., defined in terms of a backtracking matching algorithm.
(2) Using our formal semantics for lookaround, we prove that regular expressions with lookaround satisfy the equivalence properties of Kleene algebra [Kozen 1994]. Moreover, we establish a number of equivalences involving lookaround that can be used for simplifying patterns.
(3) We introduce the notion of regular expressions and ε-NFAs with oracle queries as a way to abstract away lookaround assertions. We call them oracle-regexes and oracle-NFAs respectively. We provide an algorithm for matching such oracle-regexes, which can be understood as a simulation of oracle-NFA semantics. The time complexity of this algorithm is O(m · n), where m is the size of the oracle-regex and n is the length of the input text.
(4) We propose a recursive algorithm for matching regular expressions with lookaround. This algorithm is based on the bottom-up decomposition approach described earlier and makes essential use of the algorithm for oracle-regex matching. Its time complexity is O(m · n), where m is the size of the regular expression and n is the length of the input text.
(5) We introduce three performance optimizations to the aforementioned algorithm: (i) common assertion elimination, (ii) one-pass unidirectional evaluation to reduce the memory footprint to O(m), and (iii) approximation to avoid the computation of some lookaround assertions.
(6) We provide an experimental evaluation of a Rust implementation of our algorithm against PCRE and Java's regex library, state-of-the-art regex engines that support lookaround. Our experiments show (i) that our performance optimizations provide a significant performance benefit, and (ii) that, in contrast to PCRE and Java, our implementation does not suffer from super-linear (in the length of the input text) time complexity. PCRE and Java have worse performance than our implementation over the workloads that we consider.
The current work presents the first tool for matching regular expressions with lookaround that provides strong worst-case complexity guarantees and has competitive performance against state-of-the-art regex engines.

SEMANTICS OF LOOKAROUND
In this section, we present a formal mathematical semantics for regular expressions with lookaround.
There is little prior work on the formal semantics of lookaround. The most common approach is to define it operationally: the meaning of lookaround is described by the backtracking algorithm that performs regex matching. This approach is, however, unsatisfactory because it conflates the specification (which should be simple and easily understandable) with the implementation (which can be very complex) and therefore does not allow one to formally prove the correctness of the matching algorithm. A notable exception to this is the semantic treatment of lookahead in [Miyazaki and Minamide 2019], where the authors use languages of string pairs to interpret regular expressions. These semantic objects are, however, insufficient for giving a semantics in the presence of both lookahead and lookbehind. We provide two equivalent semantic perspectives. The first one is logical and employs a ternary satisfaction relation, which relates a string w, a location [i, j] within the string w, and a regular expression r. The second one is algebraic and uses an algebra of "match-languages" to interpret the regular expressions. A match-language is a set of triples of the form (w, i, j), which specify a string w and a location [i, j] within it. For each regular construct, we define an associated semantic operation for the algebra of match-languages.
We also consider a natural notion of equivalence between regular expressions, which essentially amounts to equality of their denotations, and establish several useful equivalences. These include the equivalences given by Kleene algebra [Kozen 1994], as well as several other equivalences for simplifying lookaround assertions.
Definition 1 (Regular Expressions with Lookaround). Let Σ be an alphabet, and P be a set of decidable predicates over Σ, i.e., functions of type Σ → B, where B = {0, 1}. The set LReg(Σ) of regular expressions (regexes) with lookaround is defined by the following grammar: Expressions of the form (?> r) and (?≯ r) are called lookahead assertions. Similarly, expressions of the form (?< r) and (?≮ r) are called lookbehind assertions. Expressions of the form (?> r), (?≯ r), (?< r) and (?≮ r) are collectively referred to as lookaround assertions.
We write |w| to denote the length of a string w. The empty string (i.e., the string of length 0) is denoted by ε. For w ∈ Σ*, we will call a pair That is, ⟦r⟧ is the set of all strings that match r. Notice in Fig. 1 that lookaround assertions can only hold at locations of the form [i, i], i.e., locations of length 0. Such locations are essentially positions in the string. For this reason, we can think of lookaround assertions as holding (or not) at positions within the string.
A decomposition of a location [, ] (where  ≤ ) is a nonempty finite sequence of locations Proof.The left-to-right direction is shown by induction on  − .For the right-to-left direction, we argue by induction on the size  ≥ 1 of the decomposition.There is one important observation in the proof for the step case: if the last pair [ +1 ,  +1 ] of the decomposition is of length 0 (that is,  +1 =  +1 ), then it can be removed.□ Claim 3 serves as a sanity check for the semantic definition of Fig. 1 for Kleene star.Notice that in the definition of , [, ] |=  * we consider  matching only over locations of length at least 1 (we require that  <  in the location [, ]).Claim 3 establishes that this restriction is without loss of generality.The expression  1 = (?< Σ * Σ * ) has 2 matches in  at locations [6, 8] and [13, 15].Notice that an occurrence of  within  matches  1 only if it is preceded by an occurrence of the letter .Similarly, the expression  2 =  (?> Σ * Σ * ) has 2 matches in  at locations [2, 4] and [6,8].An occurrence of  within  matches  2 only if it is followed by an occurrence of the letter .Finally, the expression  3 = (?< Σ * Σ * ) (?> Σ * Σ * ) has 1 match in  at location [6,8].
Example 5. Suppose that a₁, a₂, . . ., aₙ are letters of the alphabet. The regular expression rᵢ = (?> Σ* aᵢ Σ*) asserts that there is some occurrence of aᵢ from the current position to the end of the string. Let us consider now the expression r = r₁ · r₂ · · · rₙ. For a string w, we have that w, [0, 0] |= r iff the string w contains at least one occurrence of each one of the letters a₁, a₂, . . ., aₙ. This example shows that lookarounds can be used to encode certain kinds of intersection: w, [0, 0] |= r iff w matches the regular expression (Σ* a₁ Σ*) ∩ (Σ* a₂ Σ*) ∩ · · · ∩ (Σ* aₙ Σ*), where ∩ is the intersection operation on regular expressions (which is, of course, interpreted as intersection on languages).
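The encoding of intersection in Example 5 can be tried out in any engine that supports lookahead. The following is a minimal sketch using Python's re module, under the assumption that the lookahead (?> Σ* a Σ*) of this paper is written in PCRE-style notation as (?=.*a), where the trailing Σ* is implicit. The helper name contains_all is illustrative and not part of the paper.

```python
import re


def contains_all(text, letters):
    """Check whether a concatenation of lookaheads, one per letter, holds at position 0.

    Each lookahead (?=.*a) asserts that the letter a occurs somewhere between
    position 0 and the end of the text.
    """
    # re.DOTALL lets '.' stand for the whole alphabet, including newline.
    pattern = "".join("(?=.*" + re.escape(a) + ")" for a in letters)
    # The pattern consumes no characters; match() tests it at position 0 only.
    return re.compile(pattern, re.DOTALL).match(text) is not None
```

The pattern consists only of assertions, so a successful match is the empty location [0, 0], which mirrors the statement w, [0, 0] |= r in the example.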
For a regular expression r, the notation rⁿ is an abbreviation for the concatenation r · r · · · r (n times). The notation r{n} is also commonly used to describe the repetition of r exactly n times.
Another example that shows the usefulness of lookbehind expressions is the extraction of an email address domain. Suppose σ = [0-9a-zA-Z] is a predicate that contains the alphanumeric characters. One can use the regular expression σ* @ σ* . σ* to match email addresses. To extract the domain of the email address, one can write the expression (?< σ* @) σ* . σ*. Interestingly, this regex is not allowed by the PCRE standard, which disallows lookbehinds that could extend over a location of unbounded length. The algorithm we will present later in Section 4 does not have this limitation.
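The restriction on unbounded lookbehind is not specific to PCRE: Python's re module also rejects it. The sketch below first checks that rejection and then illustrates, in a deliberately naive way, the idea of precomputing the truth value of the lookbehind at every position and consulting that table during matching. The names ALNUM, lookbehind_tape and extract_domain are hypothetical, and the tape is computed by quadratic brute force, not by the algorithm of Section 4.

```python
import re

ALNUM = "[0-9a-zA-Z]"


def unbounded_lookbehind_rejected():
    """Return True if Python's re refuses a lookbehind of unbounded width."""
    try:
        re.compile("(?<=" + ALNUM + "*@)" + ALNUM + "*")
    except re.error:
        return True
    return False


def lookbehind_tape(text):
    """tape[i] is True iff text[0:i] matches ALNUM* @ (the lookbehind holds at i)."""
    prefix = re.compile(ALNUM + "*@")
    # Naive: one full match per position (quadratic time overall).
    return [prefix.fullmatch(text, 0, i) is not None for i in range(len(text) + 1)]


def extract_domain(text):
    """Return the first ALNUM* . ALNUM* substring that starts where the lookbehind holds."""
    tape = lookbehind_tape(text)
    domain = re.compile(ALNUM + r"*\." + ALNUM + "*")
    for i, holds in enumerate(tape):
        if holds:
            m = domain.match(text, i)
            if m:
                return m.group(0)
    return None
```

For an input such as "alice@example.org", the tape is true only at the position right after the @ sign, so the domain pattern is attempted only there.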
Definition 7 (Algebraic Semantics of Lookaround).If  is a string over Σ and [, ] is a location in , then we call the triple (, , ) a (string) slice over Σ.We define which is the set of all string slices over Σ.A match-language over an alphabet Σ is a subset of Slices(Σ).For match-languages ,  we define the operations of concatenation •, Kleene iteration * and lookaround as follows: For a regular expression  , we define its match-language M ( ) as follows: If (, , ) ∈ M ( ), then we say that the slice (, , ) matches the expression  and also that  recognizes the slice (, , ).We say that the regular expressions  and  ′ are equivalent, and we write  ≡  ′ , if they have equal match-languages, i.e., M ( ) = M ( ′ ).
Lemma 8 (Match-Language).The following hold for the match-languages of regular expressions with lookaround: M () = 1 M , and for every predicate  and all regular expressions , .
The previous observations mean that our lookaround can express PCRE's lookaround and the other way round. Here is a summary of all equivalences: This discussion also establishes that our syntax is more economical than the syntax of PCRE, because the anchors ^ and $ can be expressed using our lookaround constructs.

Equational Properties of Lookaround
Informally, the following lemma says that regular expressions with lookaround satisfy the properties of Kleene algebra [Kozen 1994] for the equivalence relation ≡. We write r₁ ⊑ r₂ as an abbreviation for (w, [i, j] |= r₁ implies w, [i, j] |= r₂ for all strings w and locations [i, j] in w).
Lemma 10 (Equivalences for Regular Expressions).The following properties hold for all regular expressions ,  1 ,  2 ,  3 ∈ LReg(Σ): (1) form a Kleene algebra.This means that ∅, 1 M , ∪, • satisfy the axioms of idempotent semirings, and * additionally satisfies the four axioms of Kleene iteration from [Kozen 1994].As an example, let us consider the equivalence  + • * ≡  * .It can be immediately established by using the interpretation M of regular expressions in the Kleene algebra of match-languages: which holds because match-languages form a Kleene algebra.
(8) Positive and negative lookaheads cannot be matched together: (?>  ) • (?≯  ) ≡ ∅ (9) For predicates  1 and  2 : Proof.These properties can be proved in a straightforward manner using the formal semantics of lookaround expressions.To demonstrate, we prove the first one.
Suppose The intuition for property (5) regarding the flattening of lookarounds is that both expressions describe the requirement that both  and  have a match at location [, | |] (if we interpret them at position ).
We make some further observations about the simplification of lookaround assertions.We have previously stated that the PCRE expression (?=  ) is equivalent to (?>  • Σ * ) in our syntax.On the other hand, we can use equation ( 6) above to establish the equivalences The last expression is the same as the regex (?=  $) in PCRE notation.Thus, our syntax can be translated to PCRE notation by adding the $ anchor.The main observation is that this fact, which we already knew from Observation 9, is established here purely by equational reasoning, using the properties from Lemma 10 and Lemma 11.Note that (?>  • (?> )) cannot be simplified to (?>  • ).For example, (?>  • (?> )) cannot be true at any position  because  has to extend to the end of the string where (?> ) cannot hold.So, this regex cannot be equivalent to (?> ).
It may be possible to give a complete algebraic axiomatization of simple cases of lookaround (e.g., anchors and word boundaries) using the approach of Kleene algebra with extra equations [Grathwohl et al. 2014b; Kozen 1997; Kozen and Mamouras 2014; Mamouras 2015, 2017]. The axiomatization of general lookaround is more challenging, as it can encode some kinds of intersection.

ORACLES FOR LOOKAROUND ASSERTIONS
This section is a stepping stone towards the full algorithm for matching regular expressions with lookaround, which we will present in Section 4. Here, we will see how to match a regular expression over a string, assuming that the truth values of lookaround assertions at all positions can be obtained by querying oracles. To formalize this computation, we introduce a notion of regular expression that includes oracle queries instead of lookaround assertions. We call these "oracle-regexes" or "o-regexes" for brevity. Oracle-regexes are matched using a model of automata that is an extension of classical NFAs. We call these automata "oracle-NFAs" or ONFAs. ONFAs contain transitions that query the oracles. An oracle-transition is taken if and only if the oracle responds with "true", but it does not consume a character from the input string. So, we think of them as oracle-guarded ε-transitions. Matching for o-regexes and ONFAs is not defined w.r.t. plain strings over the input alphabet, because these strings lack information about the responses of the oracles. Instead, we define a semantics w.r.t. "oracle-strings" or "o-strings", which are pairs of the form ⟨w, β⟩, where w is a string over the alphabet and β is a sequence of all oracle responses for all positions 0, 1, . . ., |w|. Similar to classical NFAs, the simulation of an ONFA is done in a single left-to-right pass over the input. At each step, a single character and an oracle valuation (for the position right after the character) are consumed. Each step needs O(m) time, where m is the size of the ONFA.

Oracle Strings and Oracle Regular Expressions
Suppose  is a finite set of oracle names.In the later algorithmic development, we will be using natural numbers as oracles names, because they are convenient for indexing in arrays.A  -valuation is a function of type  → B, that is, a truth assignment for the oracle names  .We also use the notation B  for the set of  -valuations.Informally, an oracle-string is a string together with the responses for the oracle queries at every position.Suppose  =  0  1 . . . −1 and β =  0  1  2 . . .  .The character   has the oracle valuations  −1 and   right before and right after it respectively.
The concatenation operation on O (Σ,  ) needs to be defined carefully so that it matches the way we intend to use o-strings.The concatenation of two elements of O (Σ,  ) is only defined if the oracle valuations agree.Formally, suppose We extend this definition in a natural way to concatenation of sets of o-strings.Kleene iteration of an o-string (or a set of o-strings) is also defined in an analogous manner, respecting the agreement of oracle valuations at concatenation boundaries.
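To make the agreement condition concrete, here is a small sketch of o-string concatenation. It assumes the natural reading of the condition above: the last valuation of the left operand must coincide with the first valuation of the right operand, and the shared valuation is kept once in the result. The representation (a pair of a Python string and a list of valuations, each valuation a tuple of Booleans) and the name concat are choices made for this illustration only.

```python
def concat(left, right):
    """Concatenate two o-strings, or return None when the operation is undefined.

    An o-string is a pair (u, beta) where beta has len(u) + 1 valuations,
    one for each position 0, 1, ..., len(u).
    """
    (u, beta), (v, gamma) = left, right
    assert len(beta) == len(u) + 1 and len(gamma) == len(v) + 1
    if beta[-1] != gamma[0]:
        return None  # the valuations at the boundary position disagree
    # The boundary valuation is shared, so it appears once in the result.
    return (u + v, beta + gamma[1:])
```

With this representation the result again has one more valuation than characters, so the operation can be iterated.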
where  ∈ P is a predicate over Σ (character class),  ∈ {+, -} is the sign of an oracle query, and  ∈  is an oracle name.
Instead of lookaround assertions as in LReg(Σ), oracle regular expressions have queries of the form Q +  () (positive queries) and Q -  () (negative queries).Example 14 (O-Regexes).The o-regex  • Q +  (0) describes a pattern that includes the signature  and additionally has to satisfy a positive lookaround assertion right after it.The o-regex does not specify the assertion itself, it only contains a reference to an assertion that should be provided separately.The o-regex 1) describes the pattern ℎ, which additionally has to satisfy two negative lookaround assertions right before ℎ and right after .
Let  be an oracle valuation and Q   () be an oracle query.We say that  satisfies Q   (), and we write  |= Q   (), if ( = + and  [] = 1) or ( =and  [] = 0).Every expression  ∈ OReg(Σ,  ) denotes a language of oracle strings, i.e., a subset of O (Σ,  ), written as ⟦ ⟧.This is defined inductively as follows: for all  ∈ P,  ∈  , and ,  1 ,  2 ∈ OReg(Σ,  ).We also define the satisfaction relation as follows: In the definition above, we are using the o-string slicing operation introduced earlier.

Choosing appropriate oracle valuations
Later in this section, we will prove Lemma 21, which formalizes the connection between OReg and LReg.This is crucial for the efficient matching algorithm described in the following section.Suppose  ∈ LReg(Σ) and  ∈ Σ * .We say that  is a lookaround assertion of  if (1)  is a subexpression of  , (2)  is a lookaround assertion.We say that  is a maximal lookaround assertion of  if (1) it is a lookaround assertion of  , and (2) it does not occur underneath a lookaround operator in  .
The finite type {L2R, R2L} has two inhabitants that are used to indicate the direction of ONFA computation over an o-string. The element L2R (resp., R2L) indicates a left-to-right (resp., right-to-left) pass, which is used for computing lookbehind (resp., lookahead) assertions.
The definition of the "shallow decomposition" of a regex (LReg) that follows (Definition 16) is meant to separate the "main" part of the regex from the maximal lookaround assertions that it contains.The main part is expressed as an oracle-regex (OReg) that contains references to the maximal lookaround assertions.
The oracle-arity oarity( ) of a regular expression  , defined below in Definition 18, is the number of subterms that are lookarounds that are not subterms of lookarounds (i.e., the number of maximal lookarounds).Let  ∈ LReg(Σ) be a regular expression and (, , ) = shallow( ) be its shallow decomposition.Let  ∈ Σ * .We define the oracle matrix for  and , denoted by Mat(, ) : Vect(Vect(B)), to be a vector of  = oarity( ) Boolean tapes of length | | + 1 each.At the (, ) entry of the matrix Mat(, ), we note whether the -th expression matches at position  in the string .More formally, ) Let us use the word  =  considered in a previous example (Example 4) to illustrate the oracle matrix  = Mat(, ).As shown explicitly in the table below, In this case, the first expression is testing for a presence of an  in the prefix, and the second expression is looking for the presence of a  in the suffix.Proof.The proof is by induction on the regular expression.Let us consider the case of a positive lookahead assertion (?>  ).We have that oproj((? because ⟦Q +  (0)⟧ = {⟨, ⟩} with  (0) = 1.We leave the rest of the cases to the reader.□ Lemma 21 says that the problem of matching a regular expression  ∈ LReg(Σ) over a string  can be reduced to matching its oracle-projection oproj( ) ∈ OReg(Σ), assuming we have also computed the oracle matrix Mat(, ).This assumption means that the truth values of all oracle queries are available.
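As an illustration of the oracle matrix, the sketch below computes the two Boolean tapes for a pair of assertions of the kind discussed above: a lookbehind that tests for the presence of one letter in the prefix, and a lookahead that tests for the presence of another letter in the suffix. The concrete letters, the input string used in the usage note, and the function name oracle_matrix are assumptions made for this example. The lookbehind tape is filled by a left-to-right pass and the lookahead tape by a right-to-left pass.

```python
def oracle_matrix(w, behind_letter, ahead_letter):
    """Return two tapes of length len(w) + 1 for two simple assertions.

    Row 0: behind_letter occurs in the prefix w[0:i]  (lookbehind, L2R pass).
    Row 1: ahead_letter occurs in the suffix w[i:]    (lookahead, R2L pass).
    """
    n = len(w)
    behind = [False] * (n + 1)
    for i in range(1, n + 1):           # left-to-right
        behind[i] = behind[i - 1] or w[i - 1] == behind_letter
    ahead = [False] * (n + 1)
    for i in range(n - 1, -1, -1):      # right-to-left
        ahead[i] = ahead[i + 1] or w[i] == ahead_letter
    return [behind, ahead]
```

For the string "xaybz" with letters a and b, the first row becomes true from position 2 onwards and the second row is true up to position 3. A column of the matrix is the oracle valuation at the corresponding position.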

NFAs with Oracle Queries
Now we define a class of acceptors for subsets of O (Σ,  ). These behave like standard nondeterministic finite automata on the part of the o-string that only involves letters from Σ, but additionally have ε-transitions which are guarded by oracle queries. We will see that these acceptors can recognize the o-string languages that are expressed by oracle-regexes.
Definition 22 (Oracle-NFAs).Let Σ be an alphabet and P a set of predicates over Σ.Let  be a set of oracle names.An oracle-NFA (or ONFA) A over the alphabet Σ and oracle names  is a tuple (, Δ, ,  ), where  is a finite set of states,  ⊆  is a set of initial states,  ⊆  is a set of final states, and We write ONFA(Σ,  ) for the set of all oracle-NFAs over Σ and  .
A path in A is accepting if its first state is initial and its last state is final.The set of o-strings accepted by A is defined as ⟦A⟧ = {⟦⟧ |  is an accepting path in A}.That is, ⟦A⟧ is the union of the denotations of all accepting paths in A.
Proof.A variant of Thompson's construction [Thompson 1968] can be used to construct the desired ONFA.Predicates (character classes) in the regular expression would correspond to transitions in the ONFA that are labeled predicates, and oracle queries in the regular expression would correspond to oracle-guarded -transitions.Combinators like nondeterministic choice, concatenation and Kleene star can be handled in the usual manner.□ Let ,  ′ be states of an ONFA A and  be an oracle valuation.We say that  ′ is -reachable from  if there exists a path  0 →  0  1 →  1 • • • →  −1   in A such that (1)  =  0 and  ′ =   , (2) every   is either  or an oracle query, and (3)  |=   for every   that is an oracle query.
Fig. 2 shows an algorithm for matching an oracle regular expression by compiling it into an ONFA A and then simulating the execution of the ONFA. We consider both left-to-right and right-to-left matching, as this will be needed later in Section 4 for evaluating lookaround assertions. One important difference between ONFA execution and classical NFA execution is that ε-closure is not sufficient in the case of ONFAs. We have to consider ε-transitions that are either unguarded (similar to NFAs) or guarded by (positive or negative) oracle queries. In order to check which oracle-guarded ε-transitions are enabled, we have to use the oracle valuation for the current position. This is why both Initial and Next in Fig. 2 take an oracle valuation (of type Vect(B)) as an additional argument.
Vect(Vect(B)) β ← transpose( ) // it is not actually necessary to explicitly transpose // β is a sequence of length  + 1 containing   -valuations // ⟨, β ⟩ is an oracle-string over Σ and   We will continue now to prove the main correctness result for the Match algorithm of Fig. 2. Before we can prove this, we need to consider a semantic property of matching "in reverse".The reverse rev( ) of an oracle regular expression  is defined recursively as follows: Proof.The proof is by induction on  .For convenience, we use the following alternative characterization of the satisfaction relation:  Proof.Part (1) can be proved with similar arguments as those that justify the simulation of classical NFAs.The only difference is that we need to consider the current oracle valuation in order to see whether an oracle-guarded -transition is enabled or not.
For Part (2), the main observation is that the execution of Match(R2L, , , β) follows the same ONFA simulation steps as Match(L2R, rev(), rev(), rev( β)) and stores the output bits in reverse order. Let  be the output Boolean tape. From Part (1) we get that  [] =  (rev(), rev( β), [0, ], rev()) for every . From Lemma 24, we get that 
The matching algorithm of Fig. 2 proceeds in a single left-to-right (resp., right-to-left) pass over the input string when dir = L2R (resp., dir = R2L). It performs O(m) work per step, so the total running time is O(m · n), where m is the size of the o-regex and n is the length of the input text.
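The left-to-right simulation can be sketched as follows. This is an illustrative reconstruction and not the code of Fig. 2: the transition encoding (triples with labels ("char", predicate), ("eps",) and ("query", positive, name)), the representation of a valuation as a tuple indexed by oracle name, and the function names are all assumptions. The closure step follows unguarded ε-transitions as well as query transitions that are enabled under the valuation of the current position.

```python
from collections import defaultdict

EPS = ("eps",)


def enabled(label, beta):
    """An ε-transition is always enabled; a query is enabled iff beta agrees with its sign."""
    if label == EPS:
        return True
    if label[0] == "query":
        _, positive, name = label
        return beta[name] == positive
    return False  # character transitions are never part of the closure


def closure(states, out, beta):
    """States reachable without consuming input, under the valuation beta."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for label, dst in out[q]:
            if dst not in seen and enabled(label, beta):
                seen.add(dst)
                stack.append(dst)
    return seen


def match_l2r(transitions, initial, final, w, betas):
    """tape[i] is True iff the automaton accepts the prefix of the o-string ending at i."""
    out = defaultdict(list)
    for src, label, dst in transitions:
        out[src].append((label, dst))
    final = set(final)
    active = closure(initial, out, betas[0])
    tape = [bool(active & final)]
    for i, ch in enumerate(w):
        # Consume one character, then close under the valuation right after it.
        stepped = {dst for q in active for label, dst in out[q]
                   if label[0] == "char" and label[1](ch)}
        active = closure(stepped, out, betas[i + 1])
        tape.append(bool(active & final))
    return tape
```

Each step visits every transition a bounded number of times, which is where the O(m) work per input position comes from. A right-to-left pass can be obtained by running the same loop on the reversed automaton, string and valuation sequence and reversing the output tape.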

EFFICIENT MATCHING
Using the algorithm for oracle-regex matching from the previous section, we will now describe how regular expressions with lookaround can be efficiently matched. Our algorithm operates on the nested structure of LReg. If there are one or more levels of lookaround, our algorithm makes multiple forward or backward passes on the input string to extract the necessary information.
We have seen in Lemma 21 that by choosing appropriate oracle valuations, we can decide membership in expressions with lookaround, by converting them to oracle expressions. In the previous section, we saw that oracle regular expressions can be realized as ONFAs which behave similarly to standard NFAs but additionally have oracle-guarded ε-transitions. Our algorithm is expressed in terms of a recursive function (EvalAux in Fig. 3) which traverses the regular expression recursively. When a lookaround expression is found, the corresponding oracle tape is computed and the lookaround expression is replaced with an oracle query. Ultimately, the resulting o-regex and oracle tapes, which form an oracle matrix, are passed to an ONFA for matching. Since ONFAs are simulated by maintaining a set of active control states (similarly to NFAs), we are able to compute, for each position i, the truth value of w, [0, i] |= r by running the ONFA in a single left-to-right pass. These truth values are useful as they form an oracle tape that could be used in evaluating an ONFA for a larger subexpression. For lookahead expressions (?> r), the truth values of w, [i, |w|] |= r are required. To compute these values, the ONFA is executed in reverse.
Evaluation Algorithm. The overall algorithm for matching regular expressions with lookaround is shown in Fig. 3. The top-level function is Eval and it uses the auxiliary function EvalAux to recursively traverse the regular expression. EvalAux is similar to the shallow decomposition of Definition 16 and it computes both the oracle-projection and the oracle matrix. It takes four inputs: (1) a regular expression r ∈ LReg(Σ) to evaluate, (2) the input string w ∈ Σ*, Then, the ONFA A′ is simulated with a left-to-right pass over the input (work proportional to  ′ ). So, the total work performed is proportional to The only pre-processing performed by our algorithm happens in the Match procedure of Fig. 2 (see line 7). It involves Thompson-style constructions to obtain NFAs and ONFAs from oracle-regexes. These constructions can be performed in time O(m), where m is the size of the regex.
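The bottom-up "compute and replace" structure of the evaluation can be sketched in a few lines. The sketch below is not the algorithm of Fig. 3: in place of the ONFA simulation it uses a naive position-set matcher, so it does not achieve the O(m · n) bound. It only illustrates how each lookaround is turned into an oracle tape before the enclosing expression is matched. The tuple-based syntax tree, the tags ("look", direction, positive, body) and ("query", positive, index), and all function names are assumptions of this example.

```python
def cat(*rs):
    """Right-nested concatenation of one or more expressions."""
    out = rs[-1]
    for r in reversed(rs[:-1]):
        out = ("cat", r, out)
    return out


def ends(r, w, i, tapes):
    """Set of j such that the o-regex r matches w at the location [i, j]."""
    kind = r[0]
    if kind == "eps":
        return {i}
    if kind == "char":
        return {i + 1} if i < len(w) and r[1](w[i]) else set()
    if kind == "cat":
        return {k for j in ends(r[1], w, i, tapes) for k in ends(r[2], w, j, tapes)}
    if kind == "alt":
        return ends(r[1], w, i, tapes) | ends(r[2], w, i, tapes)
    if kind == "star":
        reached, frontier = {i}, {i}
        while frontier:  # only newly reached positions are expanded, so this terminates
            new = {k for j in frontier for k in ends(r[1], w, j, tapes)} - reached
            reached |= new
            frontier = new
        return reached
    if kind == "query":  # zero-width: consult the precomputed tape
        return {i} if tapes[r[2]][i] == r[1] else set()
    raise ValueError(kind)


def eval_aux(r, w, tapes):
    """Replace every lookaround in r by a query whose tape has already been computed."""
    kind = r[0]
    if kind in ("eps", "char"):
        return r
    if kind in ("cat", "alt"):
        return (kind, eval_aux(r[1], w, tapes), eval_aux(r[2], w, tapes))
    if kind == "star":
        return ("star", eval_aux(r[1], w, tapes))
    _, direction, positive, body = r          # kind == "look"
    inner = eval_aux(body, w, tapes)          # nested assertions are resolved first
    n = len(w)
    if direction == "ahead":                  # body must match the location [i, n]
        tape = [n in ends(inner, w, i, tapes) for i in range(n + 1)]
    else:                                     # "behind": body must match [0, i]
        reach = ends(inner, w, 0, tapes)
        tape = [i in reach for i in range(n + 1)]
    tapes.append(tape)
    return ("query", positive, len(tapes) - 1)


def matches(r, w):
    """Membership: does r match the whole string w?"""
    tapes = []
    o_regex = eval_aux(r, w, tapes)
    return len(w) in ends(o_regex, w, 0, tapes)
```

Because eval_aux recurses into the body of a lookaround before computing its tape, every query encountered by ends refers to a tape that already exists, which is the invariant that the bottom-up process is meant to guarantee.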
Extracting Matches. The literature on automata and formal languages generally focuses on the membership problem for regular expressions: given w ∈ Σ* and r ∈ LReg(Σ), is it the case that w, [0, |w|] |= r? To answer this question, we can simply look at the last element of Eval(L2R, w, r). However, regular expressions with lookaround are often used to specify additional constraints on the context in which a substring appears without capturing the context itself. For instance, telephone numbers can be written as three groups of digits separated by hyphens, where the first group is the area code. One might use the regular expression [0-9]{3}(?=-[0-9]{3}-[0-9]{4}) to extract the area code. For such a task, the match extraction problem is of more interest than the membership problem. A match for r in w is a pair [i, j] of indices with 0 ≤ i ≤ j ≤ |w| such that w, [i, j] |= r. The leftmost longest match is the longest out of the leftmost matches (it can be easily seen that it is unique). The computational problem of extracting matches (and sub-matches) has been considered before (see, e.g., the notes of Cox [2010]). The following two-step procedure uses the algorithm of Fig. 3 to efficiently extract the leftmost longest match for a given regular expression: (1) Find the smallest index i such that w, [i, |w|] |= r · Σ* using the output of Eval(R2L, w, r · Σ*).
We can also consider match extraction when a match other than the leftmost longest one is preferred.
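As a concrete illustration of match extraction with a lookahead assertion, the following sketch runs the area-code regex from the text with Python's `re` module. Note that `re` is a backtracking engine, so this only illustrates the extraction task and not the algorithm of Fig. 3. The helper name `leftmost_match` and the sample text are assumptions made for the example.

```python
import re

# Area-code pattern from the text: three digits that are followed by
# "-ddd-dddd". The lookahead checks the context without consuming it.
AREA_CODE = re.compile(r"[0-9]{3}(?=-[0-9]{3}-[0-9]{4})")

def leftmost_match(pattern, text):
    """Return the leftmost match as a pair (i, j) of indices, or None."""
    m = pattern.search(text)
    return None if m is None else (m.start(), m.end())

text = "office: 555-123-4567, ext 123"   # made-up sample input
span = leftmost_match(AREA_CODE, text)
area_code = text[span[0]:span[1]]        # the context "-123-4567" is not captured
```

Every match of this particular pattern has length 3, so the leftmost match returned by `re` is also the leftmost longest one. For patterns with matches of varying length, a backtracking engine may return a leftmost match that is not the longest.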
Lookaround and Temporal Monitoring. The use of lookaround in regular expressions is reminiscent of the use of temporal connectives in temporal logic, which has found applications in runtime verification and online monitoring [Bartocci et al. 2018]. More specifically, lookahead (resp., lookbehind) is similar to future-time (resp., past-time) temporal connectives. The problem of (online or offline) temporal monitoring is analogous to the matching problem for regular expressions. It seems possible that the compositional regex matching algorithm of Fig. 3 can be combined with efficient and modular algorithms for temporal monitoring (see, e.g., [Chattopadhyay and Mamouras 2020; Dokhanchi et al. 2014; Maler et al. 2008; Mamouras et al. 2021a,b, 2023; Mamouras and Wang 2020; Thati and Roşu 2005]) in order to support more expressive temporal specification formalisms.

PERFORMANCE OPTIMIZATIONS
The algorithm of Fig. 3, presented in the previous section, provides strong worst-case performance guarantees. The upper bound O(m · n) for the running time is the same as the complexity of Thompson's algorithm [Thompson 1968], which only handles classical regular expressions (i.e., no lookaround). In order to provide a practical implementation, we introduce in this section three performance optimizations that can reduce both the amount of work and the memory needed for some regular expressions. We will see later in Section 6 through an experimental evaluation that these optimizations are significant in practice.

Common Assertion Elimination
In Definition 16, we introduced the concept of shallow decomposition of a regular expression r, which allows us to reduce the evaluation of r to the simulation of an ONFA, assuming that we have access to oracles that resolve the truth values of the lookaround assertions. Computationally, the algorithm of Fig. 3 performs a shallow decomposition with each invocation of Eval. In order to compute the oracle tapes, Eval is applied recursively whenever a lookaround assertion is encountered. The overall effect is a decomposition that goes deeper than what the definition of the shallow decomposition suggests. To make this precise, we introduce here the concept of a "deep decomposition", which has a close correspondence to the algorithm of Fig. 3.
The deep decomposition separates all lookaround assertions, regardless of whether they are maximal or not.This decomposition does not cause an increase in size, because oracle queries are used to refer to lookaround assertions at all levels.
Definition 28 (Deep Decomposition). For every index  ∈ N, we define the deep decomposition deep  ( * ) = ( * , , ), where (, , ) = deep  ( ) [...] The deep decomposition of a regex r ∈ LReg(Σ) gives us a sequence [r_0, r_1, ..., r_{k−1}] of k o-regexes in topological order with respect to the evaluation dependencies that they have. This means that they can be evaluated in the given order. The output tapes of earlier o-regexes are used as oracle tapes for later o-regexes. This is essentially a reformulation of the algorithm Eval of Fig. 3. [...] The advantage of the formulation of EvalDeep is that we can easily redefine deep in order to avoid the duplication of lookaround assertions. As an example, consider the regex r = (?=a(?<=c))(?=b(?<=c)) = (?> a (?< Σ* c) Σ*) · (?> b (?< Σ* c) Σ*).
The algorithm of Fig. 3 computes (?<=c) twice. We can avoid this duplication of work in EvalDeep by modifying the deep decomposition to only create a new o-regex when it encounters a new lookaround assertion. For the example r above, we would then have (, , ) = deep(r), where [...] We call this optimization common assertion elimination (similar to the common subexpression elimination used in compiler optimization).
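The effect of common assertion elimination can be sketched on a small syntax tree. The following is a minimal illustration and not the paper's Definition 28: the tuple-based representation, the tags `"chr"`, `"cat"`, `"alt"`, `"star"`, `"look"`, `"query"`, and the function name `deep` with its `dedupe` flag are all assumptions made for the example. Each lookaround assertion is replaced by an oracle query, and its body is appended to a list of o-regexes in dependency order.

```python
def deep(regex, dedupe=True):
    """Return (top, tapes). tapes[i] is (kind, o-regex) for oracle query i,
    listed in dependency order; top is the top-level o-regex."""
    tapes = []   # o-regexes of the lookaround assertions, dependencies first
    seen = {}    # lookaround assertion -> index of its oracle query

    def go(r):
        tag = r[0]
        if tag == "chr":
            return r
        if tag == "star":
            return ("star", go(r[1]))
        if tag in ("cat", "alt"):
            return (tag, go(r[1]), go(r[2]))
        # tag == "look": reuse the query of an assertion that was seen before
        if dedupe and r in seen:
            return ("query", seen[r])
        body = go(r[2])              # nested assertions get smaller indices
        tapes.append((r[1], body))
        seen[r] = len(tapes) - 1
        return ("query", seen[r])

    return go(regex), tapes

# The example (?=a(?<=c))(?=b(?<=c)) from the text.
look_c = ("look", "behind", ("chr", "c"))
example = ("cat",
           ("look", "ahead", ("cat", ("chr", "a"), look_c)),
           ("look", "ahead", ("cat", ("chr", "b"), look_c)))

top, tapes = deep(example)                       # (?<=c) is decomposed once
_, tapes_dup = deep(example, dedupe=False)       # (?<=c) is decomposed twice
```

With deduplication the example yields three o-regexes for assertions instead of four, and both lookaheads refer to the same oracle query for (?<=c).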

Improving the Memory Footprint
An important memory-saving optimization is enabled when all assertions are lookaheads or all of them are lookbehinds. When this holds, we say that the regular expression is unidirectional. In this case, we see in Fig. 3 and Fig. 4 that all ONFA simulations are performed in the same direction. For this reason, we do not need to store oracle tapes with intermediate outputs. Instead, we can pipe the output from an ONFA to be used by other ONFAs that depend on it.
This idea is implemented in the function EvalL2R of Fig. 5 for the case where all lookaround assertions are lookbehinds. The case where all assertions are lookaheads is completely symmetric. For every o-regex r_i of the deep decomposition, the algorithm simulates the corresponding ONFA A_i. At every step, the ONFAs are processed in topological order to ensure that each ONFA has the oracle valuation that it needs. The ONFA A for the top-level o-regex is always processed last, as it may need the output values from all other ONFAs.
The intuition for the algorithm of Fig. 5, when compared to the algorithm of Fig. 4, is that the evaluation of the output matrix proceeds column-by-column instead of row-by-row. Since only the most recent column of the matrix is needed for the next steps, we do not need to store the entire matrix. So, we store only the last column and we update it at every step. This reduces the memory footprint from O(m · n) to O(m).
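The difference between row-by-row and column-by-column evaluation can be illustrated with a toy sketch for the lookbehind-only case. This is not the algorithm of Fig. 5: each ONFA is replaced by a hand-written step function (a hypothetical stand-in), and all names below are assumptions. The sketch only shows that piping outputs in topological order gives the same result as storing full tapes, while keeping one Boolean per automaton.

```python
def lookbehind_c(state, symbol, column):
    # Stand-in for the ONFA of (?<=c): true right after reading a 'c'.
    return state, symbol == "c"

def top_level(state, symbol, column):
    # Stand-in for the ONFA of  Sigma* (?<=c) a.  The state remembers the
    # value of oracle 0 at the previous position.
    return column[0], state and symbol == "a"

STEPS = [lookbehind_c, top_level]   # topological order, top level last
INIT_STATES = [None, False]
INIT_COLUMN = [False, False]        # outputs at position 0

def eval_with_tapes(steps, init_states, init_column, text):
    """Row-by-row: one full tape per automaton, O(k * n) stored values."""
    tapes = []
    for i, step in enumerate(steps):
        state, tape = init_states[i], [init_column[i]]
        for p, symbol in enumerate(text):
            column = [t[p + 1] for t in tapes]   # earlier tapes, same position
            state, out = step(state, symbol, column)
            tape.append(out)
        tapes.append(tape)
    return tapes[-1][-1]

def eval_pipelined(steps, init_states, init_column, text):
    """Column-by-column: only the most recent column, O(k) stored values."""
    states, column = list(init_states), list(init_column)
    for symbol in text:
        for i, step in enumerate(steps):         # dependencies are updated first
            states[i], column[i] = step(states[i], symbol, column)
    return column[-1]

result = eval_pipelined(STEPS, INIT_STATES, INIT_COLUMN, "abca")
```

Both functions decide whether the whole input ends with an `a` that is immediately preceded by a `c`. The pipelined version relies on each step reading only entries of `column` that belong to automata earlier in the order.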

Approximation for Saving Work
We also consider an optimization where the computation of lookaround assertions can be avoided altogether when they are not necessary for producing the output. For example, consider a regex r that has the form of Σ*, followed by a fixed string, followed by a further subexpression, where r contains several lookaround assertions. If the input text contains no occurrence of that string, then it cannot contain any match for r. In this case, we do not have to compute any of the lookaround assertions, because their values are not needed at all. This idea is made more precise in the algorithm PreEval of Fig. 6. Given a regular expression r ∈ LReg(Σ), we first compute its oracle-projection, an o-regex whose oracle arity is oarity(r). We will attempt to compute the output without knowing the truth values of the oracle queries. In order to do this, we will approximate the ONFA A for the oracle-projection using two NFAs. The NFA A⊤ is obtained from A by replacing each oracle-guarded transition from a state q to a state q′ by an ε-transition from q to q′. So, it over-approximates A, that is, ⟦A⟧ ⊆ ⟦A⊤⟧. The NFA A⊥ is derived from A by removing all oracle-guarded transitions. So, it under-approximates A, that is, ⟦A⊥⟧ ⊆ ⟦A⟧. We examine cases: (1) A⊥ accepts: It must also be the case that A and A⊤ accept. [...]

[...] lookaround in Snort and Suricata, 96% and 97% respectively have lookaround depth 1 (which means that 4% and 3% respectively have nested lookaround assertions).
Effect of performance optimizations. Fig. 7 shows the performance of the basic version of our matching algorithm (called ours_base in the figure). This is the implementation of the algorithm of Fig. 3. The version that is called ours_opt in the figure also incorporates the optimizations described in Section 5: avoiding work duplication due to multiple occurrences of the same lookaround assertion, one-pass matching when the lookaround assertions are unidirectional, and the use of approximation to avoid the computation of some oracle truth values. Fig. 7 contains two plots, one for each regex dataset, namely Snort and Suricata. The horizontal axis of each plot shows the length of the input string. The vertical axis shows the average running time of the regex matching algorithm in milliseconds. The average is taken over the entire dataset of regexes. Each point is annotated with error bars that show the standard deviation of the running time (the error bars are too small to see). A crucial observation is that the running time of both versions of our algorithm is linear in the length of the input string. This behavior is consistent with the time complexity analysis of Theorem 27. The other observation is that the optimized version of our algorithm (ours_opt) is substantially faster than the basic version (ours_base). More specifically, the optimizations result in a speedup of at least 10× across all experiments of Fig. 7.
Comparison with PCRE and Java's regex engine. In Fig. 8 we include the performance of PCRE2 and Java's regex engine. The plots are similar to the ones of Fig. 7. One difference is that the vertical axis in the plots of Fig. 8 is log-scaled. Using a logarithmic scale for the running time is necessary due to the big difference in running time between our implementation and the other tools. Our first observation is that the running time of both PCRE2 and Java is super-linear with respect to the length of the input string. This is witnessed by the widening gap between the green and red curves (ours_base and ours_opt respectively) and the blue and purple curves (java and pcre respectively) as the string length grows. The ratio between pcre and ours_opt is at least 250× for text length 1000. It grows to at least 4000× for text length 10000. Similar observations can be made for Java. So, our regex engine is several orders of magnitude faster than PCRE and Java across all experiments of Fig. 8.
Microbenchmarks. We also consider microbenchmarks that focus on cases that do not trigger super-linear behavior for backtracking engines (PCRE and Java). First, we consider the family (r_k)_{k≥2} of regexes of the form r_k = s (?=s_1)(?=s_2)···(?=s_k), where s, s_1, s_2, ..., s_k are lookaround-free signatures. We also consider the family (r′_k)_{k≥2} of regexes of the form r′_k = s′ (?=(.{2})+ s_#)(?=(.{3})+ s_#)···(?=(.{k})+ s_#), where s_# is a signature that has the role of an "end-of-block" marker. The regex family (r′_k)_k is inspired by a regex family in Section 3.6 of [Miyazaki and Minamide 2019], which witnesses the doubly exponential lower bound for DFAs that encode regexes with lookahead. Finally, we define the regex family (r″_k) [...] The regex families r_k and r′_k use lookahead assertions in a way that encodes a form of intersection. They would pose a substantial challenge to algorithms that construct a single automaton through a product construction, as this would cause an exponential blowup in size. The regex family r″_k involves a nondeterministic choice over lookahead assertions. The regex families r_k, r′_k, r″_k correspond to the microbenchmarks called micro1, micro2 and micro3 respectively. Fig. 9 shows experimental results for the performance of our implementation, PCRE and Java's regex engine over these 3 microbenchmarks. The experiments use input text of length 10^6. The horizontal axis corresponds to the parameter k. The vertical axis shows the matching running time in milliseconds. All regex engines seem to have running time that is linear in the parameter k. Note that the running time of our implementation does not blow up because we do not construct large automata, as explained earlier in Section 4. The plot shown above (on the right) focuses on the performance of our implementation over all 3 microbenchmarks. Observe that the running time of our implementation is linear in k.
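To make the second family concrete, the following sketch generates patterns of that shape and checks them with Python's `re` module. It rests on several assumptions: the function name `intersection_family` is hypothetical, the prefix and the end-of-block marker are plain literal strings rather than general signatures, and non-capturing groups `(?:...)` are used in place of the capturing groups of the original form.

```python
import re

def intersection_family(k, prefix, end_marker):
    """Build  prefix (?=(.{2})+ marker)(?=(.{3})+ marker)...(?=(.{k})+ marker)."""
    marker = re.escape(end_marker)
    lookaheads = "".join(
        "(?=(?:.{%d})+%s)" % (j, marker) for j in range(2, k + 1)
    )
    return re.escape(prefix) + lookaheads

# With k = 3, a match needs a block whose length is a multiple of both 2 and 3
# between the prefix and the marker, which is the intersection of two conditions.
pattern = re.compile(intersection_family(3, "x", "#"))
hit = pattern.search("x" + "a" * 6 + "#")    # block of length 6
miss = pattern.search("x" + "a" * 4 + "#")   # block of length 4
```

Each lookahead imposes an independent condition on the same suffix of the text, so a single-automaton construction would have to track all of them at once.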
Experimental Setup & Measurement Methodology. The experiments were executed in Ubuntu 20.04 on a desktop computer equipped with an Intel Xeon W-2295 CPU (18 cores) and 64 GB of RAM. We used version 1.71.0 of the Rust compiler. The PCRE2 library was installed using the libpcre2-dev package through the usual repositories. At the time of executing the experiments, the 10.39-3ubuntu0.1 version of the package was used.
Each measurement of running time (for a matching algorithm that is given a regex and a string as input) is taken as the average of 10 trials.The uncertainty in the measurement is quantified using the standard deviation of the 10 trials.
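The measurement methodology can be sketched as a small timing harness. This is an illustration in Python and not the harness used for the experiments; the function name `measure`, the sample pattern, and the sample text are assumptions. The text does not state whether the sample or the population standard deviation is used, so the sketch uses the sample version (`statistics.stdev`).

```python
import re
import statistics
import time

def measure(run, trials=10):
    """Time run() `trials` times; return (mean, standard deviation) in ms."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)

# Hypothetical workload: one regex and one input string.
pattern = re.compile(r"[0-9]{3}(?=-[0-9]{3}-[0-9]{4})")
text = "x" * 10_000 + "555-123-4567"
mean_ms, std_ms = measure(lambda: pattern.search(text))
print(f"{mean_ms:.3f} ms +/- {std_ms:.3f} ms")
```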

RELATED WORK
Constructs similar to (positive and negative) lookahead assertions have been popular in the construction of parsers. Specifying a lookahead assertion in a parser can be used to reduce ambiguity (and thus limit backtracking, for backtracking implementations). For instance, parsers for context-free languages are often classified by the number of tokens the parser may need to peek ahead. We also see the use of lookahead in [Sakuma et al. 2012] where it is used to transform nondeterministic transducers into deterministic ones. Regular lookahead is used in the language Bex [Veanes 2015], which is used for specifying string transformations. The so-called "And-predicates" and "Not-predicates" in parsing expression grammars (PEGs) [Ford 2004] correspond to positive and negative lookahead assertions respectively. Miyazaki and Minamide [2021] have proposed extensions of context-free grammars with lookahead.
Lookaround assertions are often used to extract data that arise in specific contexts. The language CDuce [Benzaken et al. 2003] uses regular expression types to extract data from XML documents. The Kleenex language [Grathwohl et al. 2016] uses regular expressions as grammars (types) to describe string transductions that can extract data from streams. This involves a "greedy" disambiguation policy that generalizes greedy regex parsing [Frisch and Cardelli 2004; Grathwohl et al. 2013, 2014a; Nielsen and Henglein 2011].
The use of derivatives for matching regular languages is popular in functional and formally verified implementations. The simplest form is Brzozowski's derivatives [Brzozowski 1964], which lend themselves to a natural functional implementation of an implicit DFA of the underlying regular expression. Coquand and Siles [2011] present a formally verified framework for deciding equivalence of regular expressions based on Brzozowski derivatives. Brzozowski has shown that the number of derivatives is finite if they are simplified using associativity, idempotence and commutativity rules. Recent work [Egolf et al. 2022] shows how these optimizations could be incorporated in practice into a verified implementation. The size of a Brzozowski derivative can be large. Antimirov [1996] suggested using sets of partial derivatives for a more efficient algorithm. This is also related to the technique of prebases discussed by Mirkin [1966] (see also [Brzozowski 1971] and [Champarnaud and Ziadi 2001]). Partial derivatives have been used for formally verified implementations in [Komendantsky 2012] and [Moreira et al. 2012].
Doczkal et al. [2013] have developed a comprehensive formalization of regular languages in Coq which encompasses regular expressions, NFAs, DFAs, and the Myhill-Nerode Theorem. We see another NFA-based formalization in [Firsov and Uustalu 2013], where NFAs are simulated using their matrix representations in the Agda formalization. Morihata [2012] studies the translation of regular expressions with lookahead into DFAs of doubly exponential size. A treatment of lookahead using derivatives can be found in [Miyazaki and Minamide 2019]. A regular expression with lookahead is interpreted as a set of pairs (, ) of strings, where the first component is the matching string and the second component is the remaining string. A lookahead assertion is interpreted as a set of pairs of the form (, ) because it constrains the remaining string without consuming any string symbols. The finite-state automaton constructed using this derivative-based technique has a similar blow-up to the one considered by Morihata [2012]. The authors provide a lower bound argument showing that lookaround assertions can indeed cause a doubly exponential blow-up in some cases when converted to a DFA. Note that, while our algorithm runs in linear time, it is not a streaming algorithm, since it makes both forward and backward passes (in the case of lookbehind and lookahead assertions, respectively) on the input. Berglund et al. [2021] establish the semantics of lookarounds using alternating automata that can make forward or backward passes on the string. This definition is very close to the operational definition used by practitioners. However, alternating automata are a powerful model, and it is not easy to see how they can be simulated efficiently. Trofimovich [2020] suggests an implementation of regular expression matching (tool RE2C) using automata with tagged transitions and lookahead. The tags are markers which help extract sub-matches.
Moseley et al. [2023] consider a derivative-based approach for matching regular expressions with anchors, which are a very restricted form of lookaround assertions that only have a lookahead or lookbehind of at most one symbol. Bando et al. [2012] consider regular expressions with lookahead and lookbehind in the context of deep packet inspection in networks. They propose an FPGA-based implementation and estimate that around 25,000 regexes can be accommodated and a throughput of 34 Gbps can be achieved. Chida and Terauchi [2022] consider the expressiveness of regular expressions with lookaround and backreferences. They conclude that adding lookaround enhances the expressiveness of regular expressions with backreferences. This is in contrast to classical regular expressions (i.e., without backreferences), where adding lookaround assertions does not increase expressiveness.

CONCLUSION AND FUTURE WORK
We have proposed a formal semantics for regular expressions with lookaround. Many commonly used regex engines that support lookaround resort to backtracking search. Algorithms that are based on using one automaton for the entire pattern also seem to incur a non-trivial blow-up. Intuitively, this is because matching lookaround assertions requires additional contextual information about the remainder of the string. We have presented an algorithm that matches regexes with lookaround in time O(m · n), where m is the size of the regex and n is the length of the input string. This time complexity is the same as that of Thompson's algorithm for classical (i.e., lookaround-free) regular expressions. We see from our empirical evaluation that the implementation of our algorithm, which is augmented with some performance optimizations, has performance that is substantially better than the state-of-the-art PCRE and Java engines on the real workloads that we have considered.
A worthwhile direction for future work is the extension of our implementation with more advanced operators that are useful in practice. The incorporation of some optimizations for bounded repetition (see, for example, [Kong et al. 2022] and [Le Glaunec et al. 2023]) in our implementation seems to be feasible. Backreferences pose a challenge because they can give rise to nonregularity, but there are special cases (e.g., backreferences to bounded strings, as in the regex (?P<q>[a-z]{3})(?P=q)) that stay within the realm of regularity.

Fig. 2. Algorithm for matching oracle regular expressions using ONFA simulation.

Fig. 4. Algorithm for matching regular expressions with lookaround assertions.

Fig. 5. Algorithm for matching regular expressions with lookbehind-only assertions. A completely symmetric algorithm handles regular expressions with lookahead-only assertions.

Fig. 6. An approximate algorithm for matching regular expressions with lookaround assertions. If this algorithm indicates that the output is uncertain, then one of the previous algorithms has to be used.
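The three possible outcomes of the approximation of Fig. 6 can be sketched on a small automaton. This is a minimal illustration under stated assumptions and not the PreEval algorithm itself: the transition encoding, the function names `approximate`, `accepts`, and `pre_eval`, and the example automaton are all hypothetical. The "reject" and "uncertain" outcomes are inferred from the inclusions ⟦A⊥⟧ ⊆ ⟦A⟧ ⊆ ⟦A⊤⟧ and from the caption above.

```python
def approximate(transitions, keep_oracle):
    """keep_oracle=True builds A_top (guards become epsilon moves);
    keep_oracle=False builds A_bot (guarded transitions are removed)."""
    result = []
    for src, label, dst in transitions:
        if label[0] == "oracle":
            if keep_oracle:
                result.append((src, ("eps",), dst))
        else:
            result.append((src, label, dst))
    return result

def accepts(transitions, start, accept, text):
    """Standard NFA simulation with epsilon closure."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for src, label, dst in transitions:
                if src == q and label[0] == "eps" and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({start})
    for symbol in text:
        moved = {dst for src, label, dst in transitions
                 if src in current
                 and (label == ("any",) or label == ("sym", symbol))}
        current = closure(moved)
    return accept in current

def pre_eval(transitions, start, accept, text):
    """Return True or False when the answer is certain, None otherwise."""
    if accepts(approximate(transitions, False), start, accept, text):
        return True      # A_bot accepts, so A accepts for every oracle valuation
    if not accepts(approximate(transitions, True), start, accept, text):
        return False     # A_top rejects, so A rejects for every oracle valuation
    return None          # the oracle values are needed

# Hypothetical ONFA for  Sigma* a b Q0  |  Sigma* z z  (Q0 is an oracle query).
ONFA = [(0, ("any",), 0),
        (0, ("sym", "a"), 1), (1, ("sym", "b"), 2), (2, ("oracle", 0), 3),
        (0, ("sym", "z"), 4), (4, ("sym", "z"), 5), (5, ("eps",), 3)]

certain_yes = pre_eval(ONFA, 0, 3, "abzz")   # accepted without any oracle
certain_no = pre_eval(ONFA, 0, 3, "xxxx")    # no branch can reach acceptance
uncertain = pre_eval(ONFA, 0, 3, "xxab")     # depends on the value of Q0
```

Only the uncertain case requires falling back to an algorithm that computes the lookaround assertions.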
Fig. 7. Comparison between base algorithm and optimized algorithm.
Fig. 8. Comparison of our algorithms with PCRE and Java's regex engine.