Parikh’s Theorem Made Symbolic

Parikh’s Theorem is a fundamental result in automata theory with numerous applications in computer science. These include software verification (e.g. infinite-state verification, string constraints, and the theory of arrays), verification of cryptographic protocols (e.g. using Horn clauses modulo equational theories), and database querying (e.g. evaluating path queries in graph databases), among others. Parikh’s Theorem states that the letter-counting abstraction of a language recognized by finite automata or context-free grammars is definable in Linear Integer Arithmetic (a.k.a. Presburger Arithmetic). In fact, there is a linear-time algorithm computing existential Presburger formulas capturing such abstractions, which enables efficient analysis via SMT-solvers. Unfortunately, real-world applications typically require large alphabets (e.g. Unicode, containing over a million characters), which are well known to be unamenable to explicit treatment, or, even worse, infinite alphabets. Symbolic automata have proven in the last decade to be an effective algorithmic framework for handling large finite or even infinite alphabets. A symbolic automaton employs an effective boolean algebra, which offers a symbolic representation of character sets (i.e. in terms of predicates) and often lends itself to an exponentially more succinct representation of a language. Instead of letter counting, Parikh’s Theorem for symbolic automata amounts to counting the number of times different predicates are satisfied by an input sequence. Unfortunately, naively applying Parikh’s Theorem from classical automata theory to symbolic automata yields existential Presburger formulas of exponential size. In this paper, we provide a new construction of Parikh’s Theorem for symbolic automata and grammars which avoids this exponential blowup: our algorithm computes, in polynomial time, an existential formula over (quantifier-free) Presburger arithmetic and the base theory.
In fact, our algorithm extends to the model of parametric symbolic grammars, which are one of the most expressive models of languages over infinite alphabets. We have implemented our algorithm and show it can be used to solve string constraints that are difficult to solve by existing solvers.


Introduction
Parikh's Theorem [Parikh(1966)] (see also [Kozen(1997), Chapter H]) is a celebrated result in automata theory with far-reaching applications in computer science. These include software verification [Esparza and Ganty(2011), Hague and Lin(2011), Hague and Lin(2012)], decision procedures for array and string theories [Daca et al.(2016), Lin and Barceló(2016), Chen et al.(2020), Janku and Turonová(2019), Abdulla et al.(2019)], and evaluation and optimization of database queries [Barceló et al.(2012), David et al.(2012)], among others. Parikh's Theorem concerns the so-called letter-counting abstractions of strings and languages. For example, the Parikh image of the string abaacb is the mapping f : {a, b, c} → N, where f(a) = 3, f(b) = 2, and f(c) = 1. In other words, the Parikh mapping abstracts away the ordering from a string (resp. a set of strings), i.e. yielding a multiset (resp. a set of multisets). Parikh's Theorem states that the class of context-free languages and the class of regular languages coincide modulo a Parikh mapping, and are moreover expressible as formulas in Linear Integer Arithmetic (a.k.a. Presburger Arithmetic). This is illustrated in the following example.
Example 1.1. The Parikh image of the regular language L := (ab)* is the set S containing all mappings f : {a, b} → N with f(a) = f(b). Observe that S is also the Parikh image of the context-free language {a^n b^n : n ≥ 0}. The Parikh image of L can be expressed as x_a = x_b ∧ x_a ≥ 0, where x_a (resp. x_b) represents the count of the letter a (resp. b).
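The mapping can be made concrete with a short Python sketch (the helper name `parikh` is ours); it also checks that the two languages of the example agree modulo Parikh images on a few members:

```python
from collections import Counter

def parikh(word):
    """Parikh image of a word: a mapping letter -> count (order abstracted away)."""
    return Counter(word)

# The string from the introduction:
assert parikh("abaacb") == Counter({"a": 3, "b": 2, "c": 1})

# Strings from the regular language (ab)* and the context-free {a^n b^n : n >= 0}
# have the same Parikh images: all f with f(a) = f(b).
for n in range(5):
    assert parikh("ab" * n) == parikh("a" * n + "b" * n)
```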
Although the classical formulation of Parikh's Theorem concerns mainly the expressiveness of language models modulo taking Parikh images, its usefulness in applications was enabled only decades later by the development of efficient algorithms that compute an existential LIA formula (i.e. of the form ∃x.ϕ, where ϕ is quantifier-free) from a given automaton/grammar, enabling the exploitation of highly optimized SMT-solvers. In fact, building on the result by Esparza [Esparza(1997)], Verma et al. [Verma et al.(2005)] developed a linear-time algorithm that computes an existential LIA formula capturing the Parikh image of a given grammar. These results enabled the exploitation of Parikh's Theorem in many applications. Among others, these include verification of multithreaded programs with counters and possibly with (recursive) function calls [Hague and Lin(2011), Hague and Lin(2012), To(2009)], verification of concurrent and multithreaded programs [Esparza and Ganty(2011), Hague and Lin(2012)], verification of cryptographic protocols [Verma et al.(2005)], decision procedures for array theories [Daca et al.(2016)], decision procedures for string constraints [Lin and Barceló(2016), Chen et al.(2020), Janku and Turonová(2019), Abdulla et al.(2019)], query evaluation over graph databases [Barceló et al.(2012)], and reasoning over XML documents [David et al.(2012)]. The following two examples illustrate two simple applications of Parikh's Theorem to difficult problems.
Example 1.2. The problem of checking emptiness of the intersection of several context-free languages has immediate applications in static analysis of concurrent programs (e.g. see [Bouajjani et al.(2003)]). However, since the problem of checking emptiness of the intersection of two context-free languages is well known to be undecidable (e.g. see [Kozen(1997)]), multiple incomplete methods have been proposed, which include Parikh abstractions [Bouajjani et al.(2003)] and synthesis of regular separators/overapproximations [Gange et al.(2015), Gange et al.(2016), Long et al.(2012)], among others. Take the two context-free languages used in the benchmark of [Gange et al.(2016)]:

Example 1.3. String constraint solving has been an active research area (e.g. [Loring et al.(2019), Chen et al.(2022), Amadini et al.(2019)]). In this example, we deal with a simple string constraint which the authors of a recent paper [Blahoudek et al.(2023)] have found to lead to failure for all string solvers that they have tried. We want to find a solution (i.e. a mapping from the string variables x, y, z to strings over the alphabet Σ = {a, b}) satisfying all the restrictions. The first restriction is an equation zyx = xxz, which enforces the two different concatenations of the strings instantiating the variables to be equal. For example, the mapping λ, where λ : z → aa and λ : y, x → a, satisfies this restriction; whereas the mapping λ, where λ : z → a and λ : y, x → b, does not. Each of the other restrictions is a regular constraint, which enforces the solution for a variable to satisfy a certain regular pattern. For example, x ∈ a* enforces that the variable x be instantiated to a string consisting only of the letter a.
By using the letter-counting abstraction, we can easily show the above example to be unsatisfiable. For each l ∈ Σ, let |x|_l denote the number of times l appears in x. The letter-counting abstraction of the equation is the following quantifier-free LIA formula: ⋀_{l∈Σ} |x|_l + |y|_l + |z|_l = 2|x|_l + |z|_l, which can be simplified to ⋀_{l∈Σ} |y|_l = |x|_l. The letter-counting abstraction of the regular constraint x ∈ a* (resp. z ∈ b*) is |x|_a ≥ 0 ∧ |x|_b = 0 (resp. |z|_a = 0 ∧ |z|_b ≥ 0). Finally, the letter-counting abstraction of the regular constraint y ∈ a+b+ is |y|_a > 0 ∧ |y|_b > 0. Therefore, the letter-counting abstraction of ϕ is the conjunction of these quantifier-free LIA formulas. This conjunction is easily seen to be unsatisfiable since it asserts that 0 = |x|_b = |y|_b > 0. Furthermore, this can be easily checked by virtually all existing SMT-solvers which support LIA (e.g. Z3 [de Moura and Bjørner(2008)]).
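The unsatisfiability argument above can be replayed mechanically. The following Python sketch (our own encoding; a real toolchain would hand the formula to an SMT-solver such as Z3) encodes the abstraction as a predicate over the six letter counts and confirms by brute force over a small range that no assignment satisfies it:

```python
from itertools import product

def satisfies_abstraction(xa, xb, ya, yb, za, zb):
    """Quantifier-free LIA abstraction of the constraint from the example:
    zyx = xxz   ->  |y|_l = |x|_l for each letter l,
    x in a*     ->  |x|_a >= 0 and |x|_b = 0,
    z in b*     ->  |z|_a = 0 and |z|_b >= 0,
    y in a+b+   ->  |y|_a > 0 and |y|_b > 0."""
    return (ya == xa and yb == xb      # from the equation
            and xa >= 0 and xb == 0    # x in a*
            and za == 0 and zb >= 0    # z in b*
            and ya > 0 and yb > 0)     # y in a+ b+

# The formula forces 0 = |x|_b = |y|_b > 0, so it is unsatisfiable; a
# brute-force search over a small range illustrates the contradiction.
assert not any(satisfies_abstraction(*v) for v in product(range(6), repeat=6))
```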
Despite the usefulness of Parikh's Theorem, most real-world applications require either large finite or, even worse, infinite alphabets, which renders the classical Parikh's Theorem impractical. For example, consider a regex r over UTF-16, which has a total of 2^16 characters, of the form (PP)+ for the character class P of non-ASCII characters; r accepts non-empty strings over non-ASCII characters of even length. A direct application of the linear algorithm of Verma et al. [Verma et al.(2005)] would yield a LIA formula with at least 2^16 variables, each keeping track of the count of one letter in UTF-16.
Symbolic automata framework. The framework of symbolic automata [D'Antoni and Veanes(2021), D'Antoni and Veanes(2017), Veanes et al.(2012)] (a.k.a. automata modulo theories) has proven in the last decade to be a fruitful approach for handling large finite or even infinite alphabets.
The key to the framework is the symbolic representation of alphabets known as effective boolean algebras. Loosely speaking, an effective boolean algebra is a domain D with a class of monadic predicates (i.e. each has an interpretation as a subset of D) that is closed under boolean operations (set union, set intersection, and set complementation). The term "effective" refers to the fact that each monadic predicate P describes a syntactic property (e.g. a character class in Unicode, or a LIA formula with one free variable), and that checking whether the interpretation [[P]] ⊆ D is empty is decidable. Many examples of effective boolean algebras are available including, notably, SMT algebras. For example, the LIA boolean algebra consists of the domain D = Z and monadic predicates of LIA (with existential quantifiers allowed), e.g., P := x ≡ 0 (mod 2). The syntactic representation of predicates P and the decidability of emptiness checking can be taken advantage of by allowing an automaton transition of the form p →_P q, where p, q are automaton states, representing all (potentially infinitely many) transitions of the form p →_a q, where a satisfies P. Symbolic automata extend normal automata by allowing such transitions. An analogous representation of symbolic automata in terms of symbolic (regular) expressions [D'Antoni and Veanes(2021), Stanford et al.(2021)] is also possible, where predicates are allowed instead of concrete letters; e.g. the expression P+ represents the sequences of odd numbers, whenever P := x ≡ 1 (mod 2).
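The interface of an effective boolean algebra can be sketched in a few lines of Python (all names here are ours). Predicates over D = Z are closed under the boolean operations; the emptiness check is merely approximated by sampling in this sketch, whereas a real LIA algebra would decide it exactly with an SMT-solver:

```python
# Monadic predicates over the domain D = Z, represented as Python functions
# (all names here are ours, for illustration).
P = lambda x: x % 2 == 0            # x ≡ 0 (mod 2)
Q = lambda x: x > 10

def conj(p, q): return lambda x: p(x) and q(x)   # set intersection
def disj(p, q): return lambda x: p(x) or q(x)    # set union
def neg(p):     return lambda x: not p(x)        # set complementation

def nonempty(p, sample=range(-100, 100)):
    """Emptiness check: approximated here by sampling; a real LIA algebra
    would decide [[p]] = emptyset exactly, e.g. with an SMT-solver."""
    return any(p(x) for x in sample)

assert nonempty(conj(P, Q))           # even numbers above 10 exist
assert not nonempty(conj(P, neg(P)))  # [[P and not P]] is empty
```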
Most analyses of symbolic automata are known to be reducible to the case of normal automata, but with an exponential blow-up in the alphabet size and the number of transitions [D'Antoni and Veanes(2017), Veanes et al.(2012)]. Although in most cases such an exponential blow-up is unavoidable in the worst case, clever algorithms that circumvent it in practice have been devised for basic automata operations (e.g. boolean operations, transductions, learning, etc.). This takes us to the question of Parikh's Theorem in the setting of symbolic automata, which has so far not received much attention in the literature on symbolic automata.
A natural counterpart of letter-counting abstractions in the framework of symbolic automata is predicate-counting abstractions. Let us revisit Example 1.3 but with the letters a and b instantiated with different character classes. Let us start with a := \d (meaning a digit) and b := \D (meaning a non-digit). In this case, the predicate-counting abstraction with respect to a and b simply counts the numbers of occurrences of digits and non-digits in each string instantiation of x, y, z. The same reasoning used in Example 1.3 then allows us to prove unsatisfiability. On the other hand, consider a := \s (including space symbols, tabs, and newlines) and b := . (meaning any character, except for a newline). Then, a (resp. the complement ā) and b (resp. b̄) have non-empty intersections. (More precisely, a ∩ b contains a space symbol, a ∩ b̄ contains a newline character, and ā ∩ b contains (say) a digit.) In general, n predicates in a symbolic automaton (equivalently, a symbolic regular expression) can give rise to O(2^n) different "combinations" (a.k.a. min-terms). In other words, predicate-counting abstractions over symbolic automata can be reduced to letter-counting abstractions of normal automata, but over an exponentially bigger alphabet. Thus, the linear-time construction of Verma et al. for Parikh images of automata/grammars with (say) 14 predicates would already yield a large LIA formula with more than 15000 integer variables, which is very challenging for existing SMT-solvers.
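The min-term blow-up, and the fact that not all min-terms need be feasible, can be illustrated with rough Python models of the two character classes (our own approximations of \s and .):

```python
from itertools import product

# Rough Python models of the character classes discussed above (approximate,
# for illustration only):
predicates = {
    "s":   lambda c: c in " \t\n",    # \s : whitespace
    "dot": lambda c: c != "\n",       # .  : any character except newline
}

# n predicates give rise to up to 2^n candidate min-terms.
assert len(list(product([True, False], repeat=len(predicates)))) == 4

def feasible_minterms(preds, sample):
    """Which min-terms are realized by at least one character of the sample?"""
    names = list(preds)
    return {tuple(preds[n](c) for n in names) for c in sample}

# A space realizes s AND dot, a newline realizes s AND NOT dot, a digit
# realizes NOT s AND dot; the fourth min-term (NOT s AND NOT dot) is
# infeasible, as it would require a non-whitespace newline.
assert feasible_minterms(predicates, " \t\n7x") == {
    (True, True), (True, False), (False, True)}
```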

Contributions.
Our main result is the first polynomial-time algorithm for computing an existential formula that captures the predicate-counting abstraction of a given symbolic automaton. In fact, the algorithm extends to more expressive formalisms, namely, symbolic context-free grammars, even when they are extended with "read-only registers"; this is a model referred to as parametric symbolic grammars, which extends symbolic automata [D'Antoni and Veanes(2021), D'Antoni and Veanes(2017)], symbolic visibly pushdown automata [D'Antoni and Alur(2014)], and symbolic variable/parametric automata [Grumberg et al.(2010), Faran and Kupferman(2020), Figueira et al.(2022)]. This new formalism has further applications including solving complex string constraints, e.g., with context-free constraints and, to some extent, the infamously difficult operator to_re, which converts strings to regular expressions. We have provided an implementation of our algorithm and demonstrated its efficacy in solving some difficult string constraint examples. We detail these contributions below.
As described above, the main technical difficulty of our problem is that the standard reduction from a symbolic automaton A to a normal automaton A′ yields an exponential-sized alphabet, i.e. 2^n when counting n predicates over a theory T. This is in general not avoidable; e.g., symbolic regular expressions of the form P_1 P_2 ⋯ P_n over the LIA algebra, where P_i represents the set of all numbers that are congruent to 0 modulo the i-th prime, have O(2^n) feasible min-terms. It turns out that, when considering predicate-counting abstractions, if w is accepted by A, then there is a w′ that is also accepted by A such that the predicate-counting abstractions of w and w′ are the same and w′ uses only O((n + |N|) log(n + |N|)) different letters. Notice that this is an almost linear bound. In the case when A is a parametric symbolic grammar, the bound on the alphabet size additionally depends on ℓ, where ℓ represents the maximum length of the right-hand side of a production in A. Furthermore, we show that we can compute an existential T+LIA formula ϕ_A that captures this predicate-counting abstraction. The formula ϕ_A can be solved easily in the standard SMT framework of DPLL(T, LIA) (e.g. [Kroening and Strichman(2008)]), which uses the LIA and T solvers separately to add blocking lemmas. It follows immediately that we obtain decision procedures for analyzing satisfiability of predicate-counting abstractions (possibly restricted with additional LIA formulas) with a tight complexity upper bound: if T is NP-complete (resp. PSpace-complete), then our problem is also NP-complete (resp. PSpace-complete).
Since string constraints are defined over the Unicode alphabet, one natural application of our result is in checking unsatisfiability of string constraints. By means of predicate-counting abstractions, we show how string constraints can be abstracted into the Parikh image of a symbolic grammar with an additional LIA restriction. Here, we allow an expressive and well-known subclass of string constraints (in particular, a commonly used subclass of the QF_SLIA theory of SMT-LIB 2.6 [Barrett et al.(2017)]), which permits string concatenation, string equations, replace, regular constraints, contains, prefix-of, and suffix-of. Note that unsatisfiability of the abstraction implies unsatisfiability of the original string constraint, but not the converse; we are not aware of any non-trivial class for which the converse implication would hold (a trivial one is over a unary alphabet). At the same time, our result admits an easy extension to sequence theories, which permit general effective boolean algebras (e.g. see [Jeż et al.(2023)]).
We have implemented this translation, which takes an SMT-LIB file and produces a quantifier-free LIA formula, which can be easily checked using SMT-solvers. Our experimental results show that our procedure can substantially outperform existing string solvers for proving unsatisfiability (details are in Figure 1).
Finally, as mentioned above, our paper establishes Parikh's Theorem for generalizations of symbolic automata, i.e., parametric (symbolic) grammars and parametric (symbolic) pushdown automata. Such formalisms are highly expressive: e.g., they can express Dyck languages with infinitely many parenthesis symbols. This has many potential applications. The first application is the support of symbolic context-free constraints, i.e., expressions of the form x ∈ L, where L is given by a parametric grammar. In fact, classical results on string analysis (e.g. [Christensen et al.(2003), Minamide(2005)]) heavily use context-free constraints, which are not supported by SMT-LIB 2.6, but are supported by a handful of modern string solvers (e.g. TRAU [Abdulla et al.(2018)]). The second application is partial support of a "future-looking" feature in SMT-LIB 2.6: to_re, which converts a string (possibly with variable names) to a regular expression. This is highly expressive; e.g., it allows encoding word equations with Kleene star like xy = z*. Existing benchmarks allow only a very limited usage of to_re: strings with only constants (i.e. no variables) as input. Using parametric grammars, we can encode some interesting use cases of to_re beyond strings with only constants; e.g., we can encode parametric regular constraints of the form x ∈ y*, where both x and y are variables. Finally, parametric pushdown automata strictly extend the model of symbolic visibly pushdown automata [D'Antoni and Alur(2014)], which has applications to dynamic analysis of programs. By allowing parameters and pushing values onto the stack, our model allows some support of static analysis as well (of course with restrictions, for otherwise undecidability would result).
Organization. We fix our notation and introduce our notion of parametric context-free grammar in Section 2. Also in this section, we show that these grammars admit a representation as parametric pushdown automata. In Section 3 we prove our new Parikh's Theorem for parametric symbolic grammars. In Section 4, we provide an abstraction of string constraint solving via predicate-counting constraints, and outline an extension to sequence theories. We describe our implementation and report our experimental results in Section 5. We conclude the paper in Section 6 with related work and future work.

Models
We start by introducing some basic notation for quantifier-free theories. Then we introduce parametric context-free grammars and finally an equivalent model of pushdown systems.

Preliminaries
For simplicity, we follow a model-theoretic approach to define our symbolic alphabets. This allows a more convenient treatment of parameters in our automata (e.g. see [Jeż et al.(2023)]). Let σ be a set of vocabulary symbols. We fix a σ-structure S = (D; I), where D can be a finite or an infinite set (i.e. the universe) and I maps each function/relation symbol in σ to a function/relation over D. The elements of our sequences will range over D. We assume that the quantifier-free theory T_S over S (including equality) is decidable. Examples of such T_S abound in SMT, e.g., Linear Real Arithmetic and Linear Integer Arithmetic. We write T instead of T_S when S is clear. Our quantifier-free formulas will use uninterpreted T-constants (which will represent the parameters) a, b, c, ..., and local variables x, y, z, .... We use C to denote the set of all uninterpreted T-constants and V to denote the set of all local variables.
We write ϕ(X , Y) for a formula that is a Boolean combination of terms constructed from functions/relations in σ, uninterpreted constants X ⊆ C, and local variables Y ⊆ V. We say such a formula is a formula of T. An existential formula of T is of the form ∃x_1, ..., x_n. ϕ, where ϕ is a quantifier-free formula of T. An interpretation or assignment of the constants (resp. variables) is a map C → D (resp. V → D). We write ϕ(X, Y) for the formula under these interpretations. We write T |= ϕ(X, Y) if ϕ is true in S under interpretations X, Y. A formula ϕ is satisfiable if there are interpretations X, Y such that the formula becomes true in S.
We write T_1 + T_2 for the Boolean combination of theories T_1 and T_2, that is, quantifier-free Boolean combinations of formulas ϕ where ϕ is a quantifier-free formula either of T_1 or of T_2. When T_1 and T_2 satisfy certain conditions, decision procedures for T_1 and T_2 can be combined, e.g. with Nelson-Oppen [Nelson and Oppen(1979)].
In brief, a production (A, α, ϕ) of a parametric context-free grammar is an extension of a production (A, α) of a classical context-free grammar. The production replaces a non-terminal A. However, α is not a sequence of characters and non-terminals, but a sequence of local variables and non-terminals, e.g. yAy. The final component ϕ is a guard over both the parameters of the grammar and the local variables. E.g., ϕ may assert y = x for the parameter x. Hence, if x took the concrete value a, the production would replace A with aAa.
Definition 2.1 (Parametric Grammar). A parametric grammar over an alphabet theory T is of the form G = (X , N , D, P, S), where X is a finite set of parameters, N is the set of nonterminal symbols, D is the (perhaps infinite) set of symbols from the domain, S is the starting symbol, and P is a finite set of triples (A, α, ϕ) from N × (Y ∪ N )* × T (X , Y), where Y is the set of local variables.
Here, parameters are uninterpreted T-constants, i.e. X ⊆ C; the Y variables are "local" in the sense that they are instantiated each time the production is used. Formulas that appear in the productions from P will be referred to as guards, since they restrict which symbols can be produced.
The semantics is defined by generalizing the rewriting-style definition of derivation for context-free grammars. A derivation begins with the singleton sequence β_0 consisting only of the starting symbol S. From a sequence βAβ′ we can derive βα′β′, when α′ can be derived from A, which is defined as follows: For interpretations X of X and Y of Y we write α[X /X, Y/Y] for the substitution of the constants in X by their interpretation under X and the variables in Y by their interpretation under Y. The nonterminal A can be rewritten by a rule (A, α, ϕ(X , Y)) with α′ = α[X /X, Y/Y] when T |= ϕ(X, Y). We call α[X /X, Y/Y] an instantiation of α and often denote it by α′. We will often refer to a rule A → α in this case, suppressing the actual instantiation.
Then L_X(G) is the set of nonterminal-free sequences from D* that can be derived by G for an interpretation X of the parameters, and L(G) = ⋃_X L_X(G).
Notice that the parameter x takes the same value c in all productions, but the local variable y can take a different value (d_i) during each application of the productions.
Assuming D = N we can add conditions on the productions: (S, yxSxy′, "y > x ∧ y + y′ = 0"), (S, yxxy′, "y > x ∧ y + y′ = 0"). In proofs dealing with the properties of the derivations (and not the derived sequences), we focus on rules A → α and their instantiations α′, and not the guards, which are defined implicitly by the choice of rule.
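A derivation step can be sketched in Python (the encoding and the simplified production, with a single local y reused at both ends and only the guard y > x, are ours). Note how the parameter interpretation is fixed across productions while the local variable is re-instantiated each time:

```python
def instantiate(alpha, X, Y):
    """Substitute parameter and local-variable names in alpha by their values;
    nonterminals (here the string "S") are left untouched."""
    return [X.get(s, Y.get(s, s)) for s in alpha]

# Simplified productions (S, yxSxy, phi) and (S, yxxy, phi) with parameter x,
# local variable y, and guard phi := y > x.
guard = lambda X, Y: Y["y"] > X["x"]

X = {"x": 1}                 # the parameter interpretation is fixed once
Y1, Y2 = {"y": 5}, {"y": 7}  # locals are re-instantiated per application
assert guard(X, Y1) and guard(X, Y2)

step1 = instantiate(["y", "x", "S", "x", "y"], X, Y1)   # [5, 1, 'S', 1, 5]
step2 = instantiate(["y", "x", "x", "y"], X, Y2)        # [7, 1, 1, 7]

# Rewriting the remaining nonterminal S yields a nonterminal-free sequence:
word = step1[:2] + step2 + step1[3:]
assert word == [5, 1, 7, 1, 1, 7, 1, 5]
```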
Note that when all rules of the grammar are of the form A → yB with |y| = 1 and B ∈ N, or A → ε, then we can think of G as an automaton (with N taking the role of states, S being the initial state, and nonterminals B with a rule B → ε being the final states). Such an automaton is an extension of both symbolic automata [D'Antoni and Veanes(2021), D'Antoni and Veanes(2017), Veanes et al.(2012)] and variable automata [Grumberg et al.(2010), Faran and Kupferman(2020)], and is referred to as a parametric (symbolic) automaton [Figueira and Lin(2022), Figueira et al.(2022), Jeż et al.(2023)].
Proposition 2.3. Assume that T is solvable in NP (resp. PSpace). Then, deciding nonemptiness of a parametric grammar over T is in NP (resp. PSpace).
Proof. The proof is a generalization of a standard proof for context-free grammars and of proofs for parametric automata, see [Faran and Kupferman(2020), Figueira et al.(2022), Figueira and Lin(2022)]. The language L(G) is nonempty if and only if it is nonempty for some parameters X, which we will existentially quantify over.
As in the case of standard context-free grammars, (for fixed parameters) the language L_X(G) is nonempty if and only if we can order a subset of nonterminals A_0, A_1, ..., A_k, where S = A_0, and for each i find a rule (A_i, α_i, ϕ_i) such that: if A_j is in α_i then j > i; and for each i the guard ϕ_i(X, Y) is satisfiable (for some Y). Hence we guess the sequence A_0, A_1, ..., A_k, verify the first condition, and then verify the formula ∃X . ⋀_{i=0}^{k} ∃Y. ϕ_i(X, Y). Clearly, if T is in NP (resp. PSpace) then the above algorithm is in the same class.
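The classical, deterministic counterpart of this guessing procedure is the standard productive-nonterminal fixpoint; the following Python sketch (our encoding) ignores the guards, whose satisfiability would be delegated to a T-solver in the parametric setting:

```python
def nonempty(productions, start):
    """Nonemptiness check, with terminals dropped from right-hand sides and
    guards assumed satisfiable (a T-solver would discharge them in the real
    procedure): a nonterminal is productive iff some rule rewrites it into
    productive nonterminals only."""
    productive, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs_nonterminals in productions:
            if lhs not in productive and all(n in productive
                                             for n in rhs_nonterminals):
                productive.add(lhs)
                changed = True
    return start in productive

rules = [("S", ["S"]), ("S", ["T"]), ("T", ["T"])]   # no terminating rule
assert not nonempty(rules, "S")
rules.append(("T", []))                              # add T -> epsilon
assert nonempty(rules, "S")
```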

Parametric pushdown automata
As in the case of usual context-free grammars, there is a pushdown equivalent of our parametric grammars. Some care is needed, as some combinations of allowed transitions lead to a much more powerful model: for instance, it is not difficult to show that if we do not allow parameters then our grammar model cannot express the language ⋃_{d∈D} d*, yet it is easy to come up with a pushdown automaton which can recognize such languages (with no parameters): it is enough to store the first symbol on the stack and then compare each consecutive symbol with it; hence such a run should not be allowed. However, it seems natural that we should allow storing elements of D on the stack, so that, say, palindromes can be recognized. As a solution, if the automaton sees an element from D on the top of the stack then it must pop it from the stack. It cannot push and cannot make an ε-transition.
Nondeterministic parametric pushdown automata are formally defined as follows (cf. the standard definition [Sipser(2013)]). We explain some restrictions on the definition below.
Definition 2.4 (Nondeterministic Parametric Pushdown Automata). A nondeterministic parametric pushdown automaton is a tuple (Q, D, Γ, ∆, q_0, F), where, as usual, Q is a finite set of states, Γ is a finite stack alphabet, q_0 ∈ Q is the initial state, and F ⊆ Q is the set of final states; the transition relation ∆ is partitioned into four sets ∆_{Γ∪{ε}}, ∆_D, ∆_{ε,Γ∪{ε}}, and ∆_{ε,D}, whose roles are explained below.
Our definition is a strict extension of symbolic visibly pushdown automata [D'Antoni and Alur(2014)]. In the above definition, curr is a variable bound to the symbol from D read by the automaton, and top is a variable bound to the stack top symbol, which is from D. The four cases of the transition function are for ease of presentation, as in principle one could define ∆ as one set with some syntactic conditions. The case distinction is as follows: In ∆_{Γ∪{ε}} ∪ ∆_D the automaton reads an input letter (bound to the variable curr), and in ∆_{ε,Γ∪{ε}} ∪ ∆_{ε,D} it does not, i.e. those are ε-transitions and they do not refer to curr in the guards, nor in the word pushed to the stack. In ∆_{Γ∪{ε}} ∪ ∆_{ε,Γ∪{ε}} the stack top-most symbol is from Γ or we do not read the stack at all; in ∆_D ∪ ∆_{ε,D} the stack top-most symbol is from D, in which case we are not allowed to push anything to the stack; on the other hand, we can use it in the guard, say for comparison with curr. The reason for not allowing pushing to the stack in this case is that we do not want to copy the stack contents, which easily leads to recognition of the language ⋃_{d∈D} d*, which should not be recognized without parameters.
As in the case of parametric grammars, the sequence pushed to the stack may in general depend on the read character curr, some local variables Y, and the parameter interpretation X. We require that when α′ is actually pushed to the stack (say for a transition in ∆_{Γ∪{ε}}) then α′ = α[curr/d, X /X, Y/Y], where d is the character read and Y is any assignment to Y such that ϕ(d, a, X, Y) holds, where ϕ is the guard of the rule and a is the top-of-stack character; α′ for ε-transitions is defined similarly, i.e. as α[X /X, Y/Y].
Note that the guards provide expressive power: e.g. for palindromes, we can store the first half of the read word on the stack and then, for the second half, check equality with the read symbol while popping character by character from the stack, that is, using the guard top = curr.
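The palindrome idea can be sketched concretely in Python (our encoding; the nondeterministic choice of the switch point is resolved by exhaustive search, and only the even-length case is modeled):

```python
def accepts_palindrome(word):
    """Even-length palindromes via an explicit stack and the guard top = curr;
    the nondeterministic switch to the popping phase is tried exhaustively."""
    for split in range(len(word) + 1):   # guess where the second half starts
        stack, ok = [], True
        for i, c in enumerate(word):
            if i < split:
                stack.append(c)                  # push the first half
            elif stack and stack[-1] == c:       # guard top = curr holds: pop
                stack.pop()
            else:
                ok = False
                break
        if ok and not stack:                     # input consumed, stack empty
            return True
    return False

assert accepts_palindrome("abba")
assert accepts_palindrome("")
assert not accepts_palindrome("abab")
```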
Let us describe the semantics; we will focus on ∆_{Γ∪{ε}}, as the other cases are defined similarly. A configuration is a tuple (q, w) where q ∈ Q is a state and w ∈ (Γ ∪ D)* is the stack contents; as a convention, we assume that the stack top-most symbol is the first in w. Take a configuration (q, aw), where a ∈ Γ ∪ {ε}, and a transition (q, a, ϕ, q′, α), and let d be the symbol read by the automaton. If ϕ(d, X, Y) holds for some Y, then the automaton can change the configuration to (q′, α[curr/d, X /X, Y/Y]w). Note that if a ∈ Γ then we need to pop it from the stack, and if a = ε then the transition does not depend on the stack contents. For ∆_D the move is defined analogously, but when the stack contents is sw for s ∈ D, the guard is evaluated as ϕ(d, s, X, Y). The semantics of ε-transitions is defined similarly.
A word w ∈ D* (for parameter interpretation X) is accepted if there is a run for w from (q_0, ε) to (q, w′) for some w′ and q ∈ F. By L_X(A) we denote the language recognized by a parametric NPDA A for a given interpretation X of the parameters, and we define L(A) = ⋃_X L_X(A).
Theorem 2.5.The class of languages recognized by parametric context-free grammars and parametric non-deterministic pushdown automata coincide.
The equality is shown using two natural inclusions, proven in the Lemmata below.
Lemma 2.6. Given a parametric grammar G we can compute in polynomial time a parametric NPDA A of size linear in the size of G such that for each parameter interpretation X we have L_X(A) = L_X(G).
Proof. The proof is an adaptation of the classic proof, see [Sipser(2013), Lemma 2.21]; note that since ε-transitions are allowed, we do not need Greibach normal form, which is a little cumbersome in the parametric setting.
Given a sequence w, the automaton A will simulate the derivation of G by always greedily expanding the left-most nonterminal and matching the left-most unmatched letter of the input sequence. We use the same parameter interpretation X as G does. The automaton has three states q_0, q, and q_f. The starting state is q_0 and q_f is the unique accepting state. The stack alphabet Γ is N ∪ {⊥}, where ⊥ represents the stack bottom. In q_0, the automaton A pushes S⊥ to the stack (so S is top-most) and moves to q; here S is the starting symbol of the grammar. In q, if A sees ⊥ on the stack then it moves to q_f and accepts (it cannot proceed). Otherwise, if the top-most symbol is A ∈ N then the automaton chooses (nondeterministically) a rule A → α and its (valid) instantiation α′, pops A, and pushes α′ to the stack. If the top-most symbol is d ∈ D and the next input symbol is d (that is, top = curr) then the automaton pops the letter and reads the next symbol from the input (and stays in q). It is easy to see that the resulting automaton A has size linear in G, and can furthermore be computed in polynomial time. A slight modification of a standard proof shows that for each instantiation X of the parameters we have L_X(A) = L_X(G).
The other direction is slightly more involved. Our construction is exponential in the maximum length of α appearing in a transition (q, c, ϕ, q′, α) ∈ ∆_{Γ∪{ε}} or in a transition (q, ϕ, q′, α) ∈ ∆_{ε,Γ∪{ε}}. Unlike in the non-parametric case, we cannot split pushing transitions so that each transition pushes at most one symbol to the stack. This is because the values of Y and curr cannot be transferred across separate transitions. Thus a rule pushing α to the stack needs an exponential number of productions to handle all sequences of intermediate states that may occur while α is later being popped. However, so long as we fix the maximum length of such α, the resulting grammar is of polynomial size and can be computed in polynomial time.
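The left-most-expansion simulation from the first direction of the proof can be sketched in Python (our encoding; nondeterminism is resolved by depth-bounded exhaustive search, and guards/parameters are omitted):

```python
def accepts(rules, start, word, bound=50):
    """Simulate the three-state construction: expand the left-most nonterminal
    on the stack, or pop a terminal when it matches the next input symbol
    (top = curr). Nondeterminism is resolved by depth-bounded search."""
    def run(stack, rest, depth):
        if depth > bound:
            return False
        if not stack:                       # stack bottom reached
            return not rest                 # accept iff the input is consumed
        top, *below = stack
        if top in rules:                    # nonterminal: pop it, push a rhs
            return any(run(list(rhs) + below, rest, depth + 1)
                       for rhs in rules[top])
        # terminal on top: pop it iff it matches the next input symbol
        return bool(rest) and rest[0] == top and run(below, rest[1:], depth)
    return run([start], list(word), 0)

G = {"S": [("a", "S", "b"), ()]}            # toy grammar: S -> a S b | epsilon
assert accepts(G, "S", "aabb")
assert not accepts(G, "S", "aab")
```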
Lemma 2.7. Given a parametric NPDA A there is a parametric grammar G such that for each parameter interpretation X we have L_X(G) = L_X(A). The size of G is exponential in the maximum number M of pushed symbols appearing in any transition of A. When M is fixed, the algorithm runs in polynomial time.
Proof. We modify a standard construction, cf. [Sipser(2013), Lemma 2.27]. For a fixed parameter interpretation X, let L_X(A_{q,q′}) be the language recognized by A with starting state q and final state q′. We modify A so that:
• it has a single accepting state;
• it empties the stack before accepting;
• in each move it either pushes something (perhaps the empty sequence) to the stack or pops from the stack, but not both.
The first two conditions are easy to ensure; the last depends on the form of the transition:
• If the transition is from ∆_D ∪ ∆_{ε,D} then it does not push, as required.
• If the transition is from ∆_{Γ∪ε} ∪ ∆_{ε,Γ∪ε} and it reads ε from the stack, then it does not pop from the stack, as required.
• If the transition is from ∆_{Γ∪ε} ∪ ∆_{ε,Γ∪ε} for a topmost symbol γ ∈ Γ, then we create a new state q_{q,γ} and an ε-transition from q that removes γ without reading a letter and goes to q_{q,γ}. Then from q_{q,γ} the automaton ignores the stack and acts as if it were in q with γ on top of the stack.
The defined automaton recognizes the same language (for parameter interpretation X). Note that the conditions on the transition relation when the topmost symbol is from D are tailored so that the above separation of popping and pushing is possible.
In a standard proof of the equivalence of NPDAs and CFGs, cf. [Sipser(2013), Lemma 2.27], the computation of A is split into parts in which it empties the stack (of the symbols it introduced). The assumption that A pushes at most one element to the stack makes the proof easier; however, we need to push more symbols. But this only means that when a_1 a_2 ⋯ a_k is pushed to the stack, the computation is split into k subcomputations, in which it removes a_1, a_2, …, a_k from the stack.
To be more precise, let the set of nonterminals be Q × Q; we denote its elements by A_{q,q′}, with the intention that L_X(A_{q,q′}) is the language of words on which A, starting in q (with empty stack), reaches state q′ with an empty stack. In particular, we set A_{q_0,q_f} as the starting symbol, where q_0 is the starting state and q_f the unique final state; then L_X(A_{q_0,q_f}) = L_X(A).
Recall the classic construction, in which each rule pushing an element to the stack pushes at most one symbol. When describing the computation taking the automaton from q to q′ and from the empty stack to the empty stack, i.e. corresponding to the nonterminal A_{q,q′}, either the stack is emptied somewhere on the way, say at state q′′, which means that we have a rule A_{q,q′} → A_{q,q′′} A_{q′′,q′}, or it is emptied only at the last step. In the latter case, if the first transition pushes s to the stack, the last removes it from the stack, and in the meantime the computation is as if it started and ended on an empty stack. Hence the rule is of the form A_{q,q′} → a A_{p,p′} b such that there is a transition from q to p reading a and pushing s, and a transition from p′ to q′ popping s and reading b (both a and b can be a letter or ε).
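For the classical single-push normal form, the rule generation can be sketched directly. The transition encoding `(state, letter-or-None, action, stack symbol, target state)` below is an assumption of this sketch, not the paper's notation; nonterminals A_{q,q′} are represented as state pairs.

```python
from itertools import product

def triple_rules(states, transitions):
    rules = []
    for q in states:                                # A[q,q] -> ε
        rules.append(((q, q), []))
    for q, q1, q2 in product(states, repeat=3):     # stack emptied midway, at q1
        rules.append(((q, q2), [(q, q1), (q1, q2)]))
    pushes = [t for t in transitions if t[2] == "push"]
    pops = [t for t in transitions if t[2] == "pop"]
    for (q, a, _, s, p) in pushes:                  # first transition pushes s ...
        for (p1, b, _, s1, q1) in pops:             # ... last transition pops the same s
            if s == s1:
                body = ([a] if a else []) + [(p, p1)] + ([b] if b else [])
                rules.append(((q, q1), body))       # A[q,q1] -> a A[p,p1] b
    return rules

# one push and one matching pop over states {p, q}
rules = triple_rules(["p", "q"],
                     [("p", "a", "push", "X", "p"),
                      ("p", "b", "pop", "X", "q")])
print(len(rules))  # 11 = 2 ε-rules + 8 splitting rules + 1 push/pop pair
```

The parametric construction in the text cannot be factored this way, which is exactly why it compounds a pushing transition with all k popping transitions into one production.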
In our case we cannot assume that a transition pushes just a single symbol to the stack, as the sequence pushed may contain the same local variable or curr several times, and splitting it into many rules would lose this connection. Hence we need to consider rules pushing, say, s_1 ⋯ s_k together with the k transitions that pop those symbols (in between, A acts as if it were starting and ending on the empty stack), and "compound" their computation into a single production.
We include the rules given below. The first bullet point states that a nonterminal A_{q,q} can be rewritten to ε; this simulates a move from q to q without firing any transitions. In the second bullet, the production A_{q,q′′} → A_{q,q′} A_{q′,q′′} handles the case where the stack is emptied at state q′ on the run from q to q′′.
In the third bullet we consider a transition from ∆_{Γ∪ε} (so the one pushing s_1 ⋯ s_k to the stack and not looking at the stack) together with states q_2, …, q_{k+1}, the states in which s_1, …, s_k are popped, and the popping transitions fired at those states. Here y_i is the symbol read by the i-th transition, Y_i are the local variables of the i-th transition, s′_i is the symbol popped by the i-th transition and ϕ_i is the guard of the i-th transition; these are described in detail below. Note that if the i-th transition for 0 ≤ i ≤ k is an ε-transition then ϕ_i does not depend on curr, and if it does not read the stack top element then ϕ_i does not use top, but we write it in this way to streamline the argument.

Concerning the y_i, corresponding to the letters read by the popping transitions: if the i-th transition is an ε-transition then y_i = ε, and otherwise y_i is a fresh local variable (which appears in the guard). Concerning the s′_i, they "should be" s_i, but a problem arises when s_i is equal to curr, i.e. when the symbol pushed to the stack is the letter read; then curr "should be" y_0. Hence, if s_i = curr we define s′_i = y_0; that is, at q_0 the read symbol y_0 was pushed to the stack. Otherwise, if s_i ≠ curr, we set s′_i = s_i, which can be an element of either Γ or D.

We use different fresh copies Y_i of the local variables. This is because multiple pushdown transitions are encoded in a single rule, and the values of the local variables may differ for each transition fired. Hence, we need separate copies for the different guards ϕ_i combined in the grammar production. Notice that these separate copies are never pushed onto the stack, as there is only one pushing transition represented by the production.
The proof is now a generalization of the standard one [Sipser(2013), Lemma 2.27]. When starting from q with an empty stack and ending in q′′, if we reach the empty stack somewhere on the way, say at q′, we use A_{q,q′′} → A_{q,q′} A_{q′,q′′}. Otherwise, we push some s_1 ⋯ s_k to the stack and take the symbols from the stack one by one: for each i there is a first moment when s_i is taken from the stack, and from the moment we took s_{i−1} until right before we take s_i, the automaton A acts as if on an empty stack.

Computing Parikh images
We first generalise the notion of a Parikh image to the parametric setting. Then we discuss the construction of Parikh images and the complexity of a related decision problem.

Definition
We first recall the classical definition of the Parikh image of a language. Take a finite alphabet Σ = {a_1, …, a_n}. For a word w ∈ Σ*, let |w|_a be the number of occurrences of the character a in the word w. For a linearisation a_1, …, a_n of the characters of Σ, the Parikh image P(w) of w is the mapping f : {1, …, n} → N such that f(i) = |w|_{a_i} for all 1 ≤ i ≤ n. For a language L ⊆ Σ*, the Parikh image is the set of mappings {P(w) : w ∈ L}. That is, the Parikh image counts the number of occurrences of each character in each word of L.
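The classical definition is easily computed for a single word; a minimal sketch:

```python
from collections import Counter

def parikh(word, alphabet):
    # Parikh image of a word for a fixed linearisation of the alphabet:
    # the i-th component is the number of occurrences of the i-th character
    counts = Counter(word)
    return tuple(counts[a] for a in alphabet)

print(parikh("abracadabra", "abcdr"))  # (5, 2, 1, 1, 2)
```

The Parikh image of a language is then the set of such vectors over all its words; for infinite languages this set is only finitely representable via semilinear sets, which is what Parikh's Theorem provides.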
Parametric context-free grammars may have large or even infinite alphabets, so counting each of the characters is either impractical or impossible. Hence, we define a version of the Parikh image that is relative to a sequence of predicates: we count the number of characters satisfying each predicate, rather than the number of occurrences of each individual character. Let us fix a parametric grammar G over T and a sequence Ψ := ψ_1, …, ψ_n of T-formulas.
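The predicate-counting image P_Ψ can be sketched with Python predicates standing in for T-formulas; the particular predicates below are illustrative. Note that a single character may satisfy several predicates, so the counts can sum to more than the length of the word.

```python
def parikh_psi(word, predicates):
    # count, for each predicate, how many positions of the word satisfy it
    return tuple(sum(1 for c in word if p(c)) for p in predicates)

psi = [lambda c: True,            # ψ1 = ⊤ (counts the length)
       lambda c: c.isdigit(),     # ψ2: decimal digits
       lambda c: ord(c) > 127]    # ψ3: non-ASCII characters

print(parikh_psi("a1é2", psi))  # (4, 2, 1)
```

This is the abstraction the rest of the section computes symbolically: rather than enumerating a Unicode-sized alphabet, only the n predicate counts are tracked.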

Representing the Parikh Image
We can construct a formula in the combined theory of T and quantifier-free Presburger arithmetic that represents the Parikh image of a given parametric grammar G. Such a formula is polynomial in size and can be used as part of a query to an SMT solver to solve decision problems over the grammar. Below, let QFPresburger be the quantifier-free fragment of Presburger arithmetic.
Theorem 3.2. Let G be a parametric grammar over the theory T and take a sequence Ψ := ψ_1, …, ψ_n of T-formulas. There is an existential formula, of polynomial size and computable in polynomial time, over the combined theory of T and QFPresburger whose solutions on the free variables x_1, …, x_n are exactly the mappings in P_Ψ(L(G)).

In principle there are exponentially many (in n) different symbols a which yield different images P_Ψ(a). A naive approach to computing Parikh images would compute the possible images f of single characters a and calculate their possible sums in words generated by G. Since this may require an exponential number of different characters, the naive approach implies at least an exponential running time. We show that this blowup is avoidable and that the formula can be polynomial in size.
The main step of the proof is to show that if f ∈ P_Ψ(L(G)) then there is a sub-domain D_0 ⊂ D of nearly linear (in n and |N|) size such that f ∈ P_Ψ(L(G) ∩ D_0*) (the actual statement is more precise). This essentially reduces the problem to the standard case of a finite alphabet; some details are still needed (like finding the values of the parameters, finding the exact sub-domain, etc.), but this is the crucial step. Note that this finitization method does not hold for some models over infinite alphabets, e.g., the semilinear data automata of [Figueira and Lin(2022)], which can impose that there are exponentially many different elements of the domain in a given predicate ψ. A similar result was shown independently to obtain Carathéodory bounds for integer cones [Eisenbrand and Shmonin(2006)] and used to derive parallel results on the complexity of non-emptiness of symbolic tree automata [Raya(2023)].
To show the existence of such a sub-domain, we consider derivations of a word and focus on the number of times (the counts) each production is used. Note that the Parikh image is determined by those counts. Hence, the counts are an "intermediate notion" between exact derivations and Parikh images, and we focus on them. For classic context-free grammars it is well known that a derivation with given counts exists if and only if those counts satisfy simple arithmetic constraints. Those constraints can be formulated in terms of sums of mappings, which are similar to the Parikh image mappings.
Then, using combinatorial arguments, we show that if we have a large number of such mappings (i.e. we use many productions) then there are subsets that have the same effect: in any derivation we can replace a subset of the productions with another subset without changing the Parikh image of the result. Moreover, we can construct weights for such sets and ensure that each time we make a replacement, the weight drops, which means that the replacing terminates at some point. The result is a derivation with a small number of different productions used.
We illustrate this process with a small example over characters rather than productions. Take the language D* and assume we have three predicates Ψ = ψ_1, ψ_2, ψ_3. Take a word a_1 a_2 a_3 a_4 a_5 using five different characters, and suppose that the Parikh image (using vectors to represent the maps) is such that P_Ψ(a_3 a_4 a_5) = P_Ψ(a_1 a_2); hence we can construct the same Parikh image using only a_1 and a_2, that is, P_Ψ(a_1 a_2 a_1 a_2) = P_Ψ(a_1 a_2 a_3 a_4 a_5). Our proof shows that if a large number of different productions is used, the same principle allows us to find subsets with the same sum.
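The concrete vectors of the example did not survive extraction; the ones below are hypothetical but exhibit the same phenomenon: the images of a_3, a_4, a_5 sum to the images of a_1 and a_2.

```python
# hypothetical per-character images P_Ψ(a) over three predicates ψ1, ψ2, ψ3
images = {
    "a1": (1, 1, 0), "a2": (0, 0, 1),
    "a3": (1, 0, 0), "a4": (0, 1, 0), "a5": (0, 0, 1),
}

def image_of(word):
    # P_Ψ of a word is the component-wise sum of its characters' images
    return tuple(sum(images[a][i] for a in word) for i in range(3))

w1 = ["a1", "a2", "a3", "a4", "a5"]
w2 = ["a1", "a2", "a1", "a2"]
print(image_of(w1) == image_of(w2))  # True
```

Here P_Ψ(a_3) + P_Ψ(a_4) + P_Ψ(a_5) = (1, 1, 1) = P_Ψ(a_1) + P_Ψ(a_2), so replacing a_3 a_4 a_5 by a_1 a_2 preserves the image.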
To make the above intuition formal, we first recall that there is a derivation of a CFG in which each production is used a given number of times if and only if those counts satisfy a simple arithmetical relation [Verma et al.(2005)]; informally: for each nonterminal, the number of times it is introduced and the number of times it is expanded are the same (except for the starting nonterminal), plus a condition guaranteeing a variant of connectedness; this characterization is similar in spirit and proof to the Euler condition for directed graphs. Formally, for a rule A → α let n_{A,α} be its count; then:

Lemma 3.3 (cf. [Verma et al.(2005), Thm. 3, 4]). There is a derivation of a CFG (with starting symbol S) that uses a rule A → α exactly n_{A,α} times if and only if for each B ∈ N

  Σ_α n_{B,α} = [B = S] + Σ_{A→α} n_{A,α} · |α|_B

(where [B = S] is 1 if B = S and 0 otherwise) and the underlying graph is connected. Moreover, if w is generated by such a derivation then |w|_a = Σ_{A→α} n_{A,α} · |α|_a for each letter a.

Here the underlying graph has N as vertices and there is an (undirected) edge {A, B} when n_{A,α} > 0 and |α|_B > 0 for some α. In [Verma et al.(2005), Thm. 3, 4] Verma et al. explicitly give a stronger variant of this claim and implicitly formulate Lemma 3.3 in the proof. Moreover, the condition that the underlying graph is connected can also be formulated as a Presburger arithmetic formula. While the original construction of such a Presburger formula contained a small error, there are alternative, correct variants in the literature (e.g. [Barner(2006)]); we follow those correct constructions.
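The count condition and the connectedness requirement can be sketched for ordinary CFGs as follows. The encoding of productions as `(nonterminal, right-hand-side string)` pairs is an assumption of this sketch, and connectedness is checked as reachability from S in the underlying graph.

```python
def counts_admit_derivation(counts, nonterminals, start):
    # counts maps (A, alpha) to n[A, alpha]; upper-case letters in alpha
    # are nonterminals. Euler-style condition: expansions = introductions,
    # with the start symbol expanded one extra time.
    for B in nonterminals:
        expanded = sum(n for (A, alpha), n in counts.items() if A == B)
        introduced = sum(n * alpha.count(B) for (A, alpha), n in counts.items())
        if expanded != introduced + (1 if B == start else 0):
            return False
    # connectedness: every used nonterminal reachable from start in the
    # (undirected) underlying graph
    edges = {(A, B) for (A, alpha), n in counts.items() if n > 0
             for B in nonterminals if alpha.count(B) > 0}
    reach, frontier = {start}, [start]
    while frontier:
        x = frontier.pop()
        for A, B in edges:
            if A == x and B not in reach:
                reach.add(B); frontier.append(B)
            if B == x and A not in reach:
                reach.add(A); frontier.append(A)
    used = {A for (A, alpha), n in counts.items() if n > 0}
    return used <= reach

# S -> aSb used twice and S -> ε used once: a derivation of aabb exists
print(counts_admit_derivation({("S", "aSb"): 2, ("S", ""): 1}, {"S"}, "S"))  # True
```

Dropping the S → ε count breaks the balance (S would be expanded twice but introduced twice, with no extra expansion for the start), so no derivation exists.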
We extend this characterization to parametric grammars and reformulate the conditions from Lemma 3.3 in terms of mappings, which allows reasoning about Parikh images and derivations at the same time. We consider an extension of the Parikh image mappings that also assigns counts to pairs in {s, t} × N. A pair (s, A) indicates how many times the nonterminal A is the source of a derivation step, and a pair (t, A) indicates how many times A is introduced by a derivation step. Thus, we use mappings f : {1, …, n} ∪ ({s, t} × N) → N. Let d = n + 2|N| denote the size of the domain of the mapping. The mapping restricted to {1, …, n} corresponds to the Parikh image.
Given an instantiation α′ of a production A → α, we denote by f_{A,α,α′} the mapping f with f(i) = (P_Ψ(α′))(i) for i = 1, …, n (we assume that P_Ψ ignores the nonterminals), f(s, A) = 1 and f(s, A′) = 0 for A′ ≠ A, and f(t, B) = |α|_B for each B ∈ N. Note that we associate several mappings with a single production, as there are many instantiations α′ for a fixed α; on the other hand, several instantiations of a rule can have the same mapping.
We consider multisets F of mappings as above; formally, F assigns to each mapping f its multiplicity F(f) ∈ N, and by m · {f} we denote the multiset consisting of m instances of f. By ΣF we denote the element-wise sum Σ_{f∈F} f, where each f is taken with its multiplicity. Given a multiset F of mappings as above, we say that a derivation (for some fixed interpretation of parameters) which uses the instantiation α′ of rule A → α exactly n_{A,α,α′} times corresponds to F when, for each mapping f, the multiplicity of f in F equals the sum of the counts n_{A,α,α′} over all triples with f_{A,α,α′} = f.

Lemma 3.4. Given a multiset F of mappings, there is a derivation corresponding to it if and only if for each A ∈ N

  (ΣF)(s, A) = [A = S] + (ΣF)(t, A)    (1)

and the underlying graph is connected. Moreover, this derivation yields a word with Parikh image ΣF restricted to {1, …, n}.
Here the underlying graph has nodes {A : (ΣF)(s, A) > 0} and edges {A, B} whenever there is f ∈ F such that f(s, A) > 0 and f(t, B) > 0.
Proof. After fixing the parameters and the possible instantiations of the rules, the parametric CFG becomes a CFG over a finite alphabet. We show that condition (1) from the Lemma is equivalent to the one from Lemma 3.3. Note that the condition that the underlying graph is connected is the same in the proven Lemma and in Lemma 3.3.
Suppose that there is a derivation corresponding to F. Denote by n_{A,α,α′} the counts of the instantiations α′ of the rules A → α. Then we can treat the parametric grammar as an ordinary CFG, with a pair (A → α, α′) treated as a single rule A → α′. Hence, as in Lemma 3.3, the numbers n_{A,α,α′} satisfy, for each B ∈ N,

  Σ_{α,α′} n_{B,α,α′} = [B = S] + Σ_{A→α,α′} n_{A,α,α′} · |α′|_B.

Since f_{A,α,α′}(s, B) and f_{A,α,α′}(t, B) count exactly the expansions and the introductions of B, these equations are a reformulation of condition (1).
In the other direction the argument is similar, but note that for a mapping f ∈ F with count n_f we need to give counts for each production A → α and its instantiation α′ such that f_{A,α,α′} = f, so that the sum of those counts is n_f. This is done arbitrarily: given a mapping f ∈ F with count n_f, we choose a single rule A → α and instantiation α′ such that f_{A,α,α′} = f, and set n_{A,α,α′} to n_f. The same argument as above shows that the equations from the Lemma are reformulations of the equations from Lemma 3.3.
The claim on the Parikh image follows: the derivation uses the instantiation α′ of the rule A → α exactly n_{A,α,α′} times, which yields that the contribution of the letters in α′ is as stated.

Now the idea is that if F corresponds to a derivation, then there is a different multiset F′ with the same sum ΣF = ΣF′. Moreover, F′ satisfies condition (1) but uses fewer different mappings than F. In particular, F′ also corresponds to some derivation.
Lemma 3.5. Let a multiset of mappings F satisfy condition (1) from Lemma 3.4, and suppose that there are two multisets F′, F′′ ⊆ F such that ΣF′ = ΣF′′. Then (F \ F′) ∪ F′′ also satisfies condition (1): by the assumption ΣF′ = ΣF′′ we have Σ((F \ F′) ∪ F′′) = ΣF, and hence the assumptions of Lemma 3.4 are met for (F \ F′) ∪ F′′.
We now show that, given a large enough set of mappings, we can always find two of its subsets with the same sum. Note that here we do not use multisets, but simply sets.

Lemma 3.6. Let f_1, …, f_k be pairwise different mappings with a d-element domain and values in {0, 1, …, ℓ}, where k > 2d log(dℓ). Then there are two disjoint subsets of {f_1, …, f_k} with the same sum.
Proof. It is enough to show that there are two subsets with the same sum; they can then be made disjoint by removing their intersection.
There are 2^k different subsets, and at the same time each coordinate of the sum of a subset of mappings is between 0 and kℓ, so in total there are at most (kℓ + 1)^d different possible sums; hence it is enough to show that 2^k > (kℓ + 1)^d. As the ratio 2^k / (kℓ + 1)^d is increasing for k ≥ 2, it is enough to verify the inequality for k = 2d log(dℓ), that is:

  2^{2d·log(dℓ)} > (2dℓ·log(dℓ) + 1)^d,

which is equivalent (setting x = dℓ and taking d-th roots, as 2^{2d·log(dℓ)} = x^{2d}) to x² > 2x·log x + 1. This clearly holds for x > 3, so in particular for d > 3.
Lemma 3.7. Given a multiset F of mappings such that there is a derivation corresponding to it, there is a multiset F′ of mappings with a corresponding derivation such that ΣF = ΣF′ and F′ contains at most |N| + 2d log(dℓ) different mappings, where ℓ is an upper bound on the value of each mapping in F.
Proof. The idea is as follows: given a multiset F, we use Lemma 3.6 either to conclude that F uses few different mappings or to find F_1, F_2 ⊆ F such that both (F \ F_1) ∪ F_2 and (F \ F_2) ∪ F_1 correspond to a valid derivation and Σ((F \ F_1) ∪ F_2) = ΣF = Σ((F \ F_2) ∪ F_1); we then replace F with one of them. To guarantee that the replacement terminates, we introduce a natural well-founded order on multisets of mappings and show that (at least) one of (F \ F_1) ∪ F_2 and (F \ F_2) ∪ F_1 is strictly smaller in this order than F. This shows that the replacement terminates at some point, and so the resulting multiset uses few different mappings, which shows the claim of the Lemma.
Consider the set of mappings in F and linearly order them in some arbitrary way as f_1, f_2, …, f_m. We now treat the multisets of mappings as mappings themselves, i.e. F′(f_i) gives the number of times f_i occurs in F′. We introduce a linear well-founded order ≤ on the multisets of mappings: we first compare by the number of non-zero components (a smaller number makes the multiset smaller according to ≤), i.e. F′ < F′′ when |{i : F′(f_i) > 0}| < |{i : F′′(f_i) > 0}|; when the numbers of non-zero components are equal, we compare the multisets lexicographically by coordinates, i.e. by the sequences F′(f_1), F′(f_2), …, F′(f_m). By standard arguments this is a well-founded order (as the order on N is well-founded, and a lexicographic order in which the order on each component is well-founded is itself well-founded).
We arbitrarily choose a sub-multiset F_0 ⊆ F of mappings, with |F_0| ≤ |N|, which guarantees that the underlying graph is connected; this can be done as the underlying graph is connected, by Lemma 3.3, and has at most |N| vertices. The multiset F_0 will be added to the final multiset of mappings to guarantee that the underlying graph stays connected. Let F′ = F \ F_0 be the remaining multiset of mappings. If F′ has more than 2d log(dℓ) different mappings, then by Lemma 3.6 there are two disjoint sets F′_1 and F′_2 with the same sum; note that (F′ \ F′_1) ∪ F′_2 and (F′ \ F′_2) ∪ F′_1 are multisets. Neither of them can have more non-zero components than F′, as we only add mappings which are already in F′. Hence, if (F′ \ F′_1) ∪ F′_2 has a non-zero component, it is also non-zero in F′, and similarly for (F′ \ F′_2) ∪ F′_1; so F′ cannot be smaller than either of them on account of having fewer non-zero components. As F′_1 ≠ F′_2, consider the smallest i such that F′_1(f_i) ≠ F′_2(f_i); if F′_1(f_i) > F′_2(f_i) then (F′ \ F′_1) ∪ F′_2 < F′ (note that it could be that (F′ \ F′_1) ∪ F′_2 has fewer non-zero components than F′, in which case the above calculations are superfluous).
If F′_1(f_i) < F′_2(f_i) then similarly (F′ \ F′_2) ∪ F′_1 < F′. We can proceed in this manner as long as there are at least 2d log(dℓ) different mappings in F′. Since ≤ is well-founded, the process terminates. Let F′′ be the final multiset; it uses less than 2d log(dℓ) = 2(n + 2|N|) log((n + 2|N|)ℓ) different mappings. Taking F_0 ∪ F′′, where F_0 is the initially chosen multiset of at most |N| mappings guaranteeing connectedness, we get the desired multiset: the underlying graph is connected thanks to F_0, and Σ(F_0 ∪ F′′) = ΣF by an easy induction; hence condition (1) from Lemma 3.4 holds and so, by this Lemma, there is a derivation corresponding to F_0 ∪ F′′.

This is enough to prove Theorem 3.2: we can construct a formula over the theory T combined with quantifier-free Presburger arithmetic representing the conditions needed. We then existentially quantify all intermediate variables, leaving the free variables x_1, …, x_n counting the number of times each ψ_i is satisfied.
The formula first guesses the values of the parameters X of the grammar. Then it guesses |N| + 2d log(dℓ) mappings {0, 1, …, n} ∪ ({s, t} × N) → {0, 1, …, ℓ}, where ℓ is the maximal length of the rules of G (we can represent a mapping with n + 1 + 2|N| integer variables ranging over {0, 1, …, ℓ}). It also guesses the count of each mapping, yielding a multiset F. For each of those mappings it guesses the rule of the grammar and its instantiation. It then verifies the correctness of those guesses:
1. F satisfies (1).
2. The underlying graph is connected.
3. Each f ∈ F indeed corresponds to a guessed rule A → α and an instantiation α′.
If all tests are satisfied then the formula accepts, otherwise it rejects. Notice that the first condition is a quantifier-free Presburger condition. For the condition that each f corresponds to a rule and its instantiation we need, for each f, to guess an assignment to Y (we can use a fresh copy of Y for each of the |N| + 2d log(dℓ) mappings). Via a disjunction over all grammar rules, the formula guesses the corresponding rule and verifies that the number of symbols in α satisfying ψ_i matches f(i), and similarly for the nonterminal symbols (on both sides). To see how to do this with a polynomial-size formula, let α_1 ⋯ α_m be the right-hand side of the guessed rule. For each ψ_i, introduce a 0-1 variable c^i_j for each α_j. This variable is 0 if ψ_i(α_j) does not hold and 1 otherwise. Then we check f(i) = Σ_j c^i_j. For this, we only require Boolean combinations of the two theories. Each free variable x_i of the formula takes the value (ΣF)(i) for 1 ≤ i ≤ n.

If there is a satisfying assignment to the formula, giving a mapping f, then indeed there is w ∈ L_X(G) such that P_Ψ(w) equals f on the first n components: we guessed the values of the parameters, F corresponds to a derivation by Lemma 3.4, and the Parikh image of the derived word is indeed f = ΣF restricted to components 1, …, n.
In the other direction, if there is w ∈ L(G), then by definition w ∈ L_X(G) for some X. Fix a derivation of w and let F be the corresponding multiset of mappings; F satisfies the conditions from Lemma 3.4, so in particular it satisfies condition 2. By Lemma 3.7 we can assume without loss of generality that there are at most |N| + 2d log(dℓ) different mappings in F. From this we can construct assignments to the variables representing F and its counts that satisfy condition 1. Since each f ∈ F corresponds to a rule in a concrete derivation, condition 3 is satisfied as well.
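The polynomial-size counting trick for the rule-correspondence check can be sketched as a small formula generator. The variable names (`f_i`, `psi_i_of_alpha_j`) are placeholders of this sketch, with each `psi_i_of_alpha_j` standing for the T-formula ψ_i(α_j); the `(ite ... 1 0)` terms play the role of the 0-1 variables c^i_j.

```python
def count_constraint(i, m):
    # f_i must equal the number of right-hand-side symbols alpha_1..alpha_m
    # that satisfy psi_i; emitted in SMT-LIB-like concrete syntax
    bools = [f"(ite psi_{i}_of_alpha_{j} 1 0)" for j in range(1, m + 1)]
    return f"(= f_{i} (+ {' '.join(bools)}))"

print(count_constraint(2, 3))
# (= f_2 (+ (ite psi_2_of_alpha_1 1 0) (ite psi_2_of_alpha_2 1 0) (ite psi_2_of_alpha_3 1 0)))
```

Only Boolean combinations of the two theories appear: the `psi` atoms live in T, while the equality and the sum are quantifier-free Presburger.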

Complexity
To measure the complexity of Parikh images of parametric grammars, we consider the problem of deciding whether P_Ψ(L(G)) ∩ [[Φ]] ≠ ∅ for a given existential Presburger formula Φ over variables x_1, …, x_n. Here, [[Φ]] refers to the set of assignments f : {1, …, n} → N such that f(1), …, f(n) satisfies Φ. Because we are able to compute a polynomial-size representation of the Parikh image, the complexity remains similar to that of T.

Theorem 3.8. Let T be solvable in complexity class C. Then the above problem for Parikh images of parametric grammars over T is in ∃C. In particular, if T is solvable in P, NP, or PSpace, then the problem for Parikh images of parametric grammars is in NP, NP, or PSpace, respectively.
Proof. If there is such a w, fix a derivation of w and let F be the corresponding multiset of mappings; F satisfies (1) and Φ. For each different f ∈ F consider its count n_f in F. Consider conditions 1 and 3 and Φ. We claim that if they are satisfied, then they are satisfied by counts {n_f}_{f∈F} that are at most exponential. Given the set of different mappings in F, condition (1) and Φ can be interpreted as a system of linear equations on {n_f}_{f∈F}, while condition (3) requires a guess of the rule and its instantiation, which can be done separately. The check whether the underlying graph is connected can also be done separately, as it does not depend on {n_f}_{f∈F}.
The system of equations is over |N| + 2d log(dℓ) variables and has polynomial-size constants. By standard results, if it has a solution, it has one of at most exponential size; thus the integer variables have solutions encodable with a polynomial number of bits. If T is solvable in P or NP, then we can decide the problem in NP: guess the mappings, their counts, and the rules with their instantiations, all of polynomial size, and verify the conditions.

Abstraction of String Constraints
In this section we outline an application of the symbolic Parikh image abstraction to solving string constraints. String constraints arise naturally during symbolic execution of programs, where the string data type is ubiquitous. During symbolic execution, the potential paths of the program are explored and the constraints on the variables are collected. For example, the positive branch of an if-statement with the condition x = ab will result in the condition x = ab being added to the collection of constraints on the path. If an error state is discovered, a check is made whether the collection of constraints on the path to the state is satisfiable; that is, is there some assignment to the variables that would cause this path to be executed? Because many of these checks are made during the analysis, it is important that the string constraint solver is efficient.
In this setting, we consider constraints that may contain string-and integer-valued variables, Presburger arithmetic, and a number of common string operations such as concatenation and containment in a regular language.
To improve solver efficiency, we may use the Parikh image abstraction to identify unsatisfiable constraints and avoid a potentially costly proof search by the solver. The goal of the abstraction is to overapproximate the satisfiable instances. This means an unsatisfiable abstracted instance implies that the original instance was also unsatisfiable; that is, we allow false positives, but not false negatives. We may hope that the abstracted constraint, which no longer includes string variables but instead integer variables representing the Parikh images of the strings, can be solved more quickly than the unabstracted constraint.
In the next sections, we introduce the constraint language, describe our abstraction of the input constraints, then describe a modified representation of the Parikh image of a regular expression.

The Constraint Language
We focus on a subset of the QF_SLIA theory of SMT-LIB 2.6 [Barrett et al.(2017)], that is, string constraints with linear integer arithmetic. The constraint language is explained below. Because we focus on SMT-LIB, our constraint language only has constraints asserting that a string is in a regular language. Of course, we could easily extend this language to support containment checks in the language of a parametric context-free grammar.
Note that we assume all formulas are given in negation normal form. This is because, as explained below, abstracting negated equations can lead to false negatives. Our constraint language contains the following components.
• String-valued expressions. These can be string literals w, string variables x, the concatenation str.++ of two string-valued expressions s_1 and s_2, the result of replacing the first occurrence of s_1 in s with s_2, or the substring of s from position i_1 of length i_2, for integer expressions i_1 and i_2.
• Boolean-valued string expressions. These can be the test that s is (or is not) in the language of the regular expression r, that s_1 equals s_2, that s_1 contains the contiguous substring s_2, that s_1 is a prefix of s_2, or that s_1 is a suffix of s_2. We support the regular expressions supported by Z3, which are detailed below. Boolean expressions may appear in any positive Boolean combination.
• The integer-valued expression str.len(s) for a string expression s. Note that other integer-valued expressions (that do not include strings) supported by Z3 are also permitted; in particular, quantifier-free Presburger arithmetic (i.e. linear integer arithmetic).
• Regular expressions have a standard interpretation: a is a concrete character; re.range(a, a′) matches any character between a and a′ (inclusive); r_1 r_2 is concatenation; r_1 ∨ r_2, r_1 ∧ r_2, and ¬r are Boolean operations; r* is 0 or more consecutive matches of r; r+ is 1 or more consecutive matches of r; r? is 0 or 1 matches of r; r^{l,h} is between l and h matches of r, where l is a nonnegative integer and h is a nonnegative integer or infinity; ∅ matches no word; and Σ matches any character.

Abstraction of Input
Let Ψ := ψ_1, …, ψ_n be the predicates of the Parikh image. We assume that ψ_1 = ⊤, as this is convenient for encoding the length of a string. We take a vector view of the Parikh image mappings f: we denote f as a vector (f_1, …, f_n) where f_i = f(i). This means we can refer to a vector of variables or expressions that represent a Parikh image. Let ϕ_⊤ be the Parikh image of the regular expression Σ* that matches any string. We use ϕ_⊤ to assert that a vector of expressions c = (c_1, …, c_n) encodes the Parikh image of some string. E.g., for the predicates ⊤, ⊥, the counts (0, 1) are not a valid Parikh image and ϕ_⊤(0, 1) does not hold.
Our approach creates an overapproximating abstraction of the input SMT-LIB formula. Each string expression s is abstracted as a vector c^s where each component c^s_i is an expression counting the number of characters satisfying ψ_i in the value of s. Similarly, regular expressions are abstracted as a formula ϕ recognising the Parikh image of the expression.
This means all string variables x are abstracted with a vector x^Ψ = (x^Ψ_1, …, x^Ψ_n). The variable x^Ψ_i counts the number of occurrences of characters matching the predicate ψ_i in the value assigned to x. We assert ϕ_⊤(x^Ψ) for each abstracted string variable.
Each string expression s is abstracted recursively. The translation may introduce new variables, for which additional side conditions need to be asserted; the side conditions are collected as a side-effect of the translation. Pseudo-code is given in Algorithm 1 and explained below. The Assert function adds an assertion to the output formula.
The abstraction of a literal w directly counts the number of characters of w that satisfy each predicate. Since w is a concrete string, this is a straightforward character-by-character check against each predicate. For a string variable x we use the abstraction variables x^Ψ. The str.replace operation is the most subtle and reflects the semantics of str.replace in SMT-LIB. We introduce fresh variables y^Ψ to store the abstracted result of the replace. There are three possible outcomes. If s_2 does not appear in s_1, then s_1 is unchanged and we assert y^Ψ = c_1, where c_1 is the abstraction of s_1. If s_2 appears in s_1, then the first instance of s_2 is removed from s_1 and replaced with s_3; this is encoded by y^Ψ = c_1 − c_2 + c_3. The final case is when s_2 is the empty string: the SMT-LIB semantics is then that s_3 is prepended onto s_1, which is encoded by y^Ψ = c_1 + c_3. Notice that, since the values of y^Ψ are derived from c_1, c_2, and c_3, we do not need to assert ϕ_⊤, as it is already implied.
Finally, for str.substr we again introduce fresh variables y^Ψ. We ignore the integer arguments i_1 and i_2 and simply require that y^Ψ is (point-wise) bounded by c_1. We assert ϕ_⊤(y^Ψ) to ensure the values represent a string.
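The easy cases of the recursive abstraction (literals, variables, concatenation) can be sketched as follows. This is a simplified, hypothetical sketch, not the paper's Algorithm 1, which also handles str.replace and str.substr with fresh variables and side conditions; expressions are encoded as nested tuples, an assumption of this sketch.

```python
def abstract_sexp(s, predicates):
    # expressions: ("lit", w), ("var", x), ("concat", s1, s2);
    # the result is a vector of SMT-LIB-like terms, one per predicate
    kind = s[0]
    if kind == "lit":                  # concrete string: count directly
        return [str(sum(1 for c in s[1] if p(c))) for p in predicates]
    if kind == "var":                  # string variable: its abstraction variables x^Ψ
        return [f"{s[1]}_psi{i}" for i in range(len(predicates))]
    if kind == "concat":               # str.++ : component-wise sum
        c1 = abstract_sexp(s[1], predicates)
        c2 = abstract_sexp(s[2], predicates)
        return [f"(+ {a} {b})" for a, b in zip(c1, c2)]
    raise ValueError(f"unsupported expression kind: {kind}")

psi = [lambda c: True, lambda c: c.isdigit()]   # ψ1 = ⊤, ψ2 = "is a digit"
print(abstract_sexp(("concat", ("lit", "ab1"), ("var", "x")), psi))
# ['(+ 3 x_psi0)', '(+ 1 x_psi1)']
```

Concatenation needing only component-wise sums is precisely why the abstraction avoids reasoning about character positions.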
Algorithm 1: AbstractSEXP(s)

We can then abstract the Boolean expressions contained in the input. We first convert each assertion to negation normal form. This is because we can abstract, for example, x_1 = x_2 but not x_1 ≠ x_2. We then substitute maximal subexpressions according to the following scheme.
• ¬str.in_re(s, r) is replaced by ϕ_¬r(AbstractSEXP(s)), where ϕ_¬r is the Parikh image of the complement of the language accepted by r (discussed in the next section).
• str.in_re(s, r) is replaced by ϕ_r(AbstractSEXP(s)), where ϕ_r is the Parikh image of the language accepted by r.
• str.len(s) is replaced by c_1, where c is the result of AbstractSEXP(s). Recall that ψ_1 was assumed to be ⊤, so c_1 gives the length of the string.

Construction of the Parikh Image
The abstraction above uses ϕ_r, the symbolic Parikh image abstraction of the regular expression r. We describe how we encode ϕ_r as an SMT-LIB formula. Our encoding goes via symbolic automata and is slightly different from the proof of Theorem 3.8. The alternative encoding is driven by the concrete transitions of the automaton representing r, rather than a nondeterministically chosen subset of the states and transitions. This is because, when transitions are represented by variables, many clauses need to contain a disjunction over all possible instantiations of the variables, causing an undesirable polynomial blow-up. This new encoding may require 2sn log(n) character variables, where s is the number of transitions of the symbolic automaton. This is theoretically worse than the 2(n + 2|Q|) log(n + 2|Q|) characters needed in the proof of Theorem 3.8. In the next section we describe some mitigating optimisations. Given a regex r, we build an equivalent symbolic automaton A. We briefly recall the definition of symbolic automaton we use. It is equivalent to parametric context-free grammars in which all productions are of the form (A, α, ϕ(curr)) (i.e. no parameters) with α = curr B or α = ε.
For a given theory T, a symbolic automaton is a tuple (Q, Δ, q_0, F), where Q is a finite set of states, Δ ⊆ Q × T(curr) × Q is the transition relation, q_0 is the initial state, and F ⊆ Q is the set of final states. A run over a word a_1 ... a_ℓ is a sequence of transitions (q_0, ϕ_1, q_1)(q_1, ϕ_2, q_2) ... (q_{ℓ-1}, ϕ_ℓ, q_ℓ) where T ⊨ ϕ_i(a_i) for all i and q_ℓ ∈ F.
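The acceptance condition above can be sketched directly in Python. The predicates here are ordinary Python functions standing in for T(curr) formulas, and the [0-9]+ automaton is an illustrative example of our own, not one from the paper:

```python
# Minimal symbolic automaton (Q, Delta, q0, F); predicates are Python
# functions on single characters, standing in for T(curr) formulas.
class SymbolicAutomaton:
    def __init__(self, states, transitions, init, finals):
        self.states = states
        self.transitions = transitions  # list of (q, predicate, q')
        self.init = init
        self.finals = finals

    def accepts(self, word):
        # Track all states reachable by some run over the word read so far.
        current = {self.init}
        for ch in word:
            current = {q2 for (q1, p, q2) in self.transitions
                       if q1 in current and p(ch)}
        return bool(current & self.finals)

# An automaton for [0-9]+ : one predicate shared by both transitions,
# rather than ten concrete-letter edges per state.
digits = SymbolicAutomaton(
    states={0, 1},
    transitions=[(0, str.isdigit, 1), (1, str.isdigit, 1)],
    init=0,
    finals={1},
)
```

The single predicate str.isdigit labelling both transitions is exactly the succinctness symbolic automata offer: the classical automaton would need one edge per digit.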
Using predicates that are Boolean combinations of ϕ(curr) := curr = a and of ϕ(curr) := str.in_re(curr, re.range(a, a′)), standard constructions can be used to obtain a symbolic automaton equivalent to a regular expression as defined above. From A we build a formula ϕ(c_1, ..., c_n), where the free variable c_i indicates the number of characters satisfying predicate ψ_i.
Let Labels(A) be the set of predicates appearing on the transitions of A. We use p to denote these predicates to avoid confusion with the output predicates ψ_1, ..., ψ_n. For a predicate p ∈ Labels(A), let TransLabelled(p) be the set of transitions of A labelled by the predicate p. Let Labels(A) be p_1, ..., p_m. Using a corrected version of Verma et al. [Verma et al.(2005), Barner(2006)], we can build a linear-sized existential Presburger formula from A with free variables c_{p_1}, ..., c_{p_m}, where c_p indicates how many transitions labelled p appear in a run of A. That is, given a run ρ = t_1 ... t_ℓ of A and a transition t, let |ρ|_t be the number of occurrences of t in ρ. Then, for a predicate p, let |ρ|_p = Σ_{t ∈ TransLabelled(p)} |ρ|_t. If ϕ(c_{p_1}, ..., c_{p_m}) holds, then there is a run ρ of A with |ρ|_{p_i} = c_{p_i} for all i.
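The per-label counts |ρ|_p are simple to illustrate: given a run as a list of transitions, count how many transitions carry each label. This is a small stand-in sketch, with plain strings in place of predicates:

```python
from collections import Counter

# |rho|_p : the number of transitions in the run labelled p,
# i.e. the sum of |rho|_t over the transitions t in TransLabelled(p).
def label_counts(run):
    return Counter(label for (_q1, label, _q2) in run)

# A run using two labels (illustrative names, not from the paper):
run = [(0, "digit", 1), (1, "alpha", 1), (1, "digit", 1)]
counts = label_counts(run)
# counts["digit"] == 2 and counts["alpha"] == 1
```

The Presburger formula ϕ(c_{p_1}, ..., c_{p_m}) characterises exactly the count vectors that arise this way from some accepting run.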
We split the target counts c_1, ..., c_n of the output predicates ψ_1, ..., ψ_n between the counts contributed by each of the transition labels in Labels(A). To do this we introduce variables c^p_i indicating that the predicate p accounts for c^p_i characters satisfying ψ_i. That is, for all i, c_i = Σ_{p ∈ Labels(A)} c^p_i. We then check, for each p ∈ Labels(A), whether there exists a sequence of characters a_1, ..., a_{ℓ′} such that ℓ′ = c_p (recall c_p is the number of times p appeared as the label of a transition in the run). Moreover, for each output predicate ψ_i, we require that c^p_i is the number of characters a among a_1, ..., a_{ℓ′} such that ψ_i(a) holds.
By Lemma 3.6, the sequence a_1, ..., a_{ℓ′} needs at most 2n log(n) different characters. We introduce character variables a^p_i for 1 ≤ i ≤ 2n log(n) and character count variables k^p_i. That is, character a^p_i appears k^p_i times in a_1, ..., a_{ℓ′}. Additionally, we naturally require that each a^p_i satisfies p.
All together, we assert that
• the counts for each label (c_p) are in the Parikh image of the automaton (using ϕ),
• the total count for each output predicate (c_i) is the sum of the counts of the labels satisfying the predicates (using ϕ_sum), and
• for each label (using ϕ_labels):
  – the count for that label is spread across 2n log(n) characters (using ϕ_lcounts, recalling a^p_i appears k^p_i times),
  – each character satisfies the label (using ϕ_preds), and
  – the counts of the labels satisfying a predicate are correct (using ϕ_pcounts).
That is, recalling Labels(A) = p_1, ..., p_m, we define ϕ_r(c_1, ..., c_n) to first use the non-symbolic Parikh image ϕ(c_{p_1}, ..., c_{p_m}) to calculate how many times each transition label can occur. Then it uses ϕ_sum to assert that the total number of times ψ_i is satisfied is distributed across the transition labels. Finally, ϕ_labels bridges between the number of labels, the number of characters satisfying those labels, and the number of times each output predicate ψ_i is satisfied. That is,

ϕ_r(c_1, ..., c_n) := ∃ c_{p_1}, ..., c_{p_m}, (c^p_i)_{p,i}, (a^p_j)_{p,j}, (k^p_j)_{p,j} . ϕ(c_{p_1}, ..., c_{p_m}) ∧ ϕ_sum ∧ ⋀_{p ∈ Labels(A)} ϕ_labels(p)

where ϕ_sum := ⋀_{i=1}^{n} c_i = Σ_{p ∈ Labels(A)} c^p_i, and, for each label p, ϕ_labels(p) := ϕ_lcounts(p) ∧ ϕ_preds(p) ∧ ϕ_pcounts(p) asserts the correctness of the labels via

ϕ_lcounts(p) := c_p = Σ_{j=1}^{2n log(n)} k^p_j,
ϕ_preds(p) := ⋀_{j=1}^{2n log(n)} p(a^p_j),
ϕ_pcounts(p) := ⋀_{i=1}^{n} c^p_i = Σ_{j=1}^{2n log(n)} k^p_j · ψ_i^{0,1}(a^p_j),

i.e. the number of times ψ_i is satisfied by transitions labelled p is the sum of the 2n log(n) character counts that satisfy ψ_i, and ψ^{0,1}(a) is 1 when a satisfies ψ and 0 otherwise.

The Overapproximation
We put everything together to obtain an overapproximation of an SMT-LIB formula containing string and integer expressions. We abstract each string variable and expression as a sequence of integer variables, one for each output predicate. We replace string expressions with their abstracted equivalents, which may include Parikh images of symbolic automata. If the abstracted constraint is unsatisfiable, we conclude that the original constraint was also unsatisfiable, and avoid reasoning over the string data type. We describe our experiments in the next section.

Extensions
Here we discuss possible extensions of our constraint language and technique. First, we could additionally allow context-free constraints, not just regular expressions. These are already allowed by some string solvers (e.g. TRAU [Abdulla et al.(2018), Abdulla et al.(2017)]), but are not yet supported by SMT-LIB. The technique in this section easily extends to context-free constraints, since our general results in Section 3 concern symbolic context-free grammars. This can also be extended to symbolic pushdown automata, with the restriction that the number of push symbols in a transition is small. Second, using parametric grammars, we could support a currently "forward-looking" feature of the string theory inside SMT-LIB 2.6, namely the operator str.to_re, which converts a string (possibly with string variables) into a regular language. This results in a highly expressive language, which may capture word equations with Kleene stars. Existing solvers and benchmarks only handle the use cases of str.to_re in which the input contains only string constants. Using parameters, we may capture constraints of the form w ∈ x*, where w is a string variable and x is a "character variable" (i.e. a string variable of length 1). This can be expressed as follows in SMT-LIB 2.6:

(declare-fun x () String)
(declare-fun w () String)
(assert (str.in_re w (re.* (str.to_re x))))
(assert (= (str.len x) 1))

Third, we could also allow other effective boolean algebras, and consider instead sequence theories. Although such an extension is partly supported by leading SMT-solvers like Z3 [de Moura and Bjørner(2008)] and CVC5 [Barbosa et al.(2022)], there is as yet no standard logic and file format for sequence theories. In addition, the decidability of such theories has only been very recently studied [Jeż et al.(2023)], whereby the quantifier-free fragment consisting of sequence equational constraints (i.e. concatenation of sequence variables and constants) and regular constraints (as parametric symbolic automata) is shown to be reducible to the case of finite alphabets, but incurring an exponential blow-up in the alphabet size. An example of such a constraint over LIA is one whose solutions y, z ∈ Z* satisfy the equation yz = zy, where y (resp. z) is a sequence of numbers that are p modulo 6 (resp. p modulo 7), for some p ∈ Z. Our results allow us to also analyse such constraints by an approach similar to the one outlined above for string constraints, even when the sequence constraints are additionally extended with the other predicates that we permit for string constraints (e.g. length constraints, contains, etc.) and symbolic context-free grammars.

Implementation
We implemented our approach described in Section 4 in C++. We used the Z3 [de Moura and Bjørner(2008)] library to parse and represent SMT-LIB formulas. We support symbolic regular expressions by adapting the symbolic automata code and translations from the Z3 codebase. In the next sections we describe the optimisations we have implemented, the benchmarks used for testing, and finally our results and analysis.

Optimisations
We improve the performance of the tool with two optimisations. The first reduces the number of characters a^p_i required for each transition label; the second helps restrict the search space of Z3 when solving the final abstracted constraints.
In our optimisations, we do not exploit the interval representation of character sets. This means our optimisations apply to theories other than strings. In a dedicated string solver, it would be possible to use well-known optimisations for character intervals to produce smaller formulas.

Reducing the Characters per Transition Label
In the encoding above, each transition label requires 2n log(n) different character variables. However, consider the transition predicate p(curr) := (curr = a), which asserts that the character on the transition is the character 'a'. Clearly only one character can satisfy p, and 2n log(n) characters are not needed.
Similarly, if the only output predicate were ψ(curr) := ⊤, then all 2n log(n) characters would contribute the same vector (1). In this case also, only one character is required.
Using these observations, we approximate the number of characters that can satisfy p while having pairwise different profiles with respect to the output predicates they satisfy. To do this, we place the output predicates into "buckets". Initially, one may suppose that each predicate is in its own bucket: {ψ_1}, ..., {ψ_n}.
Supposing each character can either satisfy or not satisfy a predicate, a naive upper bound on the number of possible characters with different profiles is 2 × ⋯ × 2 = 2^n. However, suppose the first two output predicates ψ_1 and ψ_2 were such that there is no character a such that ψ_1(a) ∧ ψ_2(a) holds. That is, a character either satisfies ψ_1 or ψ_2, but never both. This gives three possibilities (a satisfies ψ_1, a satisfies ψ_2, or a satisfies neither) instead of the naive upper bound 2^2 = 4. In fact, this condition can be tightened: we only need that no character satisfying the transition predicate p can simultaneously satisfy ψ_1 and ψ_2.
We can extend this to multiple predicates. If ψ_1, ..., ψ_{n′} are mutually exclusive (i.e. any value a can satisfy at most one of the predicates), then there are n′ + 1 possibilities instead of 2^{n′}.
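The effect of mutual exclusivity on profile counts is easy to observe concretely. In this sketch, the profile of a character records which output predicates it satisfies; the lowercase/uppercase predicates are an illustrative mutually exclusive pair of our own choosing:

```python
# The profile of a character: the tuple of output predicates it satisfies.
def distinct_profiles(chars, predicates):
    return {tuple(psi(ch) for psi in predicates) for ch in chars}

# psi_1 and psi_2 are mutually exclusive: no character is both lowercase
# and uppercase, so the profile (True, True) can never occur.
psi1 = str.islower
psi2 = str.isupper
profiles = distinct_profiles("aX9", [psi1, psi2])
# three profiles: (True, False), (False, True), (False, False)
```

With n′ mutually exclusive predicates, the same argument caps the number of realisable profiles at n′ + 1 rather than 2^{n′}.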
Suppose we are able to group the output predicates into buckets B_1, ..., B_{n′}. The number of possible vectors with respect to the predicates in a bucket B is the size of the bucket, plus one if it is possible to simultaneously satisfy p and none of the predicates in the bucket. That is, for each bucket B_j, let e(B_j) be this number of possibilities. Our approximation of the upper bound on the number of characters for buckets B_1, ..., B_{n′} is then the product e(B_1) × ⋯ × e(B_{n′}). If this value is less than 2n log(n), we use it instead of 2n log(n) for the characters associated with the transition label p in the encoding above.
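This bound is a small product computation. The sketch below is a hedged reconstruction of the elided formula: e(B) is taken to be |B|, plus one when a character satisfying p can avoid every predicate in B, and the bound is the product over buckets:

```python
# Approximate upper bound on characters with pairwise different profiles.
# buckets:       list of buckets, each a list of output predicates
# can_avoid_all: per bucket, whether some character satisfying p
#                satisfies none of the bucket's predicates
def bucket_bound(buckets, can_avoid_all):
    bound = 1
    for bucket, avoid in zip(buckets, can_avoid_all):
        bound *= len(bucket) + (1 if avoid else 0)
    return bound

# Two buckets {psi1, psi2} and {psi3}; only the first can be avoided:
bound = bucket_bound([["psi1", "psi2"], ["psi3"]], [True, False])  # (2+1)*2... = 3
```

If `bound` is below 2n log(n), the encoding uses it instead for the label p.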
To compute the buckets B_1, ..., B_{n′} we use Algorithm 2, which is a simple greedy approach to allocating output predicates to buckets. Note that the "continue" keyword jumps to the next iteration of the for loop, so a new bucket is only created if a predicate overlaps with some predicate in every bucket computed so far.

Our second optimisation helps Z3 determine the satisfiability of an abstracted formula. Suppose for some p we have character variables a^p_1, ..., a^p_m. Suppose further that the solver has determined that the assignment a_1, ..., a_m cannot lead to a satisfying assignment. It is clear that any permutation of a_1, ..., a_m also cannot lead to a satisfying assignment. However, without sophisticated inference, the solver needs to repeat the proof for all permutations. We extend our encoding to eliminate permutations as far as possible.
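The greedy allocation of Algorithm 2 can be sketched as follows. This is a stand-in, not the paper's pseudo-code: predicates are character ranges, and the overlap test is a plain Python function where the implementation would use a satisfiability check.

```python
# Greedy bucket allocation: a predicate joins the first bucket in which it
# is mutually exclusive with every member; a new bucket is opened only if
# it overlaps with some predicate in every existing bucket.
def greedy_buckets(predicates, overlaps):
    buckets = []
    for psi in predicates:
        for bucket in buckets:
            if not any(overlaps(psi, other) for other in bucket):
                bucket.append(psi)  # exclusive with the whole bucket
                break               # plays the role of "continue" in Alg. 2
        else:
            buckets.append([psi])   # psi overlapped every bucket
    return buckets

# Illustrative interval predicates [lo, hi] over character codes:
ranges = [(0, 9), (10, 19), (5, 14)]

def overlaps(r1, r2):
    return r1[0] <= r2[1] and r2[0] <= r1[1]

buckets = greedy_buckets(ranges, overlaps)
# (0,9) opens a bucket; (10,19) is disjoint and joins it;
# (5,14) overlaps both, so it opens a second bucket.
```

Being greedy, the algorithm does not guarantee the smallest number of buckets, only a cheap approximation.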
First, we assume the characters have a linear order <, that is, we can assert a < a′. In our implementation we represent characters with integers, so such an ordering is readily available. This means we can add the following constraint to our formula to eliminate permutations: a^p_i < a^p_{i+1} for all 1 ≤ i < 2n log(n).
We can go a little further and also enforce that adjacent characters have different profiles with respect to the satisfaction of the output predicates. We enforce this with a constraint asserting, for each 1 ≤ i < 2n log(n), that ψ_j^{0,1}(a^p_i) ≠ ψ_j^{0,1}(a^p_{i+1}) for some output predicate ψ_j.
Notice that we could have enforced this constraint for each pair of characters a_i, a_{i′} with 1 ≤ i ≠ i′ ≤ 2n log(n). However, this would have required a much larger formula.

Experimental Results
Our implementation is written in C++ and uses Z3 4.12.1. Z3 is used to read SMT-LIB files, and its data structures are used to represent the formulas. Z3 is also used as the backend solver for the produced constraints. Our implementation of symbolic automata is a slightly adapted version of the internal Z3 code. We represent characters as Z3 integers, using their unsigned character codes. The implementation is available via its online repository [SymParikh Repository(2023)] and as an artifact with a disk image on Zenodo [SymParikh Artifact(2023)].
Our tool provides two methods for selecting the predicates to use in the Parikh image. In the default mode, the predicates are those appearing on the transitions of the symbolic automata constructed when parsing the input regular expressions. We only select those predicates of the form ϕ(curr) := curr = a or ϕ(curr) := str.in_re(curr, re.range(a, a′)). In the second mode, we additionally take predicates of the form ϕ(curr) := curr = a from the string literals appearing in the string equations. That is, if a_1 ... a_n appears as a string literal in a string equation, we introduce the predicates ϕ_i(curr) := curr = a_i for all 1 ≤ i ≤ n.
We describe the benchmarks used before giving the results.

Benchmark Sets
We used several benchmark sets from SMTCOMP 2022 [SMTCOMP2022(2022)], under the QF_SLIA category, that is, quantifier-free constraints using strings and linear integer arithmetic. We also generated a set of benchmarks from regular expressions with a 5-star rating on regexlib.com.
Because we intend our technique to complement existing solvers, we restricted our attention to "difficult" benchmarks: those that could not be solved by Z3 in less than 10 seconds.
• The Norn benchmarks were introduced for the Norn tool [Abdulla et al.(2014)] and consist of concatenations of string literals and variables tested for membership (and nonmembership) of regular expressions.
• The Kepler benchmarks were introduced for the Kepler tool [Pham et al.(2018)] and consist of quadratic word equations.That is, equality tests between two concatenations of string literals and variables, with each variable appearing at most twice.
• The WordEQ benchmarks were randomly generated by us for this paper from regular expressions taken from regexlib.com.This website collects user-submitted regular expressions for tasks such as email recognition, currency values, and others.We took regular expressions with a 5-star rating to avoid spam submissions.
The generated benchmarks were designed to test conjunctions of membership queries between overlapping regular expressions. We generated 100 benchmarks combining membership constraints str.in_re built from r_1, ..., r_n and r′_1, ..., r′_m with a word equation between s_1 and s_2, where 1 ≤ n, m ≤ 3 are randomly chosen integers, r_1, ..., r_n, r′_1, ..., r′_m were randomly selected from the regular expressions obtained as above, and s_1 and s_2 are each concatenations of three variables picked randomly (possibly with duplicates) from {x_1, x_2, x_3} such that each variable appears at least once in s_1 or s_2 (and possibly in both).

Regarding our adaptation of the Z3 symbolic automata code: we changed the representation to use transition sets instead of vectors, to avoid transition duplication. We also used sets instead of vectors during minterm calculation when complementing automata, to avoid multiple copies of the same predicate. Other minor changes include using Z3's push/pop feature instead of reset, and providing some extra convenience functions.
In addition to the above sets, we also considered other QF_SLIA benchmarks submitted to SMTCOMP 2022. These results are not included, as the remaining benchmarks either contained unsupported features (such as string-to-integer functions or character index functions) or were solved within 10 s by Z3.

Comparison Solvers
We compare our tool with three state-of-the-art string solvers: Z3 (4.12.1) [de Moura and Bjørner(2008)], CVC5 (1.0.5) [Barbosa et al.(2022)], and OSTRICH [Chen et al.(2019), Chen et al.(2022), Chen et al.(2020), UUVerifiers(2023)]. Z3 is a well-known SMT solver developed at Microsoft. CVC5 performs strongly in SMTCOMP competitions. Because the performance of Z3 and CVC5 can sometimes be similar, we also compare with two variants of the OSTRICH tool, which uses an automaton-based approach and often out-performed Z3 and CVC5 on unsat instances in SMTCOMP 2022. The CEA variant of OSTRICH uses cost-register automata and also makes use of Parikh images [Chen et al.(2020)]. It was run with the parameters +parikh and -profile=strings. Both variants were taken from the Cea-new branch of OSTRICH, commit ce855e26 [UUVerifiers(2023)].

Results
Our experiments were performed on a Lenovo X380 Yoga ThinkPad with 8 GB RAM and 8 Intel® i7-8550U 1.8 GHz CPUs, running Arch Linux (kernel 6.4.1). We used the default method for generating predicates for the Parikh image for all benchmarks except Kepler. For Kepler, we extracted predicates from the string literals, as the benchmarks do not contain regular expressions. When running the tools on the "difficult" benchmarks, we set the timeout to 30 s.
Our tool over-approximates the true satisfiability of the input string equations and may return false positives. Hence we are interested in the number of unsatisfiable instances, and in those reported incorrectly as sat by our tool. For each benchmark set we consider the following.
• How many of the total number of benchmarks were "difficult"?
• How many of the "difficult" benchmarks were unsatisfiable instances?
• How many of the unsatisfiable instances were identified as false positives by our tool (i.e. our tool returned "sat")?
• The runtime of our solver, compared with other competitive solvers.

The results for the final bullet point are presented in Figure 1, where our tool is labelled sym_parikh. These graphs show the cumulative number of benchmarks solved in the given time. The remaining data is in Table 1. We note that Z3 did not solve any of the selected benchmarks within the timeout. However, we expect this is due to the bias in benchmark selection: we chose "difficult" benchmarks, where the Z3 solver was used to determine difficulty. Hence, only benchmarks that Z3 found difficult were included.

Analysis
Of 515 difficult instances, the majority, 357 (69 %), were found to be unsatisfiable. Of these, 219 (61 %) could be proved unsatisfiable using the Parikh image abstraction. Thus, our approach gives useful results in 42 % of the considered cases. It can be seen in Figure 1 that performance was relatively robust on our benchmarks when compared with the exact analysis of the comparison tools, with all queries answered within 1 s. This shows some potential for the use of the approach in the optimisation of string solvers. However, performance can be seen to vary between the benchmark sets. Since the summary results can be affected by the number of available benchmarks in each set, we discuss each set individually below. This will allow us to gain some intuition on where the approach may be best applied, and where it may be less useful.
Our tool performed well on the Norn benchmarks, solving all instances almost immediately.All difficult benchmarks were unsatisfiable instances.Our tool reported 9 false positives.These results are promising and indicate that the Parikh image abstraction may prove useful in quickly filtering unsatisfiable string constraints with regular expression containment checks.
The performance on the Kepler benchmarks was more mixed.These benchmarks proved difficult for most solvers, with only CVC5 and our tool able to return a large number of answers within the timeout period.CVC5 solved fewer instances than our tool, but did so more quickly.We note that our tool solved almost all instances within 0.5 s.However, out of 195 unsatisfiable instances, our tool reported 129 false positives.We conjecture that this high false positive rate can be explained by the nature of the word equations.The values taken by the string variables were not limited by regular expression containment checks.This provides a lot of freedom for variables to take on values that equalise the Parikh images of both sides of the word equations, especially in cases where a variable only appears on one side of the equation.For example, in ax = xy, the variable y can contain the required a character.Unsatisfiability of Parikh image equality can require rarer inconsistencies.For example xy = yax will always require one more a character on the right hand side than the left.
Finally, our tool performed well on the difficult WordEQ instances. Of the 38 that were difficult for Z3, 35 were unsatisfiable instances. Our tool was able to determine the correct result quickly in all cases. This shows that Parikh image abstractions may prove useful for examples containing complex interactions between overlapping regular expressions.

Conclusion
We have investigated Parikh images of languages over symbolic and parametric alphabets. In such a setting, the large, or even infinite, alphabet makes naive use of Parikh images impractical. Instead, our parametric version of the Parikh image is relative to a sequence of predicates Ψ = ψ_1, ..., ψ_n and counts the number of times each predicate is satisfied by a character in the word.
The fact that Parikh images over classical context-free grammars can be computed by a linear-sized existential Presburger formula is a key ingredient in several verification applications.We introduce a parametric version of context-free grammars and an equivalent pushdown model.
Because the alphabet is large and multiple predicates can be satisfied simultaneously, one may expect an exponential blow-up over the classical results.Surprisingly, this turns out not to be the case.We can represent the Parikh image of a parametric context-free grammar with a polynomially-sized existential formula, and the complexity of related decision problems remains the same.
We presented an application of our results to overapproximate satisfiability of string constraints and provided an implementation based on Z3.Our experimental results showed that constraints that are difficult for existing solvers can be solved quickly using our abstraction.
Future work.These initial results suggest several avenues of future work.We first discuss limitations of our implementation.Firstly, our implementation makes a naive selection of predicates Ψ over which to compute the Parikh image.Improved predicate selection algorithms may balance the need for insightful information about the constraints being analysed, and the need to keep the number of variables small to allow constraints to be solved quickly.One may also investigate how existing solvers can deploy these techniques from within the solver, rather than as a one-off preprocessing step that analyses the whole formula at once.Secondly, our prototypical application to string constraint solving does not exploit the full potential of the results.For example, SMT-LIB does not currently support context-free constraints (as supported by some solvers like TRAU [Abdulla et al.(2018), Abdulla et al.(2017)]) and sequence theories over any effective boolean algebra [Jeż et al.(2023)], and we have remarked that our results admit an easy extension to these.As an example, we may use parametric context-free languages to analyse streams of XML data, where the set of possible tags is infinite and should respect a nested structure.
On the theory side, it is still an open problem whether our results can be extended to other classes of recognisers over infinite alphabets, e.g. [D'Antoni et al.(2019), Brunet and Silva(2019), Moerman et al.(2017), Figueira and Lin(2022)]. In particular, we mention recent results on Parikh images of subclasses of nominal automata [Hofman et al.(2021)] and variants of data automata [Figueira and Lin(2022)], which provide a more precise Parikh abstraction and thus require a higher computational complexity (e.g. the double-exponential time algorithms of [Figueira and Lin(2022)]). Secondly, in the light of the polynomial-time complexity result [Kopczynski and To(2010)] on reasoning about Parikh images of NFA with fixed alphabet size k, one could study Parikh images of parametric automata with a fixed number of predicates (for certain alphabets like ASCII, this number might be as small as 10 [Moseley et al.(2023), D'Antoni and Veanes(2021)]). Here, a simple application of the result in [Kopczynski and To(2010)] yields a polynomial-time complexity for any fixed k, but the actual complexity would be double exponential in k. Is it possible to lower this to a single exponential in k?
Finally, one could investigate further potential applications of our results. For example, as explained in [D'Antoni and Veanes(2021)], model checking is typically done over Kripke structures over atomic propositions P_1, ..., P_n. This gives rise to exponential-sized alphabets as well. Parikh's Theorem for symbolic automata could potentially be used for model checking temporal logics with additional predicate-counting abstractions. Similar applications for the case of finite alphabets have been discussed in [Hague and Lin(2011), Laroussinie et al.(2012), Laroussinie et al.(2010)]. To avoid potentially large automata, one could also consider restrictions of temporal logics (e.g. LTL with only future/global operators [Benedikt et al.(2013)]).
Research Council under European Union's Horizon research and innovation programme (grant agreement no 101089343).

Errata
There is an error in the optimisation proposed in Section 5.1.2 and published in POPL 2024. It may happen, for example, that a transition label permits only one possible value. It is then not possible to choose values for the a^p_i such that a^p_i < a^p_{i+1}. This may result in some satisfiable instances being reported unsatisfiable. To correct the optimisation, we apply it only to characters a^p_i such that k^p_i > 0. That is, we require that a character differs from its predecessor, or that it is not used at all. If a character is not used at all, we can assert that it is the same as its predecessor to further restrict the search space. We can further assert that if a character is not used, then none of its successors are used either.
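The corrected, guarded constraints can be sketched as a small SMT-LIB assertion generator. This is a hedged illustration of the errata's fix, not the tool's code: the variable names a_&lt;label&gt;_&lt;i&gt; and k_&lt;label&gt;_&lt;i&gt; are our own naming scheme, standing in for a^p_i and k^p_i.

```python
# Generate the corrected symmetry-breaking assertions for one transition
# label: a used character (k > 0) must exceed its predecessor, an unused
# one must equal it, and once a character is unused all successors are too.
def corrected_ordering_assertions(label, count):
    asserts = []
    for i in range(1, count):
        a_prev, a_cur = f"a_{label}_{i-1}", f"a_{label}_{i}"
        k_prev, k_cur = f"k_{label}_{i-1}", f"k_{label}_{i}"
        asserts.append(
            f"(assert (ite (> {k_cur} 0) "
            f"(< {a_prev} {a_cur}) (= {a_prev} {a_cur})))")
        asserts.append(f"(assert (=> (= {k_prev} 0) (= {k_cur} 0)))")
    return asserts

out = corrected_ordering_assertions("p", 3)
# four assertions for the three characters a_p_0, a_p_1, a_p_2
```

Unlike the original strict chain a^p_i &lt; a^p_{i+1}, this remains satisfiable when a label admits only a single value, since unused characters may repeat their predecessor.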
The two optimising formulas become the versions guarded by k^p_i > 0 described above. The rerun experiments were conducted on the same Lenovo X380 Yoga ThinkPad with 8 GB RAM and 8 Intel® i7-8550U 1.8 GHz CPUs, running Arch Linux, updated to kernel 6.7.3. We compared with Z3 version 4.12.5 (updated from 4.12.1) and CVC5 1.1.1 (updated from 1.0.5). The same versions of OSTRICH and OSTRICH-CEA were used.
The change to the Z3 version means there are slight differences in how many benchmarks were considered "hard". In particular, 310 Kepler benchmarks were considered hard, whereas 350 were difficult for Z3 previously. Our estimation of the number of unsatisfiable instances has also changed in some cases. These changes concern benchmarks where our tool was the only tool returning a result, which was erroneously unsat instead of sat due to the over-aggressive optimisation.
The largest difference in the results is for the Norn benchmark set.In this case, the number of false positives reported by our tool increased from 9 out of 127 to 110 out of 126 benchmarks.The other two benchmark sets showed only slight changes to the overall picture of the results.
Overall, there are now 474 difficult instances, with 313 being unsatisfiable (66 %). Of these unsatisfiable benchmarks, our tool detected unsatisfiability in 72 cases (23 %). Thus, our approach gives useful results in 15 % of the considered cases. This is a drop from the 42 % reported previously.
However, we observe that runtimes remain good for all benchmarks, meaning that our technique can quickly be used as an unsatisfiability check when traditional techniques are taking a long time. Thus, the approach remains a viable approximate filter, particularly for benchmarks of the shape explored by the WordEQ benchmarks.

Definition 3.1 (Parametric Parikh Image). For a word w = a_1 ... a_k ∈ D* and a T-formula ψ(curr) over one local variable curr, the count |w|_ψ is the number of positions i of w such that T ⊨ ψ(a_i). For a sequence Ψ := ψ_1, ..., ψ_n of T-formulas, the Parikh image P_Ψ(w) of w over Ψ is the mapping f : {1, ..., n} → N with f(i) = |w|_{ψ_i} for all i. For a parametric grammar G over T, the Parikh image of G over Ψ is P_Ψ(L(G)) := {P_Ψ(w) : w ∈ L(G)}.
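Definition 3.1 can be illustrated directly in Python; the predicates are stand-in Python functions, and the example word and predicate sequence are our own:

```python
# P_Psi(w): maps i to |w|_{psi_i}, the number of positions of w whose
# character satisfies psi_i. Predicates may overlap, so the counts can
# sum to more than len(w), unlike the classical letter-counting image.
def parikh_image(word, predicates):
    return {i: sum(1 for ch in word if psi(ch))
            for i, psi in enumerate(predicates, start=1)}

# psi_1 = ⊤ (counts the length), psi_2 = digit, psi_3 = letter:
image = parikh_image("ab12", [lambda ch: True, str.isdigit, str.isalpha])
# image == {1: 4, 2: 2, 3: 2}
```

Here ψ_1 = ⊤ recovers the length of the word, matching the convention used throughout the abstraction.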

Figure 1: The number of instances solved in the given time across the benchmark sets. The line markings (shapes) are only to distinguish lines without colours, and are not individual data points. Our tool is labelled sym_parikh.
With the new encoding, the experimental results change. Updated runtimes are given in Figure 2 and summary data is given in Table 2.

Figure 2: The number of instances solved in the given time across the benchmark sets. The line markings (shapes) are only to distinguish lines without colours, and are not individual data points. Our tool is labelled sym_parikh.

Table 1: Summary data for benchmark sets. For each benchmark set, the table shows the number of instances, how many of the instances are "difficult", the number of difficult benchmarks that are unsatisfiable, and the number of those unsatisfiable instances identified as satisfiable by our approximation.

Table 2: Summary data for benchmark sets. For each benchmark set, the table shows the number of instances, how many of the instances are "difficult", the number of difficult benchmarks that are unsatisfiable, and the number of those unsatisfiable instances identified as satisfiable by our approximation.