A Constraint Solving Approach to Parikh Images of Regular Languages

A common problem in string constraint solvers is computing the Parikh image, a linear arithmetic formula that describes all possible combinations of character counts in strings of a given language. Automata-based string solvers frequently need to compute the Parikh image of products (or intersections) of finite-state automata, in particular when solving string constraints that also involve the integer data-type through operations like string length and indexing. In this context, the computation of Parikh images often turns out to be both prohibitively slow and memory-intensive. This paper contributes a new understanding of how reasoning about Parikh images can be cast as a constraint solving problem, and how questions about Parikh images can be answered without explicitly computing the product automaton or the exact Parikh image. The paper shows how this formulation can be efficiently implemented as a calculus, PC*, embedded in an automated theorem prover supporting Presburger logic. The resulting standalone tool Catra is evaluated on constraints produced by the Ostrich+ string solver when solving standard string constraint benchmarks involving integer operations. The experiments show that PC* strictly outperforms the standard approach by Verma et al. for extracting Parikh images from finite-state automata, outperforms the over-approximating method recently described by Janků and Turoňová by a wide margin, and, for realistic timeouts (under 60 s), also outperforms the nuXmv model checker. When added as the Parikh image backend of Ostrich+ to the Ostrich string constraint solver's portfolio, it boosts Ostrich's results on the quantifier-free strings with linear integer arithmetic track (QF_SLIA) of SMT-COMP 2023 enough to solve the most Unsat instances of all competitors in that track.


INTRODUCTION
Extending automated theorem provers and SMT solvers with support for rich string constraints is important for program analysis, particularly to detect cross-site scripting vulnerabilities and other string manipulation bugs [Bultan et al. 2017]. To check the satisfiability of a formula like x ∈ L1 ∧ y ∈ L2 ∧ |x| > |y|, with string variables x, y and regular languages L1, L2, it is necessary to reason about the possible lengths of x, y that are admitted by L1, L2, respectively. The set of word lengths {|w| | w ∈ L} in a language L is a special case of the Parikh image of a regular language. This required combined reasoning about strings and string length has long been identified as a major bottleneck in string solvers [Abdulla et al. 2015; Berzish et al. 2017, 2021; Janků and Turoňová 2020]. Other string solvers make use of Parikh automata [Klaedtke and Rueß 2002], and thus Parikh images in the general case, to handle operations that combine strings and integers (including str.substr and str.at), which comes at an even higher price in terms of computational complexity [Chen et al. 2020].
The Parikh image, more broadly, is a characterisation of formal languages in terms of their character counts. Given a language L over an alphabet {a1, ..., ak}, the Parikh image is a set of k-dimensional vectors that contains a vector [n1, ..., nk] if and only if the language L contains a word in which each ai occurs ni times. It is a classical result that the Parikh image of every context-free language (and, thus, also of every regular language) is a semilinear set, i.e., Presburger-definable [Parikh 1966]. In fact, it is possible to compute an existential Presburger arithmetic formula describing the Parikh image of any context-free language in linear time [Verma et al. 2005] in the size of the grammar describing the language. For the special case of regular languages, this result was also stated in [Seidl et al. 2004].
In applications, it is often necessary to consider the Parikh image not only of a single regular language, but of the intersection of multiple languages. This happens in string solvers in particular, as conjunctions of string constraints lead to the computation of length images of intersections of regular languages represented as finite-state automata. On the other hand, we are often not interested in a closed-form description of the complete Parikh image, but rather in checking whether the Parikh image contains vectors satisfying some given properties. The main problem considered in this paper is the following: given regular languages L1, ..., Lk and a Presburger arithmetic constraint over character counts, decide whether the Parikh image of L1 ∩ ... ∩ Lk contains a vector satisfying the constraint.
The problem can be solved using the classical construction of Parikh images [Seidl et al. 2004; Verma et al. 2005] by first computing the product of finite-state automata accepting the languages L1, ..., Lk, then extracting a Presburger arithmetic formula describing the Parikh image of the product automaton, and finally checking the satisfiability of this formula conjoined with the given side constraint.
While theoretically elegant, this construction has several disadvantages that easily turn into bottlenecks when reasoning about Parikh images in algorithms or applications.
Firstly, computing the product automaton is often prohibitively expensive in both memory and CPU time. In several instances we have observed while solving real-world string constraints, the computation of the product of automata exhausts the memory of any available machine due to the exponential blow-up in the size of the product, quickly becoming intractable as the number of automata in the product increases.
Secondly, the constructed Presburger arithmetic formula contains a number of existential quantifiers that is linear in the size of the product automaton, as well as complex Boolean structure, which is needed to express the connectedness of the paths considered in the construction. Solving formulas of this kind generally tends to be taxing for solvers [Chen et al. 2020]; in the present case, since the product automaton itself is exponentially big, the formula also has exponential size in the number of considered regular languages, and can easily become too complex for today's Presburger arithmetic solvers to handle.
To address these bottlenecks, our approach is designed to:
• Enforce automata connectivity constraints lazily;
• Use propagation of constraints on the Parikh image to prune transitions from the automata that are incompatible with the constraints;
• Intelligently branch on the presence or absence of key transitions to drive propagation of the connectivity constraint;
• Compute products of automata lazily, after pruning transitions that violate constraints using propagation.
We implement PC* as a plug-in theory for the Princess automated theorem prover [Rümmer 2008], and additionally wrap the Parikh image solver as a stand-alone tool, Catra. Catra also supports a variant of the approximate method of [Janků and Turoňová 2020], its fall-back variant adapted from [Verma et al. 2005], and an adapter for the nuXmv model checker [Cavada et al. 2014]. Using Catra, we compare PC* to the other back-ends on 37 497 distinct Parikh automata intersection problems generated by Ostrich+ when solving the PyEx string constraint benchmark suite involving string length constraints [Reynolds et al. 2017], finding that PC* outperforms both nuXmv and the baseline method: PC* solves, in under 5 s, every problem that the baseline solves within 30 s. PC* in particular outperforms nuXmv on unsatisfiable instances, more than doubling the number of unsatisfiable results found within a 30 s timeout.
We also extend the Ostrich string solver [Chen et al. 2019] with a new back-end based on Catra as the Parikh image intersection solver of the Ostrich+ algorithm, obtaining improvements in both Sat and Unsat performance in Ostrich's results from SMT-COMP 2023. Crucially, this allows Ostrich to win the unsatisfiable category of the track of quantifier-free string constraints with linear integer arithmetic (QF_SLIA), while also increasing the number of Sat results by 3% on the same track.
In summary, we contribute:
• The PC* calculus to efficiently check the satisfiability of the Parikh image of an intersection of regular languages modulo Presburger arithmetic side conditions.
• Techniques to efficiently implement PC* in a modern automated theorem prover, including strategies for case splitting, clause learning, and constraint propagation for connectedness.
• The Catra tool for solving such instances, containing an implementation of PC*, the over-approximation described in [Janků and Turoňová 2020], and an adapter for the nuXmv model checker [Cavada et al. 2014].
• Experiments illustrating the performance of PC* on real-world examples from string solving, including 37 497 instances in a standardised format made available for future study.

Related Work
The problem of satisfying Parikh images over products of regular languages, modulo Presburger arithmetic side conditions, amounts to checking emptiness of products of Parikh automata. Parikh automata are regular automata extended with integer counters with given increments and decrements for each transition, where we allow checking a set of linear constraints on the final values of the counters (but not their intermediate values) [Klaedtke and Rueß 2002]. Parikh automata without constraints on the final values of their registers are also sometimes called cost-enriched automata, weighted automata, or counter automata, depending on exact definitions and side constraints. The decision problem tackled in this paper, determining the emptiness of an intersection of Parikh automata, was shown to be PSPACE-complete [Figueira and Libkin 2015]. Parikh image computations, as well as Parikh automata, feature extensively in string solvers, including, as mentioned above, Ostrich and Ostrich+ [Chen et al. 2019, 2020], but they also form the basis of Trau [Abdulla et al. 2017] and occur in Sloth [Holík et al. 2017]. Parikh images frequently appear when introducing cardinality constraints like length or string indexing. While relying on Parikh images (or on being able to check the emptiness of Parikh images under given side constraints), the mentioned papers do not propose any techniques to compute Parikh images. Seidl et al. and Verma et al. define a closed-form description of the Parikh image of any regular language as an existential Presburger arithmetic formula. An approach to optimise the construction is to over-approximate the Parikh image of a product of automata, Π(L(A1) ∩ ... ∩ L(Ak)), with the conjunction of the individual Parikh images, ⋀(i=1..k) Π(L(Ai)) [Janků and Turoňová 2020]. Due to the over-approximation, this approach is primarily useful for unsatisfiable instances, and requires falling back to computing the product of the automata before using the standard approach for finding its image, originally presented in [Verma et al. 2005]. Our calculus PC*, in a somewhat similar but more fine-grained manner, utilises laziness to postpone or avoid the most expensive steps in the computation of Parikh images of the intersection of regular languages.
Our calculus PC* is also similar in spirit to the work of Stanford et al., who tackle the exponential blow-up resulting from Boolean combinations of finite-state automata in SMT string solvers through the use of symbolic derivatives [Stanford et al. 2021]. The research by Stanford et al. does not consider Parikh images, however. In addition to presenting a decision procedure for lazily dispatching constraints, we similarly also allow for symbolic labelling of automata to handle large alphabets.
Beyond the field of string solving, Parikh image computation is used as an elementary building block in a variety of areas. For instance, Parikh automata have been proposed as the basis of queries in graph databases [Figueira and Libkin 2015]; Parikh images are used for handling cardinalities in parameterised model checking for epistemic logic [Stan and Lin 2021], and for handling summation constraints in expressive array logics [Raya 2023]. Our calculus PC*, and the stand-alone solver Catra, are potentially useful in all such applications.
Other generalisations of the Parikh image than the projections we use here have been studied. Prominent examples include generalising the Parikh map to segments of a fixed length [Karhumäki 1980] and the more general Parikh matrix, which contains not only the Parikh vector, but also information about the order of letters. Another notable generalisation is the p-vector, introduced in [Siromoney and Rajkumar Dare 1985], which denotes the position of each letter in the word rather than the number of its occurrences, and allows for generalisations to infinite alphabets. All of these in some sense extend the Parikh map. By contrast, the main utility of the formulation introduced here is to reason about Parikh images lazily, thereby potentially obtaining answers more quickly. We expect that our calculus PC* can be generalised to the mentioned functions on formal languages as well, but leave such investigations to future work.

AN INTUITION FOR OUR APPROACH
In this section, we give an intuition for how a string constraint problem in an automata-based solver like Ostrich+ is translated to a Parikh automata intersection problem, and solved using PC*.
We use PCRE regular expression notation here and throughout the paper, written like this in typewriter font. This means that | is alternation, * the Kleene star, and . matches any single character. For the length of a string w, we write |w|. All number variables in this example range over N, including 0.
that is, there is at least one character in x2 before and after x1.
Although the constraints are simple, it should be noted that decidability results for string constraints involving integers are notoriously hard to obtain; many state-of-the-art string solvers, for instance Z3 [de Moura and Bjørner 2008], different versions of Z3-str [Mora et al. 2021; Zheng et al. 2013], or cvc5 [Barbosa et al. 2022], are not guaranteed to be complete or to terminate on such constraints (but, of course, they often achieve very good performance in practice).
A decision procedure for a fragment of string constraints with integers, covering our constraints, is defined in [Chen et al. 2020]. We will first perform the translation into Parikh automata, then break down how the constraints are handled by PC*. This example will recur in a more formal fashion in Section 4, which can be read in parallel if (even) more detail is desired.

Translating the Constraints into a Parikh Automata Intersection Problem
By a Parikh automaton we mean a standard non-deterministic finite automaton (NFA) with a set of integer registers (Parikh registers) that are incremented (or decremented) at each transition. A formal definition can be found in Definition 4.1.
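As a concrete (if simplified) sketch, such an automaton can be modelled as an NFA whose transitions carry increment vectors; the registers of an accepting run are the sums of the increments along it. The class and method names below are illustrative and not taken from Catra:

```python
from dataclasses import dataclass

# A minimal sketch of a Parikh automaton: an NFA whose transitions
# additionally carry an increment vector for the Parikh registers.
@dataclass(frozen=True)
class Transition:
    src: str
    label: str
    dst: str
    increments: tuple  # one integer per Parikh register

@dataclass
class ParikhAutomaton:
    initial: str
    accepting: frozenset
    transitions: list

    def runs(self, word):
        """Return the final register values of every accepting run on `word`."""
        k = len(self.transitions[0].increments)
        frontier = [(self.initial, (0,) * k)]
        for ch in word:
            frontier = [
                (t.dst, tuple(r + d for r, d in zip(regs, t.increments)))
                for (state, regs) in frontier
                for t in self.transitions
                if t.src == state and t.label == ch
            ]
        return [regs for (state, regs) in frontier if state in self.accepting]

# Automaton counting word length in a single register: every transition adds 1.
aut = ParikhAutomaton(
    initial="q0",
    accepting=frozenset({"q0"}),
    transitions=[Transition("q0", "a", "q0", (1,)), Transition("q0", "b", "q0", (1,))],
)
print(aut.runs("abba"))  # [(4,)]
```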
The automaton A′, whose Parikh registers count the length of the string accepted by the automaton (register l), the start offset of the substring (i), and the length (n) of the substring matching c|dd. Note the symbolic transitions in the starting and accepting states matching any symbol in the alphabet!
The automaton B, where the Parikh register l counts the length of the string. To solve Example 2.1 using PC* in a theorem proving context, Constraint (i) to Constraint (iii) can be translated to the product of Parikh automata B × A′. Automaton B is given in Fig. 1b. Automaton A′ (Fig. 1a) is an automaton defining the pre-image of L(c|dd) under the substring function, obtained by applying the construction described in [Chen et al. 2020]. Intuitively, A′ describes all tuples (x2, i, n) such that substring(x2, i, n) is in L(c|dd). To this end, A′ contains registers l, i, n, which count the length of the overall string read, the length of the prefix before the extracted substring, and the length of the substring, respectively. Registers are assumed to start from zero in each automaton run.
In the automata of Fig. 1, we use the notation σ/r+ to mean that a transition reads an input character σ and increments register r by one. We will omit zero-valued increments and assume that all registers are scoped to their automaton. The increments are usually represented as a vector (hence the brackets), but as the vector is mostly sparse here, we use the symbolic notation r+ rather than the more cumbersome [0, 1, 0, ..., 0].
Note that the labels of transitions can be ranges of characters, with Σ representing any character in the alphabet.
Intuitively, Example 2.1 is unsatisfiable. By inspecting the automata, we realise that the path using dd in A′ is unusable, since there are no d-labelled transitions in B. This means that x1 = c, and thus n = 1. Constraint (iv) then implies that i = 0. However, no path through B passes by a c without a preceding character. Therefore, already Constraint (i), Constraint (iii) and Constraint (iv) together are unsatisfiable. An eager approach to finding the Parikh image, as described in [Verma et al. 2005], would start by computing the product A′ × B, translating it to a Presburger formula with l, i, n, etc., as free variables, and then adding Constraint (iv) and Constraint (v) to the resulting set of linear inequalities. Scalability of this construction is limited due to the size of the product automata to be computed, and the complexity of the resulting Presburger formula.

Solving the Parikh Automata Intersection Problem Using PC*
We will now proceed to show how our calculus PC* can lazily prove the unsatisfiability of the running example. The calculus interleaves several reasoning principles, which we will later define precisely as calculus rules:
(i) Similarly to [Verma et al. 2005], we first describe each automaton as a flow network, counting how often each transition is taken in a run of the automaton. The flow constraints are an over-approximation of the possible accepting runs of an automaton.
(ii) We use a linear integer solver to simplify the flow constraints and prune away paths through the automata that are not feasible.
(iii) We use a tailor-made propagation algorithm to identify disconnected transitions of the automata that can never be taken, and lazily add the corresponding constraints.
(iv) When propagation cannot infer further constraints, we use case splitting to subdivide the problem into smaller parts.
(v) Once the constituent automata of a product have become sufficiently small, we compute the precise product automaton.

Approximating Paths Through Automata Using Flow Analysis. We associate each transition t of A′ and B with a fresh variable x_t ranging over the natural numbers (i.e., non-negative integers). These variables represent how many times each transition is taken. Hence, the final value of the registers r1, ..., rd of each automaton is the element-wise sum

  Σ_t x_t · w̄_t, where w̄_t are the increments of transition t. (1)

We proceed by adding linear constraints requiring the transition variables to represent a flow through their automaton, stating for each state that the number of incoming transitions equals the number of outgoing ones. E.g., a state s of automaton B would give rise to the equation

  Σ_{⟨q,σ,s⟩∈T} x_{⟨q,σ,s⟩} = Σ_{⟨s,σ,q′⟩∈T} x_{⟨s,σ,q′⟩}, (2)

where we let x_{⟨q,σ,q′⟩} refer to the integer variable associated with a transition from state q to state q′ with label σ. The initial state of the automaton receives an additional inflow of 1, and the accepting states have an additional common outflow of 1. Note that self-loops cancel out in these equations; therefore, when a loop can become unreachable, additional constraints are required to ensure consistency.
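For a concrete assignment of transition counts, these flow conditions can be checked directly. The following sketch simplifies to automata with a single accepting state; the function and variable names are illustrative, not Catra's:

```python
from collections import defaultdict

# A sketch of the flow conditions, simplified to a single accepting state:
# an assignment of transition counts is a flow iff inflow equals outflow
# at every state, after adding one extra unit of inflow at the initial
# state and one extra unit of outflow at the accepting state.
def satisfies_flow(transitions, counts, initial, final):
    inflow, outflow = defaultdict(int), defaultdict(int)
    for (src, _label, dst), n in zip(transitions, counts):
        outflow[src] += n
        inflow[dst] += n
    inflow[initial] += 1   # the run enters at the initial state
    outflow[final] += 1    # ... and leaves through the accepting state
    states = set(inflow) | set(outflow)
    return all(inflow[s] == outflow[s] for s in states)

# q0 --a--> q1, q1 --b--> q1 (self-loop), with q1 accepting.
ts = [("q0", "a", "q1"), ("q1", "b", "q1")]
print(satisfies_flow(ts, [1, 3], "q0", "q1"))  # True: the run a b b b
print(satisfies_flow(ts, [2, 0], "q0", "q1"))  # False

# Self-loops cancel out, so flow alone cannot detect a disconnected loop:
ts2 = [("q0", "a", "q1"), ("q2", "b", "q2")]
print(satisfies_flow(ts2, [1, 5], "q0", "q1"))  # True, although q2 is unreachable
```

The last call illustrates why the flow constraints are only an over-approximation: the unreachable self-loop at q2 satisfies the equations, and it is exactly such disconnected loops that the connectivity reasoning below has to rule out.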

Arithmetic Flow Simplification. We then use linear arithmetic reasoning on the register equations (1) and flow equations (2) to simplify the counters associated with transitions. As an example, we start with A′. Substituting solutions of the equations for A′ back into the automaton, we obtain the automaton in Fig. 2, where the notation σ/k now denotes a transition that accepts letter σ and is taken k times. In this case, the transitions of the automaton can be specified purely in terms of the free variables we care about representing: the length of the accepted word (l), the start of the substring (i), and the length of the substring (n).
Having obtained this representation, we conclude that 1 < n ≤ 2 from Constraint (iv) and Constraint (v). The lower bound directly follows from Constraint (iv) and Constraint (v), whereas the upper bound is obtained by the following reasoning. Since all transition variables are non-negative integers (a transition cannot be used a negative number of times), the upper bound n ≤ 2 follows from the non-negativity of the corresponding transition count. Therefore, it follows that n = 2, which implies that the c-labelled transition can never be used under these constraints, and that we must use the two d-labelled transitions, both now taken n − 1 = 1 times. Substituting the derived value of n and applying similar reasoning to automaton B, we arrive at the simplified automata in Fig. 3.

A′ after even more simplification using linear algebra, with most transitions expressed directly in terms of the free registers. Intuitively, this version captures the fact that the incoming flow of 1 is distributed across the two outgoing transitions of the initial state.

(a) A′ with its associated transition variables in symbolic form. (b) B with its associated transition variables in symbolic form. The large number of implicitly existentially quantified variables on transitions suggests that this automaton has a more complex structure with relation to its (free) target variable l, representing the string length, than A′.
Note that the per-automaton length-counting registers have been assigned the same solver variable l. Since they have to be equal, either one can be used in both automata through similar applications of equality elimination rules.

Case Splitting and Connectivity Propagation. We now have a choice of two paths through automaton B: the upper or the lower. Since arithmetic reasoning and propagation are not able to resolve this choice, we perform a case split by selecting a transition variable that would disconnect some strongly connected component of automaton B, in this case the transition guarded by variable x12. We split the reasoning into the cases x12 > 0 (transition used) and x12 = 0 (transition unused). For presentation, we focus on the latter case, as the former can be handled in a similar way.
In the case x12 = 0, we can conclude that the target state of that transition is now unreachable, which means that its outgoing transitions are now unusable. Propagation can therefore infer the additional equations x9 = 0 and x11 = 1.
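The propagation step can be sketched as a reachability computation over the transitions not yet known to be zero. This is illustrative code with hypothetical names, not Catra's implementation:

```python
# Connectivity propagation sketch: once some transition variables are
# fixed to 0, states that can no longer be reached from the initial state
# are dead, and all their outgoing transition variables can be propagated
# to 0 as well. `zeroed` is the set of transition indices already fixed to 0.
def propagate_dead_transitions(transitions, zeroed, initial):
    alive = {initial}
    changed = True
    while changed:
        changed = False
        for idx, (src, _label, dst) in enumerate(transitions):
            if idx not in zeroed and src in alive and dst not in alive:
                alive.add(dst)
                changed = True
    # transitions leaving dead states must also be 0
    return {idx for idx, (src, _l, _d) in enumerate(transitions)
            if src not in alive and idx not in zeroed}

ts = [("q0", "c", "q1"), ("q0", "d", "q2"), ("q2", "d", "q1"), ("q1", "a", "q3")]
# Case split: transition 1 (q0 -> q2) is unused, so q2 becomes unreachable
# and its outgoing transition 2 can be propagated to 0.
print(propagate_dead_transitions(ts, {1}, "q0"))  # {2}
```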

Computing Products. After discounting all transitions that can no longer be taken, we are left with two (small) flat automata, and can compute their product with relative ease. By putting off computing the product B × A′ until after performing linear reasoning, and using that reasoning to prune transitions, we have computed a smaller product than we would have with an eager approach.
Computing the product, we immediately notice that the d-labelled transition of automaton A′ has no correspondence in B, leading to an empty product. We can close the proof goal and backtrack, and we will eventually derive that the imposed constraints are unsatisfiable by repeating the same process on the other branch.

PRELIMINARIES
We first survey some of the required background on finite-state automata and the Parikh image. In addition, we assume basic familiarity with first-order logic, Presburger arithmetic, and the classical sequent calculus; for reference, see e.g. [Fitting 1996].
A homomorphism is a structure-preserving map h between two monoids M = ⟨D; ⊕; 0⟩ and M′ = ⟨D′; ⊗; 0′⟩, i.e., a map satisfying h(a ⊕ b) = h(a) ⊗ h(b) for all a, b ∈ D, and h(0) = 0′.

Languages, Finite-State Automata and Their Products
We define an alphabet as a finite set of symbols Σ, with words drawn from Σ*, and the concatenation operation w1 · w2 over two strings w1, w2. Note that ⟨Σ*; ·; ε⟩, with ε the empty word, is a non-commutative monoid, referred to as the free monoid on Σ. The string length function |·| is an example of a homomorphism between Σ* and Z.
A finite-state automaton A with alphabet Σ is a tuple ⟨S, s0, F, T⟩, where S is the set of states, s0 the initial state, F the set of accepting states, and T ⊆ S × Σ × S the transition relation. We write a transition t = ⟨s, σ, s′⟩ ∈ T as s −σ→ s′. Similarly, we use the notation s −→ to refer to the set of transitions starting in s, and −→ s to refer to the set of transitions coming into s, whenever the automaton is clear from the context.
The word of a path p = ⟨s0 σ1 s1 . . .⟩ is the word σ1 σ2 . . . formed by the labels on the path. Finally, the set of words accepted by an automaton A, denoted by L(A) ⊆ Σ*, is the set of words of accepting paths.
The product of two automata A1 = ⟨S1, s0,1, F1, T1⟩ and A2 = ⟨S2, s0,2, F2, T2⟩ is the automaton A1 × A2 = ⟨S1 × S2, ⟨s0,1, s0,2⟩, F1 × F2, T⟩, where ⟨⟨s1, s2⟩, σ, ⟨s1′, s2′⟩⟩ ∈ T if and only if ⟨s1, σ, s1′⟩ ∈ T1 and ⟨s2, σ, s2′⟩ ∈ T2. The product automaton runs A1 and A2 in parallel on an input and only accepts the input if both automata would do so; we have L(A1 × A2) = L(A1) ∩ L(A2).
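The construction can be sketched in a few lines of Python. This is a naive, illustrative implementation in which an automaton is a triple of initial state, accepting states, and transition list:

```python
# Standard product construction: states are pairs, transitions synchronise
# on the label, and a pair is accepting iff both components are accepting.
def automaton_product(a1, a2):
    (s1, f1, t1), (s2, f2, t2) = a1, a2  # (initial, accepting, transitions)
    transitions = [((p, q), label1, (p2, q2))
                   for (p, label1, p2) in t1
                   for (q, label2, q2) in t2
                   if label1 == label2]
    accepting = {(p, q) for p in f1 for q in f2}
    return ((s1, s2), accepting, transitions)

# L(A1) = a*, L(A2) = words of even length over {a};
# the product accepts words of even length over {a}.
a1 = ("p0", {"p0"}, [("p0", "a", "p0")])
a2 = ("q0", {"q0"}, [("q0", "a", "q1"), ("q1", "a", "q0")])
initial, accepting, ts = automaton_product(a1, a2)
print(initial, sorted(accepting), len(ts))
```

Note that the state space of the product is the Cartesian product of the component state spaces, which is the source of the exponential blow-up that PC* tries to avoid.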

The Parikh Map and Its Image
Formally, the Parikh map over an alphabet Σ = {a1, ..., ak} is defined as in [Kozen 1997]: Ψ(w) = [#a1(w), ..., #ak(w)], where #a(w) denotes the number of occurrences of the letter a in w. That is, Ψ(w) is a vector of the number of occurrences of each character for a given string w. For example, for Σ = {a, b}, we would have Ψ(abb) = [1, 2]. We define the image of this map, the Parikh image, of some language L ⊆ Σ* as Π(L) = {Ψ(w) | w ∈ L}. We also sometimes use the standard notation #a(w) to talk about an individual letter a in a word w. For example, for the Parikh vector above, we would have #a(abb) = 1. Parikh's theorem states that any context-free language has a Parikh-equivalent regular language (c.f. [Esparza et al. 2011] for a construction of such automata from context-free grammars and [Lavado et al. 2013] for bounds on its size). The Parikh image is therefore a semi-linear set and Presburger-definable. While Parikh's theorem applies to arbitrary context-free languages, in this paper we focus only on regular languages.
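For a fixed alphabet, the Parikh map is straightforward to compute; a small Python illustration:

```python
from collections import Counter

# The Parikh map over a fixed alphabet, as a plain function on words:
# parikh(w, alphabet) returns the vector of letter counts of w.
def parikh(word, alphabet):
    counts = Counter(word)
    return tuple(counts[a] for a in alphabet)

print(parikh("abb", "ab"))   # (1, 2)
print(parikh("abba", "ab"))  # (2, 2)
# The map forgets letter order: distinct words can have equal Parikh vectors.
print(parikh("ab", "ab") == parikh("ba", "ab"))  # True
```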

The Parikh Image of a Regular Language Expressed in Presburger Arithmetic
It is known that the Parikh image of any context-free language can be described by a linear-size existential Presburger formula [Verma et al. 2005]. This representation can be straightforwardly adapted for use with a product of regular languages. For an intuition, the approach consists of first computing the product, then assigning each state and transition an existentially quantified non-negative integer variable, and finally describing all paths through the automaton through two sets of constraints: flow equations relating the inflow and outflow of each automaton state, and constraints that enforce connectedness by ordering states by distance in a spanning tree rooted in the initial state.
We refer to this model as the baseline approach, though we also apply optimisations as described in Section 7.1. The calculus introduced in this paper, by contrast, lazily enforces the connectedness constraint, while also interleaving the computation of products of automata and propagating information between the steps to reduce the amount of work that needs to be done.

PROJECTIONS ON PARIKH IMAGES
The Parikh map Ψ represents a homomorphism from the (free) non-commutative monoid Σ* to the (free) commutative monoid N^k. We are, however, often interested in projections of the Parikh map, rather than the full image. We therefore consider arbitrary homomorphisms h : Σ* → M, where M = ⟨D; ⊕; 0⟩ is a commutative monoid. We give several examples of such projections on Parikh images later in this section.
Observe that every homomorphism h : Σ* → M can be represented as the composition h = h′ ∘ Ψ, for some homomorphism h′ : N^k → M. One of the insights underlying our approach is that it is more efficient to directly compute a projection h(L) of the Parikh image than to first compute the standard image Π(L), followed by a projection to some property of interest.
Example 4.1 (String Length). One such simplifying homomorphism can express string length, the problem that originally motivated our study of the Parikh map. This mapping is relevant when solving constraints that combine language membership with string length, for instance the constraint given in the introduction:

  x ∈ L1 ∧ y ∈ L2 ∧ |x| > |y|. (3)

To solve this formula, let M = N, and define the homomorphism h by h(σ) = 1 for all characters σ ∈ Σ. The length of a string w is then h(w) = |w|, and to solve (3) we can instead solve the equi-satisfiable formula m ∈ h(L1) ∧ n ∈ h(L2) ∧ m > n. This paper proposes efficient native procedures to reason about membership constraints like m ∈ h(L1), avoiding the computation of the complete image Π(L1). This encoding can be seen applied to automaton B in Example 2.1 (see Fig. 1b).
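The factorisation h = h′ ∘ Ψ can be illustrated concretely for the length homomorphism; the helper names below are hypothetical:

```python
from collections import Counter

# The length projection factors through the Parikh map: |w| = h'(Ψ(w)),
# where h' simply sums the components of the Parikh vector.
def parikh(word, alphabet):
    counts = Counter(word)
    return tuple(counts[a] for a in alphabet)

def h_prime(vector):
    return sum(vector)

w = "abbab"
print(h_prime(parikh(w, "ab")))  # 5, i.e. len(w)
```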

Integer Constraints on Strings
Parikh images are also applicable for deciding more general classes of string constraints [Chen et al. 2020]. Consider the substring constraint, Constraint (iii) of Example 2.1: x1 = substring(x2, i, n), that is, x1 is the length-n substring of x2 starting at offset i. That constraint belongs to an expressive fragment of string logic that cannot be decided by most state-of-the-art string solvers.
In Section 2 we modelled Constraint (iii) (and the other constraints) using Parikh automata [Cadilhac et al. 2011; Klaedtke and Rueß 2002]. A Parikh automaton is a finite-state automaton extended such that transitions are additionally labelled with offset vectors defining the increments of a finite number of counters. This means that Parikh automata recognise words over an extended alphabet Σ × D, where D ⊆ N^k is a finite set of increment vectors (notation as in [Cadilhac et al. 2011]) and Σ is the alphabet of the original automaton.
We use the symbols πΣ, πD to denote projections to the first and the second component of a composite letter (σ, d̄), respectively, and extend those projections to words.

Definition 4.1. A Parikh automaton of dimension k ≥ 0 is a pair ⟨A, C⟩, where C ⊆ N^k is a semilinear set (or, equivalently, a Presburger formula), and A is a finite automaton with the alphabet Σ × D, where D ⊆ N^k. We say that ⟨A, C⟩ recognises a word w ∈ Σ* if and only if the automaton has a run accepting an extended word w′ ∈ (Σ × D)* such that πΣ(w′) = w and the sum of the increment vectors in πD(w′) is in C.
Applied to Constraint (iii), the decision procedure in [Chen et al. 2020] will construct a pre-image of x1 under the substring operation, and check whether this pre-image is consistent with the constraint x2 ∈ L(B), corresponding to the full product seen in Example 2.1. Because substring depends on the values of the integer variables i, n, a Parikh automaton, shown in Fig. 1a, models the pre-image. The Parikh automaton has dimension 3, as it includes the length register l besides the variables i, n. Intuitively, this construction accepts any prefix symbol, incrementing i to mark the start of the substring; followed by the automaton representing the substring itself, modified to increment the counter n at each transition; followed by a state that accepts any suffix after the matched substring.
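A hand-built sketch of this pre-image construction for L(c|dd) follows the intuition above; the state names and exact transition layout are illustrative, not copied from Fig. 1a:

```python
# Pre-image Parikh automaton for substring matching L(c|dd), with registers
# (l, i, n): l counts all characters read, i the prefix before the match,
# and n the length of the matched substring.
ALPHABET = "abcd"
transitions = []
# prefix: loop in 'pre', counting both l and the offset i
for ch in ALPHABET:
    transitions.append(("pre", ch, "pre", (1, 1, 0)))
# branch c: one transition reading c, counting l and n
transitions.append(("pre", "c", "post", (1, 0, 1)))
# branch dd: two transitions reading d, each counting l and n
transitions.append(("pre", "d", "mid", (1, 0, 1)))
transitions.append(("mid", "d", "post", (1, 0, 1)))
# suffix: loop in 'post', counting only l
for ch in ALPHABET:
    transitions.append(("post", ch, "post", (1, 0, 0)))

def accepting_register_values(word):
    frontier = [("pre", (0, 0, 0))]
    for ch in word:
        frontier = [(dst, tuple(r + d for r, d in zip(regs, inc)))
                    for (state, regs) in frontier
                    for (src, label, dst, inc) in transitions
                    if src == state and label == ch]
    return sorted({regs for (state, regs) in frontier if state == "post"})

# "acb": the only accepting run matches c at offset 1, so (l, i, n) = (3, 1, 1).
print(accepting_register_values("acb"))  # [(3, 1, 1)]
print(accepting_register_values("dd"))   # [(2, 0, 2)]
```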
Denoting the language described by Fig. 1a as L(A′), we can then replace x1 = substring(x2, i, n) and x2 ∈ L(B) (of Constraint (i)) with an equi-satisfiable formula that no longer contains any explicit substring operation:

  x2 ∈ L(A′) ∧ x2 ∈ L(B) ∧ 0 ≤ i ≤ i + n ≤ |x2|. (4)

To check the satisfiability of Eq. (4), we need a decision procedure that can process intersections of regular languages (in this case, of L(A′) and L(B), synchronising on Σ), while imposing the side condition 0 ≤ i ≤ i + n ≤ |x2| on the increment sum. In [Chen et al. 2020], this decision procedure turned out to be the main bottleneck of the string solver, which was one of the motivations to develop the lazy algorithm proposed in this paper.

A CALCULUS FOR PROJECTIONS ON PARIKH IMAGES
We start by defining our calculus, PC*, for one automaton, and only extend it to products of automata in Section 6. Assume an automaton A = ⟨S, s0, F, T⟩. We use the notations introduced in Section 3.2. For convenience, we introduce the following additional notions:

Definition 5.1. The transition count #(p, t) is the number of times a transition t = s −σ→ s′ ∈ T appears on a path p. A transition selection function is a function ξ : T → N labelling every transition t ∈ T with a non-negative number.
We introduce the two predicates that will be used by our calculus, with the following definitions:

Definition 5.2. The Parikh predicate Im_{A,h}(ξ, v) holds for some automaton A = ⟨S, s0, F, T⟩, some homomorphism h : Σ* → M to a commutative monoid M, some transition selection function ξ : T → N, and some monoid element v ∈ M if v is an element of the Parikh image of L(A) modulo h; more formally, when there is an accepting path p = ⟨s0 σ1 s1 . . .⟩ ∈ Paths(A) such that ξ(t) = #(p, t) for all t ∈ T, and v is the image under h of the word of p.
Definition 5.3. Conn(A, σ) holds for an automaton A = ⟨Q, q₀, δ, F⟩ and a transition selection function σ : δ → ℕ if for every t = q →a q′ ∈ δ with σ(t) > 0, there is some σ-selected accepting path that visits t's starting state q. More formally, if σ(t) > 0 then there is some path π ∈ Paths(A) with σ(t′) > 0 for every t′ ∈ π and q ∈ States(π), such that π ends in an accepting state q_f ∈ F. The predicate Conn(A, σ) expresses that A is connected under the selection function σ, and is implied by Im_{A,h}(σ, m).
The rules of PC* for one automaton are given in Table 1. The rules operate on sets of formulas and can be interpreted as rules of a one-sided sequent calculus, in which all formulas are located in the antecedent [Fitting 1996]. The rules relate premises Φ₁, ..., Φₙ with some conclusion Φ. When constructing a proof, we start with some root Φ, and then apply proof rules to the goals of the proof in bottom-up direction until all goals are closed, or no more rules are applicable.
A proof in which all proof goals are closed shows that the formulas in the root Φ of the proof are inconsistent (have no solutions). An unclosable goal to which no rules are applicable gives rise to a solution of the formulas in the root Φ. Such a goal will only contain formulas in Presburger arithmetic, allowing a solution to be computed using standard algorithms [Harrison 2009].

We use the convention of splitting the formulas in proof goals into linear (in-)equalities and other formulas, and assume that the predicates Im and Conn only occur positively. The transition selection function σ is represented symbolically and can, in practice, be read as a function from transitions to ℕ-valued terms (e.g. t or t+1). In our implementation Catra, described in Section 7, σ is a vector of fresh variables with the same size as δ.
To ensure termination, rules can only be applied when they add new formulas on every created branch (this is the notion of regularity of a proof [Fitting 1996]). For example, this means that Split can only be applied to proof goals that contain neither σ(t) = 0 nor σ(t) > 0, and can never be applied to split on the same term twice on the same branch.
The rule Expand expands an Im_{A,h}(σ, m) predicate into the more basic predicate Conn(A, σ), together with linear equations relating the transitions mentioned by σ to the monoid element m, and the linear flow equations described by Flow (below). Since Conn and Im are partially redundant and the difference is covered by Flow, we can remove the instance of Im when applying Expand. In this sense, we split the semantics of the Im predicate into its counting aspect (covered by Flow) and its connectedness aspect (covered by Conn).
In Expand, we use the shorthand notation h(q →a q′) = h(a), i.e., we allow the homomorphism h to be applied also to transitions. The predicate Flow(A, σ) represents the flow equations to be generated when expanding Im_{A,h}(σ, m). We assume that each application of the predicate introduces a fresh integer variable for every accepting state q ∈ F; Flow then asserts, for every state, that the selected flow into the state (plus 1 at the initial state) equals the selected flow out of it (plus the fresh variable of the state if it is accepting, with the fresh variables summing to 1). The rule Split allows us to branch the proof tree by first trying to exclude a transition from a potential solution before concluding that it must be included. Intuitively, this is what guarantees our ability to make forward progress by eliminating paths through A.
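As an illustrative sketch of the balance condition behind Flow(A, σ) (not Catra's actual implementation), the following Python checks Kirchhoff-style flow balance for a concrete selection. The encoding of transitions as triples and the accepting_flow map standing in for the fresh variables are assumptions of this example:

```python
def flow_balanced(transitions, sigma, initial, accepting_flow):
    """Check the flow equations behind Flow(A, sigma): at every state,
    incoming selected flow (plus 1 at the initial state) must equal
    outgoing selected flow (plus the flow terminating there, given by
    accepting_flow, whose values must sum to 1 for one accepting path)."""
    states = {q for (q, _, r) in transitions} | {r for (q, _, r) in transitions}
    for s in states:
        inflow = sum(sigma[t] for t in transitions if t[2] == s)
        outflow = sum(sigma[t] for t in transitions if t[0] == s)
        source = 1 if s == initial else 0
        sink = accepting_flow.get(s, 0)
        if inflow + source != outflow + sink:
            return False
    return sum(accepting_flow.values()) == 1

# A two-state automaton with a loop, selected to take the loop twice.
ts = [("q0", "a", "q1"), ("q1", "b", "q1")]
sigma = {ts[0]: 1, ts[1]: 2}
print(flow_balanced(ts, sigma, "q0", {"q1": 1}))  # True
```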
The Prop rule allows us to propagate (dis-)connectedness across A. It states that we are only allowed to use transitions attached to a reachable state, and is necessary to ensure connectedness in the presence of cycles in A. The rule makes use of the notion of dominating sets of transitions, defined as follows:

Definition 5.4. A set of transitions D of an automaton A dominates a transition t, written Dom(D, A, t), if every accepting path of A containing t contains at least one transition from D. Notably, Dom(∅, A, t) for every unreachable transition t, and Dom({t}, A, t) for every transition t.
The Dom relation can be efficiently implemented in a solver by computing a standard Ford-Fulkerson/Edmonds-Karp min-cut between a state and the initial state after removing transitions where σ(t) = 0. By only performing this computation after such filtering, a solver additionally avoids breaking the rule against adding clauses that already appear in the formula.
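A minimal sketch of this idea in Python, with plain reachability standing in for the full min-cut computation (all names are our own, not Catra's):

```python
from collections import deque

def reachable(transitions, start):
    """States reachable from `start` over the given transitions (BFS)."""
    seen, todo = {start}, deque([start])
    while todo:
        q = todo.popleft()
        for (a, _, b) in transitions:
            if a == q and b not in seen:
                seen.add(b)
                todo.append(b)
    return seen

def dominates(cut, transitions, t, initial, sigma):
    """Approximate Dom(cut, A, t): after dropping transitions with
    sigma = 0 and the candidate cut, t's source must be unreachable
    from the initial state, i.e. every surviving path crosses the cut."""
    live = [tr for tr in transitions if sigma.get(tr, 1) > 0 and tr not in cut]
    return t[0] not in reachable(live, initial)

ts = [("q0", "a", "q1"), ("q1", "b", "q2")]
sigma = {tr: 1 for tr in ts}
print(dominates({ts[0]}, ts, ts[1], "q0", sigma))  # True: q1 only via the first transition
print(dominates(set(), ts, ts[1], "q0", sigma))    # False
```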
Finally, the rule Subsume can be applied once the connectedness of an automaton has been ensured by exhaustive application of the other rules. This suggests a proof strategy: Prop when you can, Split when you must, and Subsume when neither is possible any more.
In addition to Table 1, we assume the existence of a rule Presburger-Close, corresponding to a sound and complete solver for Presburger arithmetic formulas and for constraints over the monoid M.
A decision procedure would start from one or more predicates Im_{A,h}(σ, m) to be satisfied, possibly in combination with other constraints over m. It would then first expand the predicates using the Expand rule, and subsequently apply the other rules to search for a solution.
As illustrated in Section 2, a decision procedure can also perform arithmetic rewriting of the occurring terms and equations. Such reasoning is not necessary for correctness or completeness, but it shortens the examples considerably; we will therefore assume the existence of a rule Algebra that allows us to perform standard algebraic reasoning on linear arithmetic constraints.

An Example
Here we return to A′ from Section 2 and perform the steps of Sections 2.2.2 and 2.2.3 in the formal calculus we have just established, but exclude Constraint (ii), and therefore the entire automaton B, since support for products of automata is only introduced in Section 6. The example then becomes satisfiable: x₁ = dd, i = 1, x₂ = adda, n = 4 is a satisfying assignment to Constraint (i) and Constraints (iii) to (v).
Starting with A′, the homomorphism h extracts the increments of a transition. We use the same compact notation here as in Section 2 to represent what is essentially a sparse vector of 1 and 0 coefficients.
The reader is advised to review Fig. 1a from Section 2 while going through this example.
Initially, we let σ map to fresh variables to obtain, after some simplifications, the constraints shown in Fig. 4, which also shows the resulting proof tree. As in Sections 2.2.1 and 2.2.2, arithmetic reasoning leaves the calculus with fixed values for several of the transition variables, and it eventually concludes that the direct transition between the two states (variable t₃) is incompatible with the constraint. It is then possible to remove the Conn predicate using Subsume, since Prop is not able to infer further constraints and no non-trivial applications of Split remain. After this, further arithmetic reasoning derives a solution n = 4, i = 1, j = 2, and we conclude that the root constraint is satisfiable. To obtain values for the string variables x₁, x₂ corresponding to the solution, one can construct an accepting path of the automaton taking each transition t exactly σ(t) times.

Correctness of PC*
Our correctness proof of PC* consists of two main parts: first, we show that the construction of a proof always terminates, and then that each of the proof rules in Table 1 is an equivalence transformation, i.e., does not change the set of satisfying assignments of a formula. In combination, these two results immediately imply that PC* gives rise to a decision procedure.

PC* Terminates.
Lemma 5.1. Suppose Φ is a set of formulas in which the predicates Im and Conn only occur positively. There is no infinite sequence of proofs P₀, P₁, P₂, ... in which P₀ has Φ as root, and each P_{i+1} is derived from P_i by applying one of the rules in Table 1.
Proof. The rule Expand can only be applied finitely often, since each application removes one Im predicate, and none of the rules introduce new instances of the predicate. The rule Subsume can only be applied finitely often, since it strictly decreases the combined number of Im and Conn predicates in sets of formulas, and none of the rules increases that number.
To show termination of Split and Prop, observe that the σ in a predicate Conn(A, σ) is never updated on a proof branch, which means that the set of terms σ(t) for t ∈ δ on every branch is finite. Each application of Split and Prop adds a new formula σ(t) = 0 or σ(t) > 0 to a proof goal, which can only happen finitely often. □

Proof (of Lemma 5.2). This property is shown by analysing the possible applications of each proof rule. Expand unfolds the definition of the Im predicate. To show that the rule is solution-preserving, we prove the equivalence of the upper and lower sets of formulas:
• Assume that β satisfies the premise, which implies that val_β(σ) describes a consistent, connected flow of the automaton. By the same argument as in [Verma et al. 2005], this flow can be mapped to an accepting path π of A on which each transition t occurs exactly val_β(σ(t)) times. Together with the equation val_β(σ(t)) = #(π, t), this implies that β satisfies Im_{A,h}(σ, m).

In Split, we make use of the fact that σ(t) is ℕ-valued by definition. For any t, clearly exactly one of σ(t) = 0 and σ(t) > 0 is satisfied, implying the property.
For Prop, suppose that Dom(D, A, t), which means that every accepting path containing t contains at least one of the transitions in D. Consider an assignment satisfying Conn(A, σ) in which σ(t′) = 0 for every t′ ∈ D. Then every accepting path containing t is broken by some transition t′ ∈ D with σ(t′) = 0, so val(σ(t)) = 0 also has to hold, since no unbroken accepting path containing t exists.
Finally, for Subsume, observe that if Split cannot be applied, then a goal must contain σ(t) = 0 or σ(t) > 0 for every t. In case the formulas in the goal are inconsistent, an application of Subsume is trivially solution-preserving; therefore assume that the goal is consistent, which means that it contains exactly one of σ(t) = 0 and σ(t) > 0 for each t. Since Prop is not applicable, the transitions with σ(t) > 0 must form a connected sub-graph of the automaton; this means that Conn(A, σ) is redundant, as it is implied by the goal. □

PARIKH IMAGES FROM PRODUCTS OF AUTOMATA
We now generalise our calculus to natively work with intersections of regular languages, or equivalently products of automata. For this extension, we change the main predicate Im to be indexed by a vector of automata ⟨A₁, ..., Aₙ⟩. For simplicity, we assume that the sets of states of the automata (and therefore also the transition sets) are pairwise disjoint.
For the calculus (Table 2), we first extend Expand to generate flow equations and instances of Conn for each automaton, resulting in a new rule ExpandM. Unlike Expand, ExpandM does not remove the Im predicate, since it is needed to keep track of the currently considered partial products.
The rule Materialise introduces the product of two individual automata A_i, A_j; this step eliminates A_i, A_j as indices of the Im predicate, and instead adds the product of the two automata restricted to the transitions that can still be taken. The rule also introduces the corresponding flow equations and Conn predicate.
In Materialise, we use a notation for pruning away parts of an automaton based on the transition selection function σ, keeping only those transitions for which σ is known to be positive. This filtering operation can be optimised to also eliminate states that thereby become unreachable; this is kept implicit here for the sake of presentation.
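A rough Python sketch of this pruning step, under our own encoding of automata as transition triples (the real implementation operates on Catra's Scala automaton representation):

```python
def prune(transitions, initial, accepting, sigma):
    """Keep only transitions whose selection variable is known positive,
    then drop states no longer reachable from the initial state, together
    with the transitions touching them (the optional clean-up the paper
    keeps implicit). Returns (live states, kept transitions, accepting)."""
    kept = [t for t in transitions if sigma[t] > 0]
    reach, frontier = {initial}, [initial]
    while frontier:
        q = frontier.pop()
        for (a, _, b) in kept:
            if a == q and b not in reach:
                reach.add(b)
                frontier.append(b)
    kept = [t for t in kept if t[0] in reach and t[2] in reach]
    return reach, kept, accepting & reach

# One transition is excluded (sigma = 0), stranding the loop through q2.
ts = [("q0", "a", "q1"), ("q0", "b", "q2"), ("q2", "c", "q0")]
states, kept, acc = prune(ts, "q0", {"q1"}, {ts[0]: 1, ts[1]: 0, ts[2]: 3})
print(states, kept, acc)
```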
The rule Materialise has to connect the newly introduced product to the previous automata A_i, A_j. This is done by extending the selection function σ to σ′, mapping the transitions of the product to fresh variables. The multiplicity of transitions in the product then has to be related to the multiplicities in the individual automata, modelled using the Bind predicate. The predicate expresses that the multiplicity of a transition t of A_i has to coincide with the sum of the multiplicities of the product transitions derived from t, and similarly for A_j. Once precisely one automaton remains, neither rule applies, and we proceed with the calculus as before.

An Example
We return to our example from Section 2. We also reuse the definition of h as the simple projection extracting the counter increments; in this case, h increments the length register n on every transition of B, since B only counts length. The reader is advised to review Figs. 3a and 3b from Section 2 while going through this example. The transition variables of the figures match the ones used here.
We extend Eq. (5) from Section 5.1 with the definitions for B shown in Eq. (6). The only rule applicable at the start is ExpandM, which we use to add the corresponding constraints for each automaton of the product, just as we would have in the single-automaton version. After that, we expand the various sums and apply light equality propagation.
We then continue by repeating the steps of Section 5.1, shown in Fig. 4, since the steps to propagate Example 2.1 across the automaton A′ still apply for the product. Note that this fixes the values of all transitions of A′ but one, including the free counter variables i, j, whose now-constant values we propagate everywhere.
Once we have performed reasoning by equality to eliminate transition variables, we obtain the version of B shown in Fig. 3b, represented by the clauses in the final node of the tree. This corresponds to the end of Section 2.2.2. Our only options now are to either split on which path we take, by case-splitting on t₁₂, or to directly invoke Materialise. The latter would produce a shorter tree in this instance but might lead to a larger product being computed, so we pick Split. This leads to an opportunity for propagation, since t₁₂ = 0 cuts off t₉ on the left branch. Note that the same opportunity does not present itself on the right branch; taking the upper path does not preclude visiting the other state.
In both cases the automata are now simpler, so we apply Materialise to obtain their product, which on both branches is empty. An empty product has no outgoing transitions from its initial state, and thus leads to a flow equation 1 = 0. We can then close the proof and conclude that the problem is unsatisfiable. A full derivation tree for the example can be found in Fig. 5. Since Table 2 only extends the existing rules of Table 1, we focus on the differences compared to the calculus for a single automaton.

PC* for Products of Automata Terminates.
Lemma 6.1. Suppose Φ is a set of formulas in which the product version of Im only occurs positively. There is no infinite sequence of proofs P₀, P₁, P₂, ... in which P₀ has Φ as root, and each P_{i+1} is derived from P_i by applying one of the rules in Table 2.
Proof. The rule Materialise can only be used finitely many times, as each application reduces the number of automata indexing Im by one, until only one automaton remains and Lemma 5.1 for single-automaton instances applies.
ExpandM can only be applied precisely once per Im term, since each application introduces an identical set of formulas, and we have a generic side condition that no rule may add only redundant formulas. □

6.2.2 The Rules in Table 2 are Solution-Preserving. Since our calculus now includes a rule introducing new variables, the Materialise rule, we have to slightly generalise the notion of solution-preservation:

Lemma 6.2. Consider an application of one of the rules in Table 2, with premises Φ₁, ..., Φₙ and conclusion Φ. An assignment β (over the symbols in Φ) satisfies the conclusion if and only if there is an extension β′ satisfying one of the premises Φ_i.
Proof. We have to consider the two new rules in Table 2. The result is immediate for ExpandM, since this rule does not remove the Im predicate from a proof goal, and the newly introduced formulas are all implied by the Im predicate.
For Materialise, observe that the existence of an accepting path in A_i × A_j is equivalent to the existence of individual paths in A_i, A_j accepting the same word. The path in the product satisfies the flow equations and connectedness, and is related to the individual paths as stipulated by the Bind predicate. □

IMPLEMENTATION
We implement PC* for Parikh automata as described in Section 4.1. The artefact submitted along with this paper is a program that reads an instance file containing one or more products of Parikh automata, with transition labels defined as ranges of Unicode characters, along with a set of constraints on the final values of their registers, expressed in Presburger arithmetic in a C-like syntax. We call this program Catra. Catra is written in Scala, with the calculus described in this paper implemented as a theory plug-in for the Princess automated theorem prover [Rümmer 2008], which also performs the Presburger reasoning. For comparison, we also provide an implementation of the baseline method from [Verma et al. 2005], a direct translation that uses the nuXmv symbolic model checker [Cavada et al. 2014] to solve our constraints, and the approximation described in [Janků and Turoňová 2020] on top of the standard baseline back-end. An example of an input file corresponding to our running example introduced in Section 2 can be found in the root directory of the artefact [Stjerna and Rümmer 2024].
Catra uses symbolic labels for automata. A symbolic label is defined as a finite range of Unicode code points. This allows representing a regular-expression character class like [a-z] as a single transition a -> b [a, z], where it would otherwise require 26 non-symbolic transitions.
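With such labels, a product construction only needs to intersect code-point ranges to decide whether two transitions synchronise. A minimal sketch (our own encoding, not Catra's):

```python
def range_intersect(l1, l2):
    """Intersect two symbolic labels given as inclusive Unicode
    code-point ranges; None means the product transition is infeasible."""
    lo, hi = max(l1[0], l2[0]), min(l1[1], l2[1])
    return (lo, hi) if lo <= hi else None

# [a, z] against [m, p] overlaps; [a, f] against [m, p] does not.
print(range_intersect((ord("a"), ord("z")), (ord("m"), ord("p"))))  # (109, 112)
print(range_intersect((ord("a"), ord("f")), (ord("m"), ord("p"))))  # None
```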
In satisfaction mode, supported by all backends, Catra tries to satisfy the constraints expressed by the input file, reporting Sat with register assignments, or Unsat. Additionally, the baseline and PC* backends also support generating the Presburger formula describing the constraints of the input file, i.e., computing a closed-form representation of the complete Parikh image.

Implementing the Baseline
As a baseline, we use the same Presburger solver (Princess), input-file parser, and automaton implementation as Catra, in order to better isolate the impact of the calculus rules themselves. We adapt [Verma et al. 2005] by computing the Parikh image of each automaton as a Presburger term and adding the terms to Princess. We compute the product incrementally, term by term, checking satisfiability at each step. We use a priority queue to select automata for each step, ordered by their number of transitions. This heuristic puts off computing large (and therefore slow) products until we have to, hoping to find an empty intermediate product first. This is roughly similar to the approach taken in [Janků and Turoňová 2020].
Algorithm 1 summarises how we implement the baseline approach: given automata A₁, ..., Aₙ and the remaining constraints, we instantiate a theorem prover, assert the constraints, and fold the automata into a product while checking satisfiability after each step, returning Sat or Unsat. As an optimisation, our automata (including intermediate products) have dead states eliminated during construction: any automaton we produce contains only states that are both reachable from the initial state and have a path to an accepting state. We never perform any other minimisation on the automata for either backend; more complex minimisation was left out since minimising automata with counters is non-trivial.
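The priority-queue-driven loop can be sketched as follows. This is an illustrative outline, not the Scala implementation; the product, is_empty, and still_satisfiable callbacks are placeholders for the real automaton product and the Princess satisfiability check:

```python
import heapq

def incremental_product(automata, product, is_empty, still_satisfiable):
    """Baseline sketch: repeatedly combine the two smallest automata
    (by transition count), checking satisfiability after each step in
    the hope of finding an empty intermediate product early."""
    heap = [(len(a["transitions"]), i, a) for i, a in enumerate(automata)]
    heapq.heapify(heap)
    counter = len(automata)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        (_, _, a), (_, _, b) = heapq.heappop(heap), heapq.heappop(heap)
        p = product(a, b)
        if is_empty(p) or not still_satisfiable(p):
            return "Unsat"
        heapq.heappush(heap, (len(p["transitions"]), counter, p))
        counter += 1
    return "Sat"

# Toy stand-ins: the "product" just concatenates transition lists.
auts = [{"transitions": [1, 2]}, {"transitions": [1]}, {"transitions": [1, 2, 3]}]
prod = lambda a, b: {"transitions": a["transitions"] + b["transitions"]}
print(incremental_product(auts, prod, lambda p: not p["transitions"], lambda p: True))  # Sat
```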

Heuristics and Search Strategies
PC* as described in Sections 5 and 6 leaves some choices unspecified, including the priority of rules and the order of their arguments. In this section, we address these choices and describe additional implementation details and techniques used to enhance Catra.

7.2.1 Splitting, Materialisation, and Propagation. We order our rule applications as follows: first propagate connectedness if possible, then perform materialisation if tractable (as defined below), and finally resort to splitting as a last resort.
In addition to applying Split as described in Table 1 to randomly selected transitions, we prefer splits that sever a strongly connected component (SCC) from the initial state. We randomly select an automaton in which we can compute a cut between an SCC and the initial state, that is, where the SCC does not contain the initial state and where the sum of the transition variables in the cut is not already known to be positive. If there are multiple such SCCs, we choose one randomly. We then split on the sum of the transition variables of the cut as if it were a regular transition variable, i.e., on whether the sum is zero or non-zero. In this way, we drive PC* towards applying Prop.
The implementation of the connectedness constraint is opportunistic and straightforward. We compute a set of dead states by performing forward and backward reachability computations on an automaton, disregarding any transition whose associated variable is known to be zero. We then add clauses ensuring that every transition variable associated with a transition starting in a dead state is zero.
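A sketch of this dead-state computation in Python (our own encoding; the real implementation operates on Catra's automaton representation inside Princess):

```python
def dead_states(states, transitions, initial, accepting, sigma):
    """Forward reachability from the initial state and backward
    reachability from the accepting states, both ignoring transitions
    whose variable is known to be zero; everything outside the
    intersection is dead, and its transition variables can be zeroed."""
    live = [t for t in transitions if sigma.get(t, 1) != 0]

    def closure(seed, step):
        seen, todo = set(seed), list(seed)
        while todo:
            q = todo.pop()
            for r in step(q):
                if r not in seen:
                    seen.add(r)
                    todo.append(r)
        return seen

    fwd = closure({initial}, lambda q: [b for (a, _, b) in live if a == q])
    bwd = closure(set(accepting), lambda q: [a for (a, _, b) in live if b == q])
    return states - (fwd & bwd)

ts = [("q0", "a", "q1"), ("q1", "b", "q2"), ("q0", "c", "q3")]
print(dead_states({"q0", "q1", "q2", "q3"}, ts, "q0", {"q2"}, {t: 1 for t in ts}))
# q3 cannot reach the accepting state, so it is dead
```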
Product materialisation is the final piece of the puzzle. In the current implementation, we put off computing intermediate products until at most six transition variables of one of the automata are not yet known to be either always used (> 0) or always unused (= 0). The number was chosen experimentally. The other automaton for the product is selected randomly.

7.2.2 Clause Learning. Catra enables clause learning by default when using our backend, as it has been experimentally shown to increase performance in aggregate (though not on every instance). We currently only implement minimal clause learning based on forward-reachability cuts; no sophisticated clause learning for products has been implemented.

7.2.3 Random Restarts. Finally, we perform restarts scaled by the Luby series [Luby et al. 1993]. Experiments show this to yield a large improvement in performance, which is unsurprising given how many random choices we make during solving and how tail-heavy our problem is.
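The Luby series used to scale the restart intervals is easy to compute; a standard recursive definition as a sketch:

```python
def luby(i):
    """The i-th element (1-indexed) of the Luby restart series
    1, 1, 2, 1, 1, 2, 4, 1, 1, 2, ... [Luby et al. 1993]."""
    k = 1
    while (1 << k) - 1 < i:
        k += 1
    if (1 << k) - 1 == i:       # i = 2^k - 1: a fresh power of two
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)  # otherwise recurse into the prefix

print([luby(i) for i in range(1, 11)])  # [1, 1, 2, 1, 1, 2, 4, 1, 1, 2]
```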

EVALUATION
We evaluate the performance of Catra on 37 497 instances of Parikh automata intersection problems generated by the Ostrich+ string constraint solver [Chen et al. 2020] when solving the PyEx benchmarks, string constraints from symbolic execution that are known to be hard for many solvers [Reynolds et al. 2017]. After generating an initial 38 227 instances, we removed 314 instances solved in under five seconds by the baseline, as well as 416 instances that were duplicates of other instances in the set.
We also attempted to benchmark instances generated by Ostrich+ solving the Kaluza benchmarks [Saxena et al. 2010] (38 227 instances), but discarded them since they all turned out to be trivial for our tool Catra: every instance could be solved in under five seconds, with a mean runtime of about 0.1 s. The benchmarks are run on commit d54e33b of Catra.
The benchmarks for PC* and nuXmv were executed in parallel on a server running Ubuntu 22.04.3 LTS with an AMD Ryzen 9 5900X processor at 2.27 GHz and 12 cores, with 4 threads sharing the same JVM. The baseline was executed with one JVM instance per input, as 6 parallel jobs with a 4 GB heap each, on an Intel i5-10600 (3.3 GHz, 6 cores, 12 threads). Simultaneously, StarExec ran the same experiment and produced almost identical results.
We compiled the code using Scala 2.13.12 and executed the experiments on OpenJDK Java 1.8 with a maximum heap of 100 GB. We used nuXmv version 2.0.0, invoked as a subprocess for each instance. Instances were executed in batches of 10, each batch given a fresh JVM. Each JVM was warmed up for 10 s on a random benchmark from the set before starting to execute. We believe this represents a realistic use case where PC* is used to support, e.g., a string solver. Experiments were executed in random order for all backends. Each instance was given a time budget of 30 s.
All runtimes are measured in wall-clock time as observed by the JVM when executing the instance, and exclude time spent parsing (usually far below 0.1 s).

Execution Time and Ability to Solve Instances
In Fig. 6a, we show how many of the 37 497 instances each back-end could solve. A summary of outcomes by instance type is also available in Table 3. Note that many instances lack a ground truth, as they are solved by only one backend. We see that PC* generally outperforms nuXmv at determining unsatisfiability, as does the baseline, while PC* and nuXmv have similar performance on satisfiable instances and both outperform the baseline there. The baseline performs worse on satisfiable instances because it executes a heuristic meant to detect unsatisfiability early, similar to [Janků and Turoňová 2020]; the heuristic remains enabled since its benefit is significant compared to the extra cost.
Finally, in Figs. 6b, 7a and 7b we compare runtimes on solved instances between the backends. We see that nuXmv has a more even spread of runtimes up to 20 s, while both PC* and the baseline tend to solve their instances quickly or not at all, though PC* does have a long tail of outlier instances that finish as the timeout increases. Notably, Fig. 7a shows that any instance solved by the baseline within 30 s is also solved by PC* in under 5 s. A cactus plot showing the number of instances solved within a given timeout can be seen in Fig. 6c. We see that PC* outperforms nuXmv in general, but that nuXmv might scale better at very long runtimes. This is likely due to two factors. First, nuXmv is more mature than PC*, and its more general model-checking methods might pay off on more difficult instances compared to problem-specific methods. Second, for longer-running computations, clause learning as described in Section 7.2.2 might matter more; as clause learning in Catra is rudimentary and generalises poorly across product computations, this costs performance.

Evaluation in Ostrich
To evaluate the effectiveness of PC* in a string solver, Catra was experimentally integrated as the Parikh automata product solver into the string solver Ostrich version 1.3. Ostrich is an independently developed solver that participated in the recent SMT-COMP 2023, winning the single-query track for quantifier-free strings (QF_S), as well as dominating other solvers on unsatisfiable string benchmarks [SMT-COMP 2023]. For our experiments, we modified the CE-Str solver to apply Catra (with PC* as its chosen backend) instead of the previous baseline method, resulting in a new back-end CA-Str. Combining the results from SMT-COMP for Ostrich 1.3 with our new results running CA-Str, we construct a virtual portfolio Ostrich+CA that simulates running CA-Str as a fourth back-end of Ostrich. We similarly combine the results of all non-Ostrich solvers to obtain the virtual portfolio solver Competition.
We extend the results of SMT-COMP 2023 on the single-query track for string solving with linear integer arithmetic constraints (QF_SLIA) with our modified Ostrich. We picked QF_SLIA since Ostrich already performed well on the other two string-solving tracks; in fact, every solver, including Ostrich, handled every benchmark in QF_SNIA within the timeout. Additionally, Parikh intersection problems are mainly generated by Ostrich when solving constraints involving integers, meaning that CA-Str would be of little or no help on the QF_S track.
We obtained the results by executing Ostrich 1.3 with CA-Str on the same benchmarking infrastructure (the StarExec cluster) and configuration that ran SMT-COMP 2023, combining our new results for CA-Str with the published results of SMT-COMP 2023. Where available, we use the revised, out-of-competition version of the results for solvers with bugs discovered during the competition, including Ostrich and Z3-Noodler. Note that Z3-Noodler abstained on 70 instances, and thus has a lower total number of results. We used the parallel-track results for all solvers.
As can be seen from Table 4 and Fig. 8, integrating Catra as a back-end leads to gains on both satisfiable and unsatisfiable problems. On satisfiable problems, the combination Ostrich+CA is still outperformed by cvc5 and z3alpha. On unsatisfiable benchmarks, Ostrich+CA now narrowly beats the other solvers, squeezing past cvc5, which is promising given that Ostrich 1.3 (without CA-Str), cvc5, and z3alpha all show very strong performance on this class of benchmarks.
These results are not unexpected. It is known that automata-based string solvers (Ostrich, Z3-Noodler) tend to perform better on unsatisfiable than on satisfiable benchmarks, compared to solvers that do not utilize automata and work directly on regular expressions (cvc5, z3alpha). The computation of automata representations of regular constraints can be expensive, and might be unnecessary for satisfiable formulas. In addition, the algorithm in Ostrich has stronger theoretical completeness guarantees than cvc5 and Z3-Str, and ensuring completeness often has an adverse effect on performance in practice [Barbosa et al. 2022; Chen et al. 2019; Zheng et al. 2013].
The results show that PC* can be used to enhance the performance of an automaton-based string solver. Moreover, the cactus plot in Fig. 8 illustrates that CA-Str is immediately useful, boosting the results for the Ostrich portfolio even at the first datapoint (though marginally). These results should be considered preliminary, however, as we believe that a deeper integration of Catra into string solvers can lead to significant performance gains. In particular, the integration layer is currently too shallow to allow Ostrich to learn clauses generated by Catra, and it additionally incurs overhead from serialising and deserialising the current Parikh automata problem into Catra's input format.

Threats to Validity
The most obvious threat to validity would be an unsound implementation. To address this, we have validated all solutions reported by PC* with nuXmv. A previous version contained a race condition involving random restarts during product materialisation, causing non-deterministic unsoundness on 0.7 % of instances. Additionally, both we and the artefact evaluation committee independently discovered a soundness issue in CA-Str on one instance in QF_S, where the instance 20230329-automatark-lu/instance08425.smt2 was incorrectly reported as Unsat. The underlying issue was that Catra generated too broad a blocking clause for a product without transitions. We implemented a fix and re-evaluated all benchmarks; performance was not measurably affected.
Since addressing these bugs, we have observed no further soundness issues in either Catra or CA-Str. Benchmarking results for both runs are included in the artefact of this paper and show virtually identical performance characteristics. We have additionally executed all of the benchmarks on machines with widely different performance characteristics and have observed the trends to be robust.
The second threat to validity is our implementation of automata operations. As PC* by design offloads some of the product computation onto Princess, the baseline could be unfairly disadvantaged by a slow automata-product implementation. We believe this is not an issue, since similar performance problems with the baseline approach have been reported for other string solvers, as well as for a previous implementation in Ostrich. Additionally, profiling shows that the baseline spends most of its time in Princess, suggesting that the automaton implementation is not the bottleneck. Finally, the difference in performance between PC* and the baseline was unaffected by significant optimisation of the automata library, further strengthening this conclusion.

CONCLUSION
In this paper, we have introduced PC*, a calculus for computing commutative operations on intersections of regular languages. We have evaluated it on 37 497 Parikh automata intersection problems generated by the Ostrich+ string solver [Chen et al. 2020] solving the PyEx benchmark suite [Reynolds et al. 2017], using our Parikh automata solver Catra.
Within Catra, PC* shows excellent solve-time performance compared to the baseline approach laid out in [Verma et al. 2005] when implemented on the same underlying automated theorem prover (Princess [Rümmer 2008]). It is also competitive with the nuXmv model checker [Cavada et al. 2014], outperforming it on unsatisfiable instances and generally outperforming it for timeouts under 30 seconds, with its advantage increasing drastically for even shorter timeouts. Thirty seconds would generally be considered a long timeout for our intended use as supporting infrastructure for a string constraint solver.
Future investigations involve two tracks. The first is integration into existing string solvers (with Ostrich being a particularly promising candidate due to its shared use of Princess), and further adaptation to that use case. Closer inspection of the instances on which we currently time out should help us further improve our heuristics.
The second track for future improvements is the extension to other problem domains, including other logics and model checking problems, as well as to more powerful automata models such as transducers. In principle, we can already express constraints stronger than those of plain Parikh automata, since our use of an automated theorem prover allows rich constraints on counter variables.
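As a hypothetical illustration of such richer counter constraints (our example, not taken from the paper), one could conjoin a Parikh-style count relation with further Presburger side conditions:

```latex
|w|_a = 2\,|w|_b \;\wedge\; |w|_a + |w|_b \equiv 1 \pmod{3} \;\wedge\; |w|_a > 5
```

where $|w|_c$ denotes the number of occurrences of the letter $c$ in $w$. Such conditions remain Presburger-definable, but with a Presburger-capable prover like Princess they need not be re-encoded into the automaton structure: the prover can handle them directly as constraints on the counter variables.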

DATA-AVAILABILITY STATEMENT
A reproduction package featuring all relevant logs and all benchmarked instances is available [Stjerna and Rümmer 2024]. Catra and Ostrich are both available as living software under a 3-clause BSD license on GitHub, at https://github.com/amandasystems/catra and https://github.com/uuverifiers/ostrich respectively. A version of Ostrich using CA-Str is available in the cea-catra branch. Code for the paper itself is available at https://github.com/amandasystems/oopsla-artefact.

Fig. 1. The collection of automata we use as running examples, both derived from Example 2.1.

Fig. 4. A proof tree for Constraint (i) and Constraints (iii) to (v) from Example 2.1, corresponding to handling the Parikh automaton A′ of Fig. 1a.

Fig. 5. A derivation for PC* on the Parikh image strings for the constraints of Example 2.1. Note the constant propagation for !

Fig. 8. The number of instances solved in the QF_SLIA track of SMT-COMP 2023 as the time budget increases.

Table 1. Derivation rules for one automaton.

Lemma 5.2 (Rules in Table 1 are Solution-Preserving). Consider an application of one of the rules in Table 1, with premises P_1, ..., P_n and conclusion C. An assignment satisfies the conclusion if and only if it satisfies one of the premises.

Table 2. Additional derivation rules for products of arbitrarily many automata.

Table 3. Number of successful results within a timeout of 30 s. Instances solved by no backend within the timeout (about half of the set) are omitted from the table. To investigate scaling, we additionally execute 2 000 randomly sampled (without replacement) benchmarks with a 120-second timeout. We add runs of PC* with clause learning and restarts disabled, and with only clause learning disabled, to show the impact of Section 7.2.3 and Section 7.2.2 respectively. These experiments are executed on a smaller machine, a six-core AMD Ryzen 5 2600, but with the same Java version. Their max heap size is set to 20 GB since this system has less RAM. The baseline experiments are executed separately, using only two threads.

Table 4. Number of solved benchmarks in the set of quantifier-free strings with linear integer arithmetic constraints (QF_SLIA) at SMT-COMP 2023. The numbers are from the competition results, except for CA-Str, which we executed on the same cluster as the competition with the same resources, and the two virtual portfolio solvers Ostrich+CA and Competition, which aggregate the best results of Ostrich/CA-Str and of all non-Ostrich competitors respectively.