On the Communication Complexity of Approximate Pattern Matching

The decades-old Pattern Matching with Edits problem, given a length-$n$ string $T$ (the text), a length-$m$ string $P$ (the pattern), and a positive integer $k$ (the threshold), asks to list all fragments of $T$ that are at edit distance at most $k$ from $P$. The one-way communication complexity of this problem is the minimum amount of space needed to encode the answer so that it can be retrieved without accessing the input strings $P$ and $T$. The closely related Pattern Matching with Mismatches problem (defined in terms of the Hamming distance instead of the edit distance) is already well understood from the communication complexity perspective: Clifford, Kociumaka, and Porat [SODA 2019] proved that $\Omega(n/m \cdot k \log(m/k))$ bits are necessary and $O(n/m \cdot k\log (m|\Sigma|/k))$ bits are sufficient; the upper bound allows encoding not only the occurrences of $P$ in $T$ with at most $k$ mismatches but also the substitutions needed to make each $k$-mismatch occurrence exact. Despite recent improvements in the running time [Charalampopoulos, Kociumaka, and Wellnitz; FOCS 2020 and 2022], the communication complexity of Pattern Matching with Edits remained unexplored, with a lower bound of $\Omega(n/m \cdot k\log(m/k))$ bits and an upper bound of $O(n/m \cdot k^3\log m)$ bits stemming from previous research. In this work, we prove an upper bound of $O(n/m \cdot k \log^2 m)$ bits, thus establishing the optimal communication complexity up to logarithmic factors. We also show that $O(n/m \cdot k \log m \log (m|\Sigma|))$ bits allow encoding, for each $k$-error occurrence of $P$ in $T$, the shortest sequence of edits needed to make the occurrence exact. We leverage the techniques behind our new result on the communication complexity to obtain quantum algorithms for Pattern Matching with Edits.


ABSTRACT
The decades-old Pattern Matching with Edits problem, given a length-$n$ string $T$ (the text), a length-$m$ string $P$ (the pattern), and a positive integer $k$ (the threshold), asks to list all fragments of $T$ that are at edit distance at most $k$ from $P$. The one-way communication complexity of this problem is the minimum amount of space needed to encode the answer so that it can be retrieved without accessing the input strings $P$ and $T$.
The closely related Pattern Matching with Mismatches problem (defined in terms of the Hamming distance instead of the edit distance) is already well understood from the communication complexity perspective: Clifford, Kociumaka, and Porat [SODA 2019] proved that $\Omega(n/m \cdot k \log(m/k))$ bits are necessary and $O(n/m \cdot k \log(m|\Sigma|/k))$ bits are sufficient; the upper bound allows encoding not only the occurrences of $P$ in $T$ with at most $k$ mismatches but also the substitutions needed to make each $k$-mismatch occurrence exact.
Despite recent improvements in the running time [Charalampopoulos, Kociumaka, and Wellnitz; FOCS 2020 and 2022], the communication complexity of Pattern Matching with Edits remained unexplored, with a lower bound of $\Omega(n/m \cdot k \log(m/k))$ bits and an upper bound of $O(n/m \cdot k^3 \log m)$ bits stemming from previous research. In this work, we prove an upper bound of $O(n/m \cdot k \log^2 m)$ bits, thus establishing the optimal communication complexity up to logarithmic factors. We also show that $O(n/m \cdot k \log m \log(m|\Sigma|))$ bits allow encoding, for each $k$-error occurrence of $P$ in $T$, the shortest sequence of edits needed to make the occurrence exact. Our result further emphasizes the close relationship between Pattern Matching with Mismatches and Pattern Matching with Edits.

INTRODUCTION
While a string is perhaps the most basic way to represent data, this fact makes algorithms working on strings more applicable and powerful. Arguably, the very first thing to do with any kind of data is to find patterns in it. The Pattern Matching problem for strings and its variations are thus perhaps among the most fundamental problems that Theoretical Computer Science has to offer.
In this paper, we study the practically relevant Pattern Matching with Edits variation [33]. Given a text string $T$ of length $n$, a pattern string $P$ of length $m$, and a threshold $k$, the aim is to calculate the set $\mathrm{Occ}^E_k(P, T)$ consisting of (the starting positions of) all the fragments of $T$ that are at most $k$ edits away from the pattern $P$. In other words, we compute the set of $k$-error occurrences of $P$ in $T$, more formally defined as
$$\mathrm{Occ}^E_k(P, T) := \{\, i \in [0..n] : \exists_{j \in [i..n]}\ \delta_E(P, T[i..j)) \le k \,\},$$
where we utilize the classical edit distance $\delta_E$ (also referred to as the Levenshtein distance) [32] as the distance measure. Here, an edit is either an insertion, a deletion, or a substitution of a single character.

Pattern Matching with Edits
Input: a pattern $P$ of length $m$, a text $T$ of length $n$, and an integer threshold $k > 0$.
Output: the set $\mathrm{Occ}^E_k(P, T)$.
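To make the problem statement concrete, the following Python sketch computes the set of $k$-error occurrences by a quadratic-time dynamic program in the spirit of the classical algorithms cited below. It is only a reference implementation under our own naming (the function `k_error_occurrences` is not taken from any of the cited papers): the textbook recurrence reports ending positions, so the sketch runs it on the reversed strings to obtain starting positions.

```python
def k_error_occurrences(pattern: str, text: str, k: int) -> set:
    """Starting positions i such that some fragment text[i:j] is within
    edit distance k of pattern (quadratic-time reference implementation)."""
    m, n = len(pattern), len(text)
    # The classical DP finds *ending* positions; running it on the reversed
    # strings turns them into starting positions of the original text.
    rp, rt = pattern[::-1], text[::-1]
    prev = list(range(m + 1))            # column for the empty text prefix
    starts = {n} if prev[m] <= k else set()
    for j in range(1, n + 1):
        cur = [0] * (m + 1)              # cur[0] = 0: a fragment may begin anywhere
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i - 1] + (rp[i - 1] != rt[j - 1]),  # match or substitution
                prev[i] + 1,                              # extra text character
                cur[i - 1] + 1,                           # unmatched pattern character
            )
        if cur[m] <= k:
            starts.add(n - j)            # fragment of rt ending at j starts at n - j
        prev = cur
    return starts
```

For instance, with pattern `abc` and text `xxabcxx`, the only exact occurrence starts at position 2, whereas one edit additionally admits the fragments `xabc` (position 1) and `bc` (position 3).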
Even though the Pattern Matching with Edits problem is almost as classical as it can get, with key algorithmic advances (from $O(nm)$ time down to $O(nk)$ time) dating back to the early and late 1980s [30,31,33], major progress has been made even very recently, when Charalampopoulos, Kociumaka, and Wellnitz [16] obtained an $\tilde{O}(n + k^{3.5} n/m)$-time solution and thereby broke through the 20-year-old barrier of the $O(n + k^4 n/m)$-time algorithm by Cole and Hariharan [20]. And the journey is far from over yet: the celebrated Orthogonal-Vectors-based lower bound for edit distance [5] rules out only $O(n + k^{2-\Omega(1)} n/m)$-time algorithms (also consult [16] for details), leaving open a wide area of uncharted algorithmic territory. In this paper, we provide tools and structural insights that, we believe, will aid the exploration of the said territory.
We add to the picture a powerful new finding that sheds new light on the solution structure of the Pattern Matching with Edits problem; similar structural results [11,15] form the backbone of the aforementioned breakthrough [16]. Specifically, we investigate how much space is needed to store all $k$-error occurrences of $P$ in $T$. We know from [15] that $O(n/m \cdot k^3 \log m)$ bits suffice since one may report the occurrences as $O(k^3)$ arithmetic progressions if $n = O(m)$. However, such complexity is likely incompatible with algorithms running faster than $\tilde{O}(n + k^3 n/m)$. In this paper, we show that, indeed, $O(n/m \cdot k \log^2 m)$ bits suffice to represent the set $\mathrm{Occ}^E_k(P, T)$.
Formally, the communication complexity of Pattern Matching with Edits measures the space needed to encode the output so that it can be retrieved without accessing the input. We may interpret this setting as a two-party game: Alice is given an instance of the problem and constructs a message for Bob, who must be able to produce the output of the problem given Alice's message. Since Bob does not have any input, it suffices to consider one-way single-round communication protocols.
Main Theorem 1. The Pattern Matching with Edits problem admits a one-way deterministic communication protocol that sends $O(n/m \cdot k \log^2 m)$ bits. Within the same communication complexity, one can also encode the family of all fragments $T[i..j)$ that satisfy $\delta_E(P, T[i..j)) \le k$, as well as all optimal alignments $P \leadsto T[i..j)$ for each of these fragments. Further, increasing the communication complexity to $O(n/m \cdot k \log m \log(m|\Sigma|))$, where $\Sigma$ denotes the input alphabet, one can also retrieve the edit information for each optimal alignment.
Observe that our encoding scheme suffices to retrieve not only the set $\mathrm{Occ}^E_k(P, T)$ (which contains only starting positions of the $k$-error occurrences) but also the fragments of $T$ with edit distance at most $k$ from $P$. In other words, it allows retrieving all pairs $0 \le i \le j \le n$ such that $\delta_E(P, T[i..j)) \le k$.
We complement Main Theorem 1 with a simple lower bound that shows that our result is tight (essentially up to one logarithmic factor).
Observe that our lower bound holds for the very simple case that the pattern is the all-zeros string and only the text contains nonzero characters.In this case, the edit distance of the pattern and another string depends only on the length and the number of nonzero characters in the other string, and we can thus easily compute the edit distance in linear time.
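The linear-time computation mentioned above can be sketched as follows. The closed formula in `distance_to_zero_pattern` is our own derivation and not quoted from the paper: if the other string is at least as long as the pattern $0^m$, the surplus characters are deleted (nonzero ones first) and the remaining nonzero characters are substituted; otherwise, the missing zeros are inserted and every nonzero character is substituted. The helper `edit_distance` is a textbook dynamic program used only to cross-check the formula.

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook quadratic-time edit distance, used only for cross-checking."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                           prev[j] + 1,               # delete ca
                           cur[j - 1] + 1))           # insert cb
        prev = cur
    return prev[-1]


def distance_to_zero_pattern(m: int, other: str) -> int:
    """Edit distance between the all-zeros string of length m and `other`,
    computed from the length and the number of nonzero characters only."""
    length = len(other)
    nonzero = sum(ch != "0" for ch in other)
    if length >= m:
        # Delete length - m characters (nonzero ones first), substitute the rest.
        return max(length - m, nonzero)
    # Insert the missing zeros and substitute every nonzero character.
    return (m - length) + nonzero
```

The formula runs in time linear in the length of the other string, whereas the cross-check takes quadratic time.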
From Structural Insights to Better Algorithms: A Success Story. Let us take a step back and review how structural results aided the development of approximate-pattern-matching algorithms in the recent past.
First, let us review the key insight of [15] that led to the breakthrough of [16]. Crucially, the authors use that, for any pair of strings $P$ and $T$ with $|T| \le 3/2 \cdot |P|$, either (a) $P$ has at most $O(k^2)$ occurrences with at most $k$ edits in $T$, or (b) $P$ and the relevant part of $T$ are at edit distance $O(k)$ to periodic strings with the same period. This insight helps as follows: First, one may derive that, indeed, all $k$-error occurrences of $P$ in $T$ form $O(k^3)$ arithmetic progressions. Second, it gives a blueprint for an algorithm: one has to tackle just two important cases: an easy nonperiodic case, where $P$ and $T$ are highly unstructured and $k$-error occurrences are rare, and a not-so-easy periodic case, where $P$ and $T$ are highly repetitive and occurrences are frequent but appear in a structured manner.
The structural insights of [15] have found widespread other applications.For example, they readily yielded algorithms for differentially private approximate pattern matching [35], approximate circular pattern matching problems [13,14,17], and they even played a key role in obtaining small-space algorithms for (online) language distance problems [6], among others.
Interestingly, an insight similar to the one of [15] was first obtained in [11] for the much easier problem of Pattern Matching with Mismatches (where we allow neither insertions nor deletions) before being tightened and ported to Pattern Matching with Edits in [15]. Similarly, in this paper, we port a known communication complexity bound from Pattern Matching with Mismatches to Pattern Matching with Edits, albeit with a much more involved proof. As proved in [19], the Pattern Matching with Mismatches problem admits a one-way deterministic $O(k \log(m|\Sigma|/k))$-bit communication protocol. While we discuss later (in the Technical Overview) the result of [19] as well as the challenges in porting it to Pattern Matching with Edits, let us highlight here that their result was crucial for obtaining an essentially optimal streaming algorithm for Pattern Matching with Mismatches.
Finally, let us discuss the future potential of our new structural results. First, as a natural generalization of [19], $\hat{O}(k)$-space algorithms for Pattern Matching with Edits should be plausible in the semi-streaming and (more ambitiously) streaming models, because $\hat{O}(k)$-size edit distance sketches have been developed in parallel to this work [29]. Nevertheless, such results would also require $\hat{O}(k)$-space algorithms constructing sketches and recovering the edit distance from the two sketches, and [29] does not provide such space-efficient algorithms. Second, our result sheds more light on the structure of the non-periodic case of [15]: as it turns out, when relaxing the notion of periodicity even further, we obtain a periodic structure also for patterns with just a (sufficiently large) constant number of $k$-error occurrences. This opens up a perspective for classical Pattern Matching with Edits algorithms that are even faster than $\tilde{O}(n + k^3 n/m)$.
Application of our Main Result: Quantum Pattern Matching with Edits. As a fundamental problem, Pattern Matching with Edits has been studied in a plethora of settings, including the compressed setting [9,15,23,36], the dynamic setting [15], and the streaming setting [8,28,34], among others. However, so far, the quantum setting remains vastly unexplored. While quantum algorithms have been developed for Exact Pattern Matching [26], Pattern Matching with Mismatches [27], Longest Common Factor (Substring) [2,22,27], Lempel-Ziv factorization [24], as well as other fundamental string problems [1,4,10,18,37], no quantum algorithm for Pattern Matching with Edits has been known so far. The challenge posed by Pattern Matching with Edits, in comparison to Pattern Matching with Mismatches, arises already from the fact that, while the computation of Hamming distance between two strings can be easily accelerated in the quantum setting, the same is not straightforward for the edit distance case. Only very recently, Gibney, Jin, Kociumaka, and Thankachan [24] demonstrated a quantum edit-distance algorithm with the optimal query complexity of $\tilde{O}(\sqrt{kn})$ and the time complexity of $\tilde{O}(\sqrt{kn} + k^2)$. We follow the long line of research on quantum algorithms on strings and employ our new structural results (combined with the structural results from [15]) to obtain the following quantum algorithms for the Pattern Matching with Edits problem.
Main Theorem 3. Let $P$ denote a pattern of length $m$, let $T$ denote a text of length $n$, and let $k > 0$ denote an integer threshold.
Surprisingly, for $n = O(m)$, we achieve the same query complexity as quantum algorithms for computing the (bounded) edit distance [24] and even the bounded Hamming distance of strings (a simple application of Grover search yields an $\tilde{O}(\sqrt{kn})$ upper bound; a matching $\Omega(\sqrt{kn})$ lower bound is also known [7]). While we did not optimize the time complexity of our algorithms (reasonably, one could expect a time complexity of $\tilde{O}(n/m \cdot (\sqrt{km} + k^{3.5}))$ based on our structural insights and [16]), we show that our query complexity is essentially optimal by proving a matching lower bound.
Again, our lower bounds hold already for the case when the pattern is the all-zeroes string and just the text contains nonzero entries.

TECHNICAL OVERVIEW
In this section, we describe the technical contributions behind our positive results: Main Theorems 1 and 3. We assume that $n \le 3/2 \cdot m$ (if the text is longer, one may split the text into $O(n/m)$ overlapping pieces of length $O(m)$ each) and that $k = o(m)$ (for $k = \Theta(m)$, our results trivialize). Due to space constraints, we defer the proofs and the technical details to the full version.
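The reduction to short texts can be sketched in Python as follows. The concrete constants are our own choice and not taken from the paper: the pieces start at multiples of `step = m // 2 - k` and have length `step + m + k - 1`, which is at most $3/2 \cdot m$, so a $k$-error occurrence starting within the first `step` positions of a piece (its length is at most $m + k$) lies entirely inside that piece. The sketch assumes $k < m/2$ and reuses a quadratic-time reference routine under the hypothetical name `k_error_occurrences`.

```python
def k_error_occurrences(pattern: str, text: str, k: int) -> set:
    """Quadratic-time reference: starts i of fragments within k edits of pattern."""
    m, n = len(pattern), len(text)
    rp, rt = pattern[::-1], text[::-1]   # reversal turns end positions into starts
    prev = list(range(m + 1))
    starts = {n} if prev[m] <= k else set()
    for j in range(1, n + 1):
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cur[i] = min(prev[i - 1] + (rp[i - 1] != rt[j - 1]),
                         prev[i] + 1, cur[i - 1] + 1)
        if cur[m] <= k:
            starts.add(n - j)
        prev = cur
    return starts


def split_into_pieces(n: int, m: int, k: int) -> list:
    """Half-open intervals [start, end) of overlapping pieces of length <= 3m/2."""
    step = m // 2 - k
    if step < 1:
        raise ValueError("this sketch assumes k < m/2")
    length = step + m + k - 1            # equals m // 2 + m - 1
    return [(s, min(n, s + length)) for s in range(0, n, step)]


def occurrences_via_pieces(pattern: str, text: str, k: int) -> set:
    """Union of the occurrences found in each piece, shifted to text positions."""
    result = set()
    for start, end in split_into_pieces(len(text), len(pattern), k):
        piece = text[start:end]
        result |= {start + i for i in k_error_occurrences(pattern, piece, k)}
    return result
```

Since $k = o(m)$, the step is $\Theta(m)$, so the number of pieces is $O(n/m)$.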

Communication Complexity of Pattern Matching with Mismatches
Before we tackle Main Theorem 1, it is instructive to learn how to prove an analogous result for Pattern Matching with Mismatches.
Compared to the original approach of Clifford, Kociumaka, and Porat [19], we neither optimize logarithmic factors nor provide an efficient decoding algorithm; this enables significant simplifications.
Recall that our goal is to encode the set $\mathrm{Occ}^H_k(P, T)$, which is the Hamming-distance analog of the set $\mathrm{Occ}^E_k(P, T)$. Formally, we set
$$\mathrm{Occ}^H_k(P, T) := \{\, i \in [0..n-m] : \delta_H(P, T[i..i+m)) \le k \,\}.$$
Without loss of generality, we assume that $\{0, n-m\} \subseteq \mathrm{Occ}^H_k(P, T)$, that is, $P$ has $k$-mismatch occurrences both as a prefix and as a suffix of $T$. Otherwise, either we have $\mathrm{Occ}^H_k(P, T) = \emptyset$ (which can be encoded trivially), or we can crop $T$ by removing the characters to the left of the leftmost $k$-mismatch occurrence and to the right of the rightmost $k$-mismatch occurrence.
Encoding All $k$-Mismatch Occurrences. First, if $k = 0$, as a famous consequence of the Periodicity Lemma [21], the set $\mathrm{Occ}^H_k(P, T)$ is guaranteed to form a single arithmetic progression (recall that $n \le 3/2 \cdot m$), and thus it can be encoded using $O(\log n)$ bits. Consult Figure 1 for a visualization of an example.
If $k > 0$, the set $\mathrm{Occ}^H_k(P, T)$ does not necessarily form an arithmetic progression. Still, we may consider the smallest arithmetic progression that contains $\mathrm{Occ}^H_k(P, T)$ as a subset. Since we have $0 \in \mathrm{Occ}^H_k(P, T)$, the difference of this progression can be expressed as $g := \gcd(\mathrm{Occ}^H_k(P, T))$.
A crucial property of the $\gcd(\cdot)$ function is that, as we add elements to a set maintaining its greatest common divisor $g$, each insertion either does not change $g$ (if the inserted element is already a multiple of $g$) or results in the value $g$ decreasing by a factor of at least 2 (otherwise). Consequently, there is a set $\{0, n-m\} \subseteq S \subseteq \mathrm{Occ}^H_k(P, T)$ of size $|S| = O(\log n)$ such that $\gcd(S) = g$. The encoding that Alice produces consists of the set $S$ with each $k$-mismatch occurrence $i \in S$ augmented with the mismatch information for $P$ and $T[i..i+m)$, that is, the set $\{(j, P[j], T[i+j]) : j \in [0..m),\ P[j] \ne T[i+j]\}$.
Recovering the $k$-Mismatch Occurrences. It remains to argue that the encoding is sufficient for Bob to recover $\mathrm{Occ}^H_k(P, T)$. To that end, consider a graph $G_S$ whose vertices correspond to characters in $P$ and $T$. For every $i \in S$ and $j \in [0..m)$, the graph $G_S$ contains an edge between $P[j]$ and $T[i+j]$. If $P[j] = T[i+j]$, then the edge is black; otherwise, the edge is red and annotated with the values $P[j]$ and $T[i+j]$. Observe that Bob can reconstruct $G_S$ using the set $S$ and the mismatch information for the $k$-mismatch occurrences at positions $i \in S$.

(a) The pattern $P$ occurs in $T$ starting at the positions $0$, $\tau$, and $2\tau$; these starting positions form the arithmetic progression $(i\tau)_{0 \le i \le 2}$.
(b) Suppose that we were to identify an additional occurrence of $P$ in $T$ starting at position $4\tau$. Now, since occurrences start at $0$, $2\tau$, and $4\tau$, as well as at position $\tau$, we directly obtain that there is also an occurrence that starts at position $3\tau$ in $T$, which means that the arithmetic progression from Figure 1a is extended to $(i\tau)_{0 \le i \le 4}$. More generally, one may prove that any additional occurrence at a position $i\tau$ extends the existing arithmetic progression in a similar fashion.
(c) Suppose that we were to identify an additional occurrence of $P$ in $T$ starting at position $0 < \tau' < \tau$. Now, similarly to Figure 1b, we can argue that there is also an occurrence that starts at every position of the form $i \cdot \gcd(\tau, \tau')$ (this is a consequence of the famous Periodicity Lemma due to [21]); this is again an arithmetic progression. Crucially, the difference of the arithmetic progression obtained in this fashion decreased by a factor of at least two compared to the initial arithmetic progression.
Figure 1: The structure of occurrences of exact pattern matching is easy: either all exact occurrences of $P$ in $T$ form an arithmetic progression or there is just one such occurrence (which we may also view as a degenerate arithmetic progression). Depicted is a text $T$ and exact occurrences starting at the positions denoted above the text; we may assume that there is an occurrence that starts at position 0 and that there is an occurrence that ends at position $|T| - 1$.
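The halving behaviour of the greatest common divisor can be illustrated with a short Python sketch. The greedy selection below is merely one way (our own, not necessarily the construction used in [19]) to pick a small subset of the occurrences with the same gcd as the whole set: starting from the first and the last occurrence (assumed to be $0$ and $n - m$), an occurrence is kept only if it changes the running gcd, in which case the gcd drops to a proper divisor and hence at least halves.

```python
from functools import reduce
from math import gcd


def greedy_gcd_subset(occurrences):
    """Return (chosen, g): a subset containing the smallest and largest
    occurrence whose gcd g equals the gcd of all occurrences."""
    occ = sorted(occurrences)
    chosen = [occ[0], occ[-1]]           # assumed to be 0 and n - m
    g = gcd(occ[0], occ[-1])
    for x in occ:
        new_g = gcd(g, x)
        if new_g != g:                   # x is not a multiple of g: g at least halves
            chosen.append(x)
            g = new_g
    return chosen, g


# Example: the occurrences 0, 12, 18, 24, 27, 36 have gcd 3.
example = {0, 12, 18, 24, 27, 36}
chosen, g = greedy_gcd_subset(example)
```

In the example, the running gcd evolves as 36, 12, 6, 3, so only three occurrences are added to the initial pair; in general, at most $\log_2(n - m)$ additions are possible.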
Next, we focus on the connected components of the graph $G_S$. We say that a component is black if all of its edges are black and red if it contains at least one red edge. Observe that Bob can reconstruct the values of all characters in red components: the annotations already provide this information for vertices incident to red edges, and since black edges connect matching characters, the values can be propagated along black edges, ultimately covering all vertices in red components. The values of characters in black components remain unknown, but each black component is guaranteed to be uniform, meaning that every two characters in a single black component match.
The last crucial observation is that the connected components of $G_S$ are very structured: for every remainder $c \in [0..g)$ modulo $g$, there is a connected component consisting of all vertices $P[i]$ and $T[i]$ with $i \equiv_g c$. This can be seen as a consequence of the Periodicity Lemma [21] applied to strings obtained from $P$ and $T$ by replacing each character with a unique identifier of its connected component. Consult Figure 2 for an illustration of an example for the special case when there are no mismatches and consult Figure 3 for a visualization of an example with mismatches. A convenient way of capturing Bob's knowledge about $P$ and $T$ is to construct auxiliary strings $P^\#$ and $T^\#$ obtained from $P$ and $T$, respectively, by replacing all characters in each black component with a sentinel character (unique for the component). Then, $\mathrm{Occ}^H_k(P, T) = \mathrm{Occ}^H_k(P^\#, T^\#)$ and the mismatch information is preserved for the $k$-mismatch occurrences.

(a) Compare Figure 1a. So far, we identified three occurrences of $P$ in $T$; each occurrence is an exact occurrence. Correspondingly, we have $S = \{(0, \emptyset), (\tau, \emptyset), (2\tau, \emptyset)\}$. With this set $S$, we obtain three different black components, which we depict with a circle, a diamond, or a star.
(b) The graph $G_S$ that corresponds to Figure 2a: observe how we collapsed the different patterns from Figure 2a into a single pattern $P$. In the example, we have three black components, that is, $\mathrm{bc}(G_S) = 3$.
(c) Suppose that we were to identify an additional occurrence of $P$ in $T$ starting at position $0 < \tau' < \tau$ (highlighted in purple). From Figure 1c, we know how the set of all occurrences changes, but (and this is the crucial point) we do not add all of these implicitly found occurrences to $S$, but just $\tau'$. In our example, we observe that the black components collapse into a single black component, which we depict with a cloud.
(d) The graph $G_S$ that corresponds to Figure 2c: observe how we collapsed the different patterns from Figure 2c into a single pattern $P$. Highlighted in purple are some of the edges that we added due to the new occurrence that we added to $S$. In the example, we have one black component, that is, $\mathrm{bc}(G_S) = 1$.
(e) Recovering an occurrence in $G_S$ from Figure 2d that starts at position $\gcd(\tau, \tau')$, illustrated for the first character of the pattern.
Figure 2: Compare Figure 1: we fully understand the easy structure of exact pattern matching. In this figure, we reinterpret our knowledge in terms of the encoding scheme of Alice for Pattern Matching with Mismatches (in particular, we show just the occurrences included in the set $S$) and showcase how the corresponding graph $G_S$ and its black components evolve. We connect the same positions in $P$, as well as pairs of positions that are aligned by an occurrence of $P$ in $T$. As there are no mismatches, every such line implies that the connected characters are equal. For each connected component of the resulting graph (a black component), we know that all involved positions in $P$ and $T$ must have the same symbol. For illustrative purposes, we assume that $\tau = 3$ and we replace each character of a black component with a sentinel character (unique to that component), that is, we depict the strings $P^\#$ and $T^\#$.
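The decoding for Pattern Matching with Mismatches can be sketched in Python as follows. This is a simplified illustration under our own naming (`encode`, `decode`, and the tuple-shaped sentinels are hypothetical): Bob merges the endpoints of black edges with a union-find structure, propagates the characters known from red edges to their classes, and masks every remaining class with a fresh sentinel. A class without a known character has no incident red edge and therefore coincides with a black component.

```python
def hamming_occurrences(p, t, k):
    """Positions i with at most k mismatches between p and t[i:i+m]."""
    m, n = len(p), len(t)
    return {i for i in range(n - m + 1)
            if sum(p[j] != t[i + j] for j in range(m)) <= k}


def encode(p, t, chosen):
    """Alice: for each chosen occurrence i, list (j, P[j], T[i+j]) at mismatches."""
    return {i: [(j, p[j], t[i + j]) for j in range(len(p)) if p[j] != t[i + j]]
            for i in chosen}


def decode(m, n, message):
    """Bob: rebuild the masked strings P# and T# from the message alone."""
    parent = list(range(m + n))          # vertices: P[0..m) followed by T[0..n)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    known = {}                           # endpoints of red edges -> their characters
    for i, mismatches in message.items():
        red = {j for j, _, _ in mismatches}
        for j, a, b in mismatches:
            known[j] = a
            known[m + i + j] = b
        for j in range(m):
            if j not in red:             # black edge: the two characters match
                parent[find(j)] = find(m + i + j)
    value = {find(v): c for v, c in known.items()}
    # Classes without a known character are black components: mask them.
    symbols = [value.get(find(v), ("#", find(v))) for v in range(m + n)]
    return symbols[:m], symbols[m:]
```

For example, for $P = \texttt{abab}$, $T = \texttt{ababcb}$, and $k = 1$, the occurrences are $\{0, 2\}$; Bob learns that the even positions of $P$ carry `a` and that $T[4]$ is `c`, masks the remaining characters with a single sentinel, and still recovers the same set of occurrences from the masked strings.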

Communication Complexity of Pattern Matching with Edits
On a very high level, our encoding for Pattern Matching with Edits builds upon the approach for Pattern Matching with Mismatches presented above:
• Alice still constructs an appropriate size-$O(\log n)$ set $S$ of $k$-error occurrences of $P$ in $T$, including a prefix and a suffix of $T$.
• Bob uses the edit information for the occurrences in $S$ to construct a graph $G_S$ and strings $P^\#$ and $T^\#$, obtained from $P$ and $T$ by replacing characters in some components with sentinel characters so that $\mathrm{Occ}^E_k(P, T) = \mathrm{Occ}^E_k(P^\#, T^\#)$.
At the same time, the edit distance brings new challenges, so we also deviate from the original strategy:
• Connected components of $G_S$ do not have a simple periodic structure, so $g = \gcd(S)$ loses its meaning. Nevertheless, we prove that black components still behave in a structured way, and thus the number of black components, denoted $\mathrm{bc}(G_S)$, can be used instead.
• The value $\mathrm{bc}(G_S)$ is not as easy to compute as $\gcd(S)$, so we grow the set $S \subseteq \mathrm{Occ}^E_k(P, T)$ iteratively. In each step, either we add a single $k$-error occurrence so that $\mathrm{bc}(G_S)$ decreases by a factor of at least 2, or we realize that the information related to the alignments already included in $S$ suffices to retrieve all $k$-error occurrences of $P$ in $T$.
• Once this process terminates, there may unfortunately remain $k$-error occurrences whose addition to $S$ would decrease $\mathrm{bc}(G_S)$, yet only very slightly. In other words, such $k$-error occurrences generally obey the structure of black components, but may occasionally violate it. We need to understand where the latter may happen and learn the characters behind the black components involved so that they are not masked out in $P^\#$ and $T^\#$. This is the most involved part of our construction, where we use recent insights relating edit distance to compressibility [12,24] and store compressed representations of certain fragments of $T$.

(a) If we allow at most 3 mismatches, we now do not have an occurrence starting at position $\tau$ anymore; hence we obtain six black components.
(b) The graph $G_S$ that corresponds to Figure 3a. We make explicit characters that are different from the "default" character of a component; the corresponding red edges (that are highlighted) are exactly the mismatch information that is stored in $S$. For the remaining edges, the color depicts the color of the connected component that they belong to. In the example, we have four black components, that is, $\mathrm{bc}(G_S) = 4$. (Observe that contrary to what the image might make you believe, not every "non-default" character needs to end in a highlighted red edge.)
(c) Compare Figure 2c. We are still able to identify an additional occurrence of $P$ in $T$ starting at position $0 < \tau' < \tau$ (highlighted in purple). Now, as before, connected components of $G_S$ merge; this time, this also means that some characters that were previously part of a black component now become part of a red component (but crucially never vice versa). In the example, this means that we now have just a single black component, that is, $\mathrm{bc}(G_S) = 1$.
(d) The graph $G_S$ for the situation in Figure 3c. Again, we make explicit characters that are different from the "default" character of a component; the corresponding red edges (that are highlighted) are exactly the mismatch information that is stored in $S$. For the remaining edges, the color depicts the color of the connected component that they belong to (where purple highlights some of the black edges added due to the new occurrence).
(e) Checking for an occurrence at position $2\gcd(\tau, \tau')$ (which would be an occurrence were it not for mismatched characters). We check two things: first, that the black component aligns; and second, for the red component where we know all characters, we compute exactly the Hamming distance (which is 4 in the example, meaning that there is no occurrence at the position in question).
Figure 3: Compared to Figure 2, we now have characters in $P$ and $T$ that mismatch. Again, we showcase how the corresponding graph $G_S$ and its black components evolve; in the example, we allow for up to $k = 3$ mismatches. Again, for illustrative purposes, we assume that $\tau = 3$ and we replace each character of a black component with a sentinel character (unique to that component), that is, we depict the strings $P^\#$ and $T^\#$.

General Setup.
Technically, the set $S$ that Alice constructs contains, instead of $k$-error occurrences $T[t..t')$, specific alignments $P \leadsto T[t..t')$ of cost at most $k$. Every such alignment describes a sequence of (at most $k$) edits that transform $P$ onto $T[t..t')$; see the full version for details. In the message that Alice constructs, each alignment is augmented with edit information, which specifies the positions and values of the edited characters; again, see the full version for details. For a single alignment of cost $k$, this information takes $O(k \log(m|\Sigma|))$ bits, where $\Sigma$ is the alphabet of $P$ and $T$.
Just like for Pattern Matching with Mismatches, we can assume without loss of generality that $P$ has $k$-error occurrences both as a prefix and as a suffix of $T$. Consequently, we always assume that $S$ contains an alignment $\mathcal{X}_{\mathrm{pref}}$ that aligns $P$ with a prefix of $T$ and an alignment $\mathcal{X}_{\mathrm{suf}}$ that aligns $P$ with a suffix of $T$.
The graph $G_S$ is constructed similarly as for mismatches: the vertices are characters of $P$ and $T$, whereas the edges correspond to pairs of characters aligned by any alignment in $S$. Matched pairs of characters correspond to black edges, whereas substitutions correspond to red edges, annotated with the values of the mismatching characters. Insertions and deletions are also captured by red edges; see the full version for details.
Again, we classify connected components of $G_S$ into black (with black edges only) and red (with at least one red edge). Observe that Bob can reconstruct the graph $G_S$ and the values of all characters in red components and that black components remain uniform, that is, every two characters in a single black component match. Consult Figure 4 for a visualization of an example.
Finally, we define $\mathrm{bc}(G_S)$ to be the number of black components in $G_S$. If $\mathrm{bc}(G_S) = 0$, then Bob can reconstruct the whole strings $P$ and $T$, so we henceforth assume $\mathrm{bc}(G_S) > 0$.
First Insights into $G_S$. Our first notable insight is that black components exhibit periodic structure. To that end, write $P|_S$ for the subsequence of $P$ that contains all characters of $P$ that are contained in a black component in $G_S$ and write $T|_S$ for the subsequence of $T$ that contains all characters of $T$ that are contained in a black component in $G_S$. Then, for every $c \in [0..\mathrm{bc}(G_S))$, there is a component consisting of all characters $P|_S[i]$ and $T|_S[i]$ such that $i \equiv_{\mathrm{bc}(G_S)} c$; for a formal statement and proof, consult the full version. Also consult Figure 4c for an illustration of an example.
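The periodic structure of black components can be checked on small examples with the following Python sketch. The input format is our own assumption rather than the paper's encoding: every alignment is a list of triples `(j, i, matched)`, where `j` is a pattern position or `None` (insertion of $T[i]$), `i` is a text position or `None` (deletion of $P[j]$), and `matched` distinguishes black edges from red ones. Merging only along black edges and discarding every class that touches a red edge yields exactly the black components.

```python
def black_components(m, n, alignments):
    """Black components of the alignment graph, as lists of vertex ids
    (pattern position j has id j, text position i has id m + i)."""
    parent = list(range(m + n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    red_vertices = set()                 # endpoints of substitutions and indels
    for alignment in alignments:
        for j, i, matched in alignment:
            if matched:
                parent[find(j)] = find(m + i)
            else:
                if j is not None:
                    red_vertices.add(j)
                if i is not None:
                    red_vertices.add(m + i)
    red_roots = {find(v) for v in red_vertices}
    components = {}
    for v in range(m + n):
        root = find(v)
        if root not in red_roots:
            components.setdefault(root, []).append(v)
    return list(components.values())


def residues_per_component(m, components):
    """For each black component, the set of ranks (within P|_S or T|_S)
    of its characters, taken modulo the number of black components."""
    bc = len(components)
    black = sorted(v for comp in components for v in comp)
    p_black = [v for v in black if v < m]
    t_black = [v for v in black if v >= m]
    rank = {v: r for r, v in enumerate(p_black)}
    rank.update({v: r for r, v in enumerate(t_black)})
    return [{rank[v] % bc for v in comp} for comp in components]


# P = "ababab", T = "ababxabab": both alignments insert the character x = T[4].
pref = [(0, 0, True), (1, 1, True), (2, 2, True), (3, 3, True),
        (None, 4, False), (4, 5, True), (5, 6, True)]
suf = [(0, 2, True), (1, 3, True), (None, 4, False),
       (2, 5, True), (3, 6, True), (4, 7, True), (5, 8, True)]
components = black_components(6, 9, [pref, suf])
```

In this example, the two black components collect the characters `a` and `b`, respectively, and each of them occupies a single residue class modulo $\mathrm{bc}(G_S) = 2$ in $P|_S$ and $T|_S$.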

Extra Information to
[ . .′ ) matches [ ] with [ + ], there is no guarantee that it also matches . The reason behind this phenomenon is that the composition of optimal edit-distance alignments is not necessarily optimal (more generally, the edit information of optimal alignments $X \leadsto Y$ and $Y \leadsto Z$ is insufficient to recover $\delta_E(X, Z)$). In these circumstances, our workaround is to identify a set $C_S \subseteq [0..\mathrm{bc}(G_S))$ such that the underlying characters can be encoded in $\tilde{O}(k|S|)$ space and every alignment $\mathcal{X} : P \leadsto T[t..t')$ that we need to capture matches [ ] with [ + ] for every
For this, we investigate how an optimal alignment $\mathcal{X} : P \leadsto T[t..t')$ may differ from a canonical alignment $\mathcal{A} : P \leadsto T[t..t')$. Following recent insights from [12,24], we observe that the fragments of $P$ on which $\mathcal{A}$ and $\mathcal{X}$ are disjoint can be compressed into $O(\delta_E^{\mathcal{A}}(P, T[t..t')))$ space (using Lempel-Ziv factorization [38], for example). Moreover, the compressed size of each of these fragments is at most proportional to the cost of $\mathcal{A}$ on the fragment. Consequently, our goal is to understand where $\mathcal{A}$ makes edits and learn all the fragments of $P$ (and $T$) with a sufficiently high density of edits compared to the compressed size. Due to the quasi-periodic nature of $P$ and $T$, for each $c \in [0..\mathrm{bc}(G_S))$, all characters in the $c$-th black component are equal to [ 0 ], so we can focus on learning fragments of [ 0 0 . .

bc(G ) −1 0
]. The bulk of the alignment $\mathcal{A}$ can be decomposed into pieces that align [ . .+1 ) onto [ + . .+1 + ). In the full version, we prove that ( [ . .+1 ), [ + . .+1 + )) ≤ w ( ), where w ( ) is the total cost incurred by alignments in $S$ on all fragments

Further, as alignments for occurrences are no longer unique, we have to choose an alignment for each occurrence in the set $S$ (which can fortunately be stored efficiently).
(b) The graph $G_S$ that corresponds to the situation in Figure 3a. Observe that now, we also have a sentinel vertex $\bot$ to represent that an insertion or deletion happened. Observe further that due to insertions and deletions, the last empty star character of $T$ now belongs to the component of filled diamonds. In the example, we have two black components, that is, $\mathrm{bc}(G_S) = 2$.
Figure 4: Compare Figures 2 and 3. In addition to mismatches, we now also allow character insertions or deletions. In the example, we depict occurrences with at most $k = 4$ edits.
up by an appropriate constant factor). Additionally, to handle corner cases, we also learn the longest prefix and the longest suffix of [ 0 0 . . Following the aforementioned strategy of comparing the regions where $\mathcal{X} : P \leadsto T[t..t')$ is disjoint with the canonical alignment $\mathcal{A} : P \leadsto T[t..t')$, we prove the following result. Due to corner cases arising at the endpoints of $T[t..t')$ and between subsequent fragments [ 0 + . .

2.2.3 Extending $S$ with Uncaptured Alignments. Proposition 2.1 indicates that $S$ captures all $k$-error occurrences $T[t..t')$ such that
As long as $S$ does not capture some $k$-error occurrence $T[t..t')$, we add an underlying optimal alignment $\mathcal{X} : P \leadsto T[t..t')$ to the set $S$. In the full version, we prove that $\mathrm{bc}(G_{S \cup \{\mathcal{X}\}}) \le \mathrm{bc}(G_S)/2$ holds for such an alignment $\mathcal{X}$. For this, we first eliminate the possibility of + 0 0 ≫ 0 0 − 0 (using $\mathcal{X}_{\mathrm{suf}} \in S$, which matches
Based on this encoding, we can construct strings $P^\#$ and $T^\#$ obtained from $P$ and $T$, respectively, by replacing with $\#_c$ every character in the $c$-th connected component for every $c \in [0..\mathrm{bc}(G_S)) \setminus C_S$. As a relatively straightforward consequence of Proposition 2.1, we then prove that $\mathrm{Occ}^E_k(P, T) = \mathrm{Occ}^E_k(P^\#, T^\#)$ and that the edit information is preserved for every optimal alignment $P \leadsto T[t..t')$ of cost at most $k$.

Quantum Query Complexity of Pattern Matching with Edits
As an illustration of the applicability of the combinatorial insights behind our communication complexity result (Main Theorem 1), we study quantum algorithms for Pattern Matching with Edits. As indicated in Main Theorems 3 and 4, the query complexity we achieve is only a sub-polynomial factor away from the unconditional lower bounds, both for the decision version of the problem (where we only need to decide whether $\mathrm{Occ}^E_k(P, T)$ is empty or not) and for the standard version asking to report $\mathrm{Occ}^E_k(P, T)$.
Our lower bounds (in Main Theorem 4) are relatively direct applications of the adversary method of Ambainis [3], so this overview is solely dedicated to the much more challenging upper bounds. Just like for the communication complexity above, we assume that $n \le 3/2 \cdot m$ and $k = o(m)$. In this case, our goal is to achieve the query complexity of $\hat{O}(\sqrt{km})$. Our solution incorporates four main tools:
• the approximate pattern matching algorithm of [15],
• the recent quantum algorithm for computing (bounded) edit distance [24],
• the novel combinatorial insights behind Main Theorem 1,
• a new quantum $n^{o(1)}$-factor approximation algorithm for edit distance that uses $\hat{O}(\sqrt{n})$ queries and is an adaptation of a classic sublinear-time algorithm of [25].
2.3.1 Baseline Algorithm. We set the stage by describing a relatively simple algorithm that relies only on the first two of the aforementioned four tools. This algorithm makes $\tilde{O}(\sqrt{k^3 m})$ quantum queries to decide whether $\mathrm{Occ}^E_k(P, T) = \emptyset$.
The findings of [15] outline two distinct scenarios: either there are few $k$-error occurrences of $P$ in $T$ or the pattern is approximately periodic. In the former case, the set $\mathrm{Occ}^E_k(P, T)$ is of size $O(k^2)$, and it is contained in a union of $O(k)$ intervals of length $O(k)$ each. In the latter case, a primitive approximate period $Q$ of small length $|Q| = O(m/k)$ exists such that $P$ and the relevant portion of $T$ (excluding the characters to the left of the leftmost $k$-error occurrence and to the right of the rightmost $k$-error occurrence) are at edit distance $O(k)$ to substrings of $Q^\infty$. It is solely the pattern that determines which of these two cases holds: the initial two options in the following lemma correspond to the non-periodic case, where there are few $k$-error occurrences of $P$ in $T$, whereas the third option indicates the (approximately) periodic case, where the pattern admits a short approximate period $Q$. Here, $\delta_E(R, {}^*Q^*)$ denotes the minimum edit distance between a string $R$ and any substring of $Q^\infty$.
The proof of Lemma 2.2 is constructive, providing a classical algorithm that performs the necessary decomposition and identifies the specific case. The analogous procedure for Pattern Matching with Mismatches also admits an efficient quantum implementation [27] using $\tilde{O}(\sqrt{km})$ queries and time. As our first technical contribution, we adapt the decomposition algorithm for the edit case to the quantum setting so that it uses $\tilde{O}(\sqrt{km})$ queries and $\tilde{O}(\sqrt{km} + k^2)$ time. Compared to the classic implementation in [15] and the mismatch version in [27], it is not so easy to efficiently construct repetitive regions. In this context, we are given a length-$\lfloor m/8k \rfloor$ fragment with exact period $Q$ and the task is to extend it to $R$ so that $\Delta := \delta_E(R, {}^*Q^*)$ reaches $\lceil 8k/m \cdot |R| \rceil$. Previous algorithms use Longest Common Extension queries and gradually grow $R$, increasing $\Delta$ by one unit each time; this can be seen as an online implementation of the Landau-Vishkin algorithm for the bounded edit distance problem [30]. Unfortunately, the near-optimal quantum algorithm for bounded edit distance [24] is much more involved and does not seem amenable to an online implementation. To circumvent this issue, we apply exponential search (just like in Newton's root-finding method, this is possible even though the sign of $\lceil 8k/m \cdot |R| \rceil - \delta_E(R, {}^*Q^*)$ may change many times). At each step, we apply a slightly extended version of the algorithm of [24] that allows simultaneously computing the edit distance between $R$ and multiple substrings of $Q^\infty$; see the full version for details.
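The remark that exponential search works even when the sign changes many times can be illustrated with a small classical sketch. The function name `find_crossing` and the predicate `positive` are hypothetical stand-ins (in the text, the predicate would correspond to the sign of the difference between the target and the current distance); the sketch only assumes that the predicate holds at 0 and fails at `limit`, and it returns some crossing point, not necessarily the first one.

```python
def find_crossing(positive, limit):
    """Return x in [0, limit) with positive(x) true and positive(x + 1) false.

    Assumes positive(0) is true and positive(limit) is false; the predicate
    may flip arbitrarily often in between (no monotonicity is required).
    """
    lo, hi = 0, 1
    # Exponential phase: keep doubling while the probe is still positive.
    while hi < limit and positive(hi):
        lo, hi = hi, min(2 * hi, limit)
    # Invariant from here on: positive(lo) is true and positive(hi) is false.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if positive(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

The invariant "true at `lo`, false at `hi`" is all that the binary phase needs, which is why a non-monotone predicate does not break the search; the number of predicate evaluations stays logarithmic in `limit`.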
Once the decomposition has been computed, the next step is to apply the structure of the pattern in various cases to find the $k$-error occurrences. The fundamental building block needed here is a subroutine that verifies an interval $I$ of $O(k)$ positive integers, that is, computes $\mathrm{Occ}^E_k(P, T) \cap I$. The aforementioned extension of the bounded edit distance algorithm of [24] allows implementing this operation using $\tilde{O}(\sqrt{km})$ quantum queries and $\tilde{O}(\sqrt{km} + k^2)$ time.
By directly following the approach of [15], computing $\mathrm{Occ}^E_k(P, T)$ can be reduced to verification of $O(k^2)$ intervals (the periodic case constitutes the bottleneck for the number of intervals), which yields a total query complexity of $\tilde{O}(k^2 \sqrt{km})$. If we only aim to decide whether $\mathrm{Occ}^E_k(P, T)$ is empty, we can apply Grover's search on top of the verification algorithm, reducing the query complexity to $\tilde{O}(\sqrt{k^3 m})$. One can also hope for further speed-ups based on the more recent results of [16], where the number of intervals is effectively reduced to $\tilde{O}(k^{1.5})$. Nevertheless, already in the non-periodic case, where the number of intervals is $O(k)$, this approach does not provide any hope of reaching query complexity beyond $\tilde{O}(\sqrt{k^2 m})$ for the decision version and $\tilde{O}(\sqrt{k^3 m})$ for the reporting version of Pattern Matching with Edits.
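As a sanity check, the bounds in this paragraph can be recomputed by hand. The sketch below assumes that verifying a single interval costs $\tilde{O}(\sqrt{km})$ queries and that Grover's search over $N$ candidates needs $O(\sqrt{N})$ invocations of the verifier; polylogarithmic factors are ignored.

```latex
\begin{align*}
\text{reporting, } O(k^2) \text{ intervals:} \quad & k^2 \cdot \sqrt{km} = \sqrt{k^5 m},\\
\text{decision, } O(k^2) \text{ intervals:} \quad & \sqrt{k^2} \cdot \sqrt{km} = \sqrt{k^3 m},\\
\text{reporting, } O(k) \text{ intervals:} \quad & k \cdot \sqrt{km} = \sqrt{k^3 m},\\
\text{decision, } O(k) \text{ intervals:} \quad & \sqrt{k} \cdot \sqrt{km} = \sqrt{k^2 m}.
\end{align*}
```

In each row, the gap to the target $\sqrt{km}$ comes entirely from the number of verifier invocations, which is why the subsequent discussion focuses on verifying many intervals at once.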

2.3.2 How to Efficiently Verify $O(k)$ Candidate Intervals?
As indicated above, the main bottleneck that we need to overcome to achieve the near-optimal query complexity is to verify $O(k)$ intervals using $\hat{O}(\sqrt{km})$ queries. Notably, an unconditional lower bound for bounded edit distance indicates that $\Omega(\sqrt{km})$ queries are already needed to verify a length-1 interval.
A ray of hope stemming from our insights behind Main Theorem 1 is that, as described in Section 2.2, already a careful selection of just $O(\log m)$ among the $k$-error occurrences reveals a lot of structure that can be ultimately used to recover the whole set $\mathrm{Occ}^E_k(P, T)$. To illustrate how to use this observation, let us initially make an unrealistic assumption that every candidate interval $I$ contains a $K$-error occurrence for some $K = \hat{O}(k)$. Such occurrences can be detected using the existing verification procedure using First, we verify the leftmost and the rightmost intervals. This allows finding the leftmost and the rightmost $K$-error occurrences of $P$ in $T$. We henceforth assume that text $T$ is cropped so that these two $K$-error occurrences constitute a prefix and a suffix of $T$, respectively. The underlying alignments are the initial elements of the set $S$ that we maintain using the insights of Section 2.2. Even though these two alignments have cost at most $K$, for technical reasons, we subsequently allow adding to $S$ alignments of cost up to $k' = K + O(k)$. Using the edit information for alignments $\mathcal{X} \in S$, we build the graph $G_S$, calculate its connected components, and classify them as red and black components.
If there are no black components, that is, $\mathrm{bc}(G_S) = 0$, then the edit information for the alignments $\mathcal{X} \in S$ allows recovering the whole input strings $P$ and $T$. Thus, no further quantum queries are needed, and we complete the computation using a classical verification algorithm in $O(m + k^3)$ time.
If there are black components, we retrieve the positions If any of the candidate intervals $I$ contains a position that is not captured by $S$, we verify that interval and, based on our assumption, obtain a $K$-error occurrence of $P$ in $T$ that starts somewhere within $I$. Furthermore, we can derive an optimal alignment $\mathcal{X} : P \leadsto T[t..t')$ whose cost does not exceed $K + |I| \le k'$ because $|I| = O(k)$. This $k'$-error occurrence is not captured by $S$, so we can add $\mathcal{X}$ to $S$ and, as a result, the number of black components decreases at least twofold.
The remaining possibility is that $S$ captures all positions contained in the candidate intervals $I$. In this case, our goal is to construct strings $P^{\#}$ and $T^{\#}$, which are guaranteed to satisfy $\mathrm{Occ}^E_k(P, T) \cap I = \mathrm{Occ}^E_k(P^{\#}, T^{\#}) \cap I$ for each candidate interval $I$ because $k \le k'$. For this, we need to build a period cover (that has the aforementioned properties; again see the full version for details), which requires retrieving certain compressible substrings of $T$. The minimum period cover utilized in our encoding does not seem to admit an efficient quantum construction procedure, so we build a slightly larger period cover whose encoding incurs a logarithmic-factor overhead.
The key subroutine that we repeatedly use while constructing this period cover asks to compute the longest fragment of $T$ (or of the reverse text $\overline{T}$) that starts at a given position and admits a Lempel-Ziv factorization [38] of size bounded by a given threshold. For this, we use exponential search combined with the recent quantum LZ factorization algorithm [24]. Based on the computed period cover, we can construct the strings $P^{\#}$ and $T^{\#}$ and resort to a classic verification algorithm (that performs no quantum queries) to process all $O(k)$ intervals $I$ in time $O(m + k^3)$.
The next step is to drop the unrealistic assumption that every candidate interval $I$ contains a $K$-error occurrence of $P$. The natural approach is to test each of the candidate intervals using an approximation algorithm that either reports that $\mathrm{Occ}^E_k(P, T) \cap I = \emptyset$ (in which case we can drop the interval since we are ultimately looking for $k$-error occurrences) or that $\mathrm{Occ}^E_K(P, T) \cap I \ne \emptyset$ (in which case the interval satisfies our assumption). Given that $|I|$ is much smaller than $K$, it is enough to approximate $\delta_E(P, T[i..i+m))$ for an arbitrary single position $i \in I$ (distinguishing between distances at most $O(k)$ and at least $K - O(k)$). Although the quantum complexity of approximating edit distance has not been studied yet, we observe that the recent sublinear-time algorithm of Goldenberg, Kociumaka, Krauthgamer, and Saha [25] is easy to adapt to the quantum setting, resulting in a query complexity of $\hat{O}(\sqrt{m})$ and an approximation ratio of $m^{o(1)} = \hat{O}(1)$; see the full version for details.
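The Lempel-Ziv subroutine mentioned above (longest fragment starting at a given position whose factorization has bounded size) can be sketched classically. This is only an illustration under stated assumptions: `lz_size` is a naive greedy LZ77-style phrase counter written for this sketch, standing in for the quantum factorization algorithm of [24], and `longest_lz_bounded_fragment` is a hypothetical name. The search relies on the fact that the greedy phrase count never decreases when a prefix is extended.

```python
def lz_size(s):
    """Number of phrases in a greedy LZ77-style parse (overlapping copies allowed)."""
    i, phrases = 0, 0
    while i < len(s):
        best = 0
        for j in range(i):  # try every earlier starting position as a source
            l = 0
            while i + l < len(s) and s[j + l] == s[i + l]:
                l += 1
            best = max(best, l)
        i += max(best, 1)   # a fresh character forms a phrase of length 1
        phrases += 1
    return phrases


def longest_lz_bounded_fragment(text, start, z):
    """Largest L such that text[start:start+L] has an LZ parse of at most z phrases."""
    n_left = len(text) - start

    def ok(length):
        return lz_size(text[start:start + length]) <= z

    lo, hi = 0, 1                      # the empty fragment is always feasible
    while hi <= n_left and ok(hi):     # exponential phase
        lo, hi = hi, 2 * hi
    hi = min(hi, n_left + 1)           # hi is infeasible or just past the end
    while hi - lo > 1:                 # binary phase on a monotone predicate
        mid = (lo + hi) // 2
        if ok(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

For example, `"ababababab"` parses into three phrases (`a`, `b`, and one long self-overlapping copy), so with threshold 3 the fragment of `"abababababX"` starting at position 0 has length 10.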
Unfortunately, we cannot afford to run this approximation algorithm for every candidate interval: that would require $\hat{O}(k\sqrt{m})$ queries. Our final trick is to use Grover's search on top: given a subset of the $O(k)$ candidate intervals, using just $\hat{O}(\sqrt{km})$ queries, we can either learn that none of them contains any $k$-error occurrence (in this case, we can discard all of them) or identify one that contains a $K$-error occurrence. Combined with binary search, this approach allows discarding some candidate intervals so that the leftmost and the rightmost among the remaining ones contain $K$-error occurrences. The underlying alignments (constructed using the exact quantum bounded edit distance algorithm of [24]) are used to initialize the set $S$. At each step of growing $S$, on the other hand, we apply our approximation algorithm to the set of all candidate intervals that are not yet (fully) captured by $S$. Either none of these intervals contain $k$-error occurrences (and the construction of $S$ may stop), or we get one that is guaranteed to contain a $K$-error occurrence. In this case, we construct an appropriate low-cost alignment $\mathcal{X}$ using the exact algorithm and extend the set $S$ with $\mathcal{X}$. Thus, the unrealistic assumption is not needed to construct the set $S$ and the strings $P^{\#}$ and $T^{\#}$ using $\hat{O}(\sqrt{km})$ queries.
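The combination of a subset test with binary search can be sketched as follows. The oracle `any_hit(lo, hi)` is a hypothetical classical stand-in for the Grover-based test over the candidate intervals with indices in `[lo, hi)`; the sketch assumes an exact yes/no answer, whereas the procedure in the text only distinguishes between two different error thresholds.

```python
def leftmost_hit(num_intervals, any_hit):
    """Index of the leftmost interval with a hit, or None if there is none."""
    if not any_hit(0, num_intervals):
        return None
    lo, hi = 0, num_intervals          # invariant: the leftmost hit lies in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if any_hit(lo, mid):
            hi = mid
        else:
            lo = mid
    return lo


def rightmost_hit(num_intervals, any_hit):
    """Index of the rightmost interval with a hit, or None if there is none."""
    if not any_hit(0, num_intervals):
        return None
    lo, hi = 0, num_intervals          # invariant: the rightmost hit lies in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if any_hit(mid, hi):
            lo = mid
        else:
            hi = mid
    return lo
```

Intervals strictly to the left of `leftmost_hit` and strictly to the right of `rightmost_hit` can then be discarded, using only logarithmically many subset tests.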

2.3.3 Handling the Approximately Periodic Case. Verifying $O(k)$ candidate intervals was the only bottleneck of the non-periodic case of Pattern Matching with Edits. In the approximately periodic case, on the other hand, we may have $O(k^2)$ candidate intervals, so a direct application of the approach presented above only yields an $\hat{O}(\sqrt{k^2 m})$-query algorithm. Fortunately, a closer inspection of the candidate intervals constructed in [15] reveals that they satisfy the unrealistic assumption that we made above: each of them contains an $O(k)$-error occurrence of $P$. This is because both $P$ and the relevant part of $T$ are at edit distance $O(k)$ from substrings of $Q^\infty$ and each of the intervals contains a position that allows aligning $P$ into $T$ via the substrings of $Q^\infty$ (so that perfect copies of $Q$ are matched with no edits). Consequently, a set $S$ of $O(\log m)$ alignments covering all candidate intervals can be constructed using $\tilde{O}(\sqrt{km})$ queries. Moreover, once we construct the strings $P^{\#}$ and $T^{\#}$, instead of verifying all $O(k^2)$ candidate intervals, which takes $O(m + k^4)$ time, we can use the classic $\tilde{O}(m + k^{3.5})$-time algorithm of [16] to construct the entire set $\mathrm{Occ}^E_k(P^{\#}, T^{\#}) = \mathrm{Occ}^E_k(P, T)$.
For a single $k$-mismatch occurrence, the mismatch information can be encoded in $O(k \log(m|\Sigma|))$ bits, where $\Sigma$ is the alphabet of $P$ and $T$. Due to $|S| = O(\log m)$, the overall encoding size is $O(k \log m \log(m|\Sigma|))$.
Compare Figure 2a. We depict mismatched characters in an alignment of $P$ to $T$ by placing a cross over the corresponding character in .

this is because the path from [ ] to [ + ] in $G_S$ allows us to obtain an alignment [ . .+1 ) [ + . .+1 + ) as a composition of pieces of alignments in $S$ and their inverses. Every component ∈ [ 0 . . $\mathrm{bc}(G_S)$ ) uses distinct pieces, so the total weight := w ( ) does not exceed • | |. The weight function w ( ) governs which characters of we need to learn. In the full version, we formalize this with a notion of a period cover ⊆ [ 0 . . $\mathrm{bc}(G_S)$ ). Most importantly, we require that [ . .] ⊆ holds whenever the compressed size of [ 0 . .0 ] is smaller than the total weight = −1 w ( ) (scaled
Compare Figure 3a. In addition to mismatched characters, we now also have missing characters in $P$ and $T$ (depicted by a white space).
An illustration of the additional notation that we use to analyze $G_S$. Removing every character involved in a red component, we obtain the strings $P|_S$ and $T|_S$. For each black component, we number the corresponding characters in $P$ and $T$ from left to right.
holds for every ∈ [ 0 . .0 ), on the other hand, then there is no ∈ [ 0 . . $\mathrm{bc}(G_S)$ ) such that [ 0 ] can be matched with any character in the th connected component. Consequently, each black component becomes red or gets merged with another black component, resulting in the claimed inequality $\mathrm{bc}(G_{S \cup \{\mathcal{X}\}}) \le \mathrm{bc}(G_S)/2$. From $\mathrm{bc}(G_{S \cup \{\mathcal{X}\}}) \le \mathrm{bc}(G_S)/2$ and since $\mathrm{bc}(G_S) \le m$ holds when we begin with $|S| = 2$, the total size $|S|$ does not exceed $O(\log m)$ before we either arrive at $\mathrm{bc}(G_S) = 0$, in which case the whole input can be encoded in $O(k|S| \log(m|\Sigma|))$ bits, or $S$ captures all $k$-error occurrences. In the latter case, the encoding consists of the edit information for all alignments in $S$, as well as the set {( , [ 0 ]) : ∈ } which we know how to encode in $O(k|S| \log(m|\Sigma|))$ bits on top of the graph $G_S$ (as we prove in the full version).
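The logarithmic bound on the number of alignments follows from the halving inequality by a short induction. In the sketch below, $b_j$ is shorthand (introduced here only for illustration) for the number of black components after $j$ alignments have been added to the initial two, and the starting bound $b_0 \le m$ is taken as an assumption.

```latex
\begin{align*}
b_j \le \frac{b_{j-1}}{2} \ \text{for all } j \ge 1
  \quad &\Longrightarrow \quad b_j \le \frac{b_0}{2^j} \le \frac{m}{2^j},\\
j = \lfloor \log_2 m \rfloor + 1
  \quad &\Longrightarrow \quad b_j < 1
  \quad \Longrightarrow \quad b_j = 0,\\
\text{hence} \quad |S| &\le 2 + \lfloor \log_2 m \rfloor + 1 = O(\log m).
\end{align*}
```

The second line uses that $b_j$ is a non-negative integer, so a value below 1 means that no black components remain.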

0 0 , . . ., 0 0 −1 and 0 0 , . . ., 0 0 −1 contained in the 0-th black component. Based on these positions, we can classify $k'$-error occurrences $T[t..t')$ into those that are captured by $S$ (for which | 0 − 0 0 − | is small for some ∈ [ 0 . .0 − 0 ]) and those which are not captured by $S$. Although we do not know $k'$-error occurrences other than those contained in $S$, the test of comparing | 0 − 0 0 − | against a given threshold (which is $O(k'|S|)$) can be performed for any position, and thus we can classify arbitrary positions in $[0..|T|]$ into those that are captured by $S$ and those that are not.
Testing if an Occurrence Starts at a Given Position. With these ingredients, we are now ready to explain how Bob tests whether a given position $i \in [0..n-m]$ belongs to $\mathrm{Occ}^H_k(P, T)$. If $i$ is not divisible by $g$, then for sure $i \notin \mathrm{Occ}^H_k(P, T)$. Otherwise, for every $j \in [0..m)$, the characters $P[j]$ and $T[i+j]$ belong to the same connected component. If this component is red, then Bob knows the values of $P[j]$ and $T[i+j]$, so he can simply check if the characters match. Otherwise, the component is black, meaning that $P[j]$ and $T[i+j]$ are guaranteed to match. As a result, Bob can compute the Hamming distance $\delta_H(P, T[i..i+m))$ and check if it does not exceed $k$. In either case (as long as $i$ is divisible by $g$), he can even retrieve the underlying mismatch information.
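Bob's side of this test can be illustrated with a small classical sketch. All names (`encode`, `decode`, `hamming`) and the tuple sentinels for black components are hypothetical choices made for this illustration. The sketch only demonstrates that Hamming distances are preserved at the occurrences whose mismatch information was transmitted; the stronger statement for all positions divisible by the relevant period relies on additional conditions from the text that are not checked here.

```python
def hamming(p, t, i):
    """Hamming distance between p and the length-|p| window of t starting at i."""
    return sum(a != b for a, b in zip(p, t[i:i + len(p)]))


def encode(P, T, S):
    """Alice: for each occurrence start i in S, list mismatches (j, P[j], T[i+j])."""
    m = len(P)
    return {i: [(j, P[j], T[i + j]) for j in range(m) if P[j] != T[i + j]]
            for i in S}


def decode(m, n, info):
    """Bob: rebuild proxy strings from the mismatch information alone."""
    parent = list(range(m + n))        # nodes 0..m-1 are P, nodes m..m+n-1 are T

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    known = {}                         # node -> character revealed by a mismatch
    for i, mismatches in info.items():
        bad = {j for j, _, _ in mismatches}
        for j in range(m):
            if j not in bad:           # matching pair: same component
                parent[find(j)] = find(m + i + j)
        for j, a, b in mismatches:
            known[j], known[m + i + j] = a, b
    # A component is red if it contains a revealed character, black otherwise.
    color = {find(node): ch for node, ch in known.items()}
    proxy = [color.get(find(x), ("#", find(x))) for x in range(m + n)]
    return proxy[:m], proxy[m:]
```

With `P = "abcabc"`, `T = "abcabcabxabc"`, and `S = [0, 3, 6]`, the components of the `a` and `b` positions stay black (Bob sees only sentinels), while the `c` positions and the single `x` are red and recovered exactly.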