Locally Consistent Decomposition of Strings with Applications to Edit Distance Sketching

In this paper we provide a new locally consistent decomposition of strings. Each string x is decomposed into blocks that can be described by grammars of size O(k) (using some amount of randomness). If we take two strings x and y of edit distance at most k, then their block decompositions use the same number of grammars, and the i-th grammar of x is the same as the i-th grammar of y except for at most k indices i. The edit distance of x and y equals the sum of the edit distances of the pairs of blocks where x and y differ. Our decomposition can be used to design a sketch of size O(k^2) for edit distance, and also a rolling sketch for edit distance of size O(k^2). The rolling sketch allows updating the sketched string by appending a symbol or removing a symbol from the beginning of the string.


Introduction
Edit distance is a measure of similarity of two strings. It measures how many symbols one has to insert, delete or substitute in a string x to get a string y. The measure has many applications, from text processing to bioinformatics. The edit distance ED(x, y) of two strings x and y can be computed in time O(n^2) by a classic dynamic programming algorithm [WF74]. Save for poly-log improvements in the running time [MP80, Gra16], the best known running time for edit distance computation is O(n + k^2) [LMS98], where k = ED(x, y). Assuming the Strong Exponential Time Hypothesis (SETH), this running time cannot be substantially improved [BI15]. The conditional lower bound does not exclude approximation algorithms, though, and there has been recent progress on computing edit distance in almost-linear time to within some constant factor approximation [CDG+18, KS20, BR20, AN20].
Another problem for edit distance that has seen major progress in recent years is sketching. In sketching we want to map a string x to a short sketch sk^ED_{n,k}(x) so that from the sketches sk^ED_{n,k}(x) and sk^ED_{n,k}(y) of two strings x and y we can compute their edit distance, either exactly or approximately. A priori it is not even obvious that short sketches for edit distance exist. In a surprising construction, Belazzougui and Zhang [BZ16] gave an exact edit distance sketch of size O(k^8 log^5 n) bits. The sketch size was then improved to O(k^3 log^2(n/δ) log n) bits by Jin, Nelson and Wu [JNW21], where ED(x, y) is computed exactly from the sketches with probability at least 1 − δ, provided ED(x, y) ≤ k. The current best sketch is of size O(k^2 log^3 n) bits and was given by Kociumaka, Porat and Starikovskaya [KPS21]. [JNW21] also gives a lower bound of Ω(k) on the size of a sketch for exact edit distance.
The major problem in edit distance computation, as well as in sketching, is how to align the matching parts of the two strings x and y. Finding an optimal alignment of two strings is the crux of computing edit distance and of sketching it. In sketching, finding a good alignment is even more challenging, as we do not have both strings in hand simultaneously to look for the matching. To the best of our knowledge, to resolve this issue all edit distance sketches use the CGK random walk on strings [CGK16], which embeds the edit distance metric into the Hamming distance metric with distortion O(k). The walk implicitly fixes some reasonably good matching between the two strings. Going from the CGK random walk to a sketch is a non-trivial undertaking, and all three sketch results rely on sophisticated machinery to achieve it.
In this paper we provide a new technique to align two strings x and y in an oblivious manner. In a nutshell, we provide a decomposition procedure that breaks x and y into the same number of "short" blocks so that at most k pairs of blocks in the decompositions of x and y differ, and all other pairs of blocks are matched in an optimal alignment. So the edit distance of x and y is the sum of the edit distances of the differing blocks. To be more specific, our blocks are not short in their length, but they are short in the sense that each of them can be described by a context-free grammar of size O(k). Our decomposition algorithm constructs the grammars. Our decomposition is based on locally consistent parsing of strings, a technique similar to the one used in [SV94, BES06, Jow12, BGP20], and hash-based partitioning similar to [?]. Our main technical result is:

Theorem 1.1 (String decomposition). There is an algorithm running in time O(|x|) that for each string x of length at most n produces grammars G^x_1, ..., G^x_s such that with probability at least 1 − 1/n, x = eval(G^x_1) • eval(G^x_2) ··· eval(G^x_s) and each of the grammars is of size O(k). (The O(•) notation hides factors that are poly-logarithmic in n.) Furthermore, for any two strings x and y of edit distance at most k with grammars G^x_1, ..., G^x_s and G^y_1, ..., G^y_{s′}, resp., that are produced by the algorithm using the same randomness, the following is true simultaneously with probability at least 4/5:

1. s = s′,
2. G^x_i = G^y_i for all i ∈ {1, ..., s} except for at most k indices i, and
3. ED(x, y) = Σ_i ED(eval(G^x_i), eval(G^y_i)).

Here, for a grammar G, eval(G) denotes its evaluation. Our decomposition can be used immediately to give an embedding of edit distance into Hamming distance with distortion O(k). It also readily yields a sketch for exact edit distance of size O(k^2):

Theorem 1.2 (Sketch for edit distance). There is a randomized sketching algorithm sk^ED_{n,k} that on an input string x of length at most n produces a sketch sk^ED_{n,k}(x) of size O(k^2) in time O(nk), and a comparison algorithm running in time O(k^2) that, given two sketches sk^ED_{n,k}(x) and sk^ED_{n,k}(y) for two strings x and y of length at most n obtained using the same randomness of the sketching algorithm, outputs with probability at least 1 − 1/n (over the randomness of the sketching and comparison algorithms) the edit distance of x and y if it is less than k, and ∞ otherwise.
Furthermore, we can also provide a rolling sketch, a sketch in which we can update the stored string by appending a symbol or removing its first symbol.
Theorem 1.3 (Rolling sketch for edit distance). There are algorithms Append(sk_x, a), Remove(sk_{ax}, a), and Compare(sk_x, sk_y) such that for integer parameters k ≤ m: 1. Given a sketch sk_x representing a string x and a symbol a, Append(sk_x, a) outputs a sketch sk_{xa} for the string xa in time O(k^2).
2. Given a sketch sk_{ax} representing a string ax for a symbol a, Remove(sk_{ax}, a) outputs a sketch sk_x for the string x in time O(k^2).
3. Given two sketches sk_x and sk_y representing strings x and y obtained from the same random sketch for the empty string using two sequences of at most m operations Append and Remove, Compare(sk_x, sk_y) calculates the edit distance of x and y if it is less than k, and outputs ∞ otherwise. The algorithm Compare(sk_x, sk_y) runs in time O(k^2).
All the sketches are of size O(k^2). The probability that any of the algorithms fails or produces incorrect output is at most 1/m over the initial randomness of the sketch for the empty string and the internal randomness of the algorithms.
We remark that we did not attempt to optimize the running time of our algorithms, or the poly-log factors in the sketch sizes, and we believe that both parameters can be readily improved by the usual amortization technique of processing symbols in batches of size O(k). We believe that the update time in the last theorem can be improved to O(1) by buffering the O(k) symbols that are to be inserted or removed, without affecting the other parameters of the algorithm.
Another distinguishing feature of our decomposition procedure, compared to the technique of CGK random walks, is its parallelizability. The CGK random walk seems inherently sequential, whereas our decomposition procedure can be easily parallelized. We believe that our decomposition will allow for further applications beyond our simple sketches.

Related work
The problem of embedding edit distance into other distance measures, like Hamming distance, ℓ1, etc., has been studied extensively. In [CGK16], the authors gave a randomized embedding from edit distance to Hamming distance, where any string x ∈ {0,1}^n can be mapped to a string f(x) ∈ {0,1}^{3n}, given a random string r ∈ {0,1}^{log^2 n}, such that ED(x, y)/2 ≤ Ham(f(x), f(y)) ≤ O(ED(x, y)^2) with probability at least 2/3. Batu, Ergun and Sahinalp [BES06] introduced a dimensionality reduction technique, where any string x of length n can be mapped to a string f(x) of length at most n/r, for any parameter r, with a distortion of O(r). They used the locally consistent parsing technique for their embedding. Ostrovsky and Rabani [OR07] gave an embedding from edit distance to ℓ1 distance with a distortion of 2^{O(√(log n log log n))}. Jowhari [Jow12] also gave a randomized embedding from edit distance to ℓ1 distance with a distortion of O(log n log* n). He used the embedding given by Cormode and Muthukrishnan [CM02], who showed that any string x of length n can be mapped to a vector f(x) of length m = O(2^n log n) whose ℓ1 distances reflect the edit distance. Since the size of the vector was too large, [Jow12] used random hashing to obtain his final embedding.

Our techniques
We first provide the intuition for our technique. We would like to break a string x into small blocks obliviously so that when a string y is broken by the same procedure, the difference between x and y caused by the edit operations is confined within the corresponding blocks of x and y, and the overall decomposition is not affected by them. For random binary strings x and y this can be done fairly easily: look at all the (overlapping) windows of log n consecutive bits in each of the strings and for each window decide at random whether to make a break at that window or not. To make the decision consistent between x and y, use some random hash function H : {0,1}^{log n} → {0, ..., D − 1}, and whenever the hash function evaluates to 0 on a given window, start a new block of the decomposition there. If we choose D suitably, say D ≥ 10k log n, then we are unlikely to start a new block in any window which is affected by the at most k edit operations on x and y. In that case we obtain the desired decomposition. Hence, decomposing random strings x and y is easy.
The issue is what to do with non-random strings. Consider for example strings x and y that are very sparse, say they contain √n ones sprinkled within a vast ocean of zeros. The hash function H will see mostly windows of 0's and occasionally a window of the form 0^i 1 0^{log(n)−i−1}. The decomposition will have no effect on such strings despite the fact that the string might contain Ω(√n) bits of entropy. However, we can compress such sparse strings: replace stretches of zeros by some binary-encoded information about their length, and try to break the strings again. Still, this will fail if in our example the stretches of zeros are replaced by stretches of some repeated pattern such as (01)*. So we need a slightly more general compression which will compress any log n bits into log(n)/2 bits. By repeating the sequence of steps, split and compress, we will eventually get the desired decomposition of each string.
Our actual algorithm mimics the above intuition. It is technically easier to work with a larger alphabet, so we extend the input alphabet Σ by adding special compression symbols to form the work alphabet Γ. (Without loss of generality we can assume that Σ is of size O(n^3); otherwise we can hash each symbol of our input strings using some perfect hash function into an alphabet of size O(n^3) without affecting the edit distance of a given pair of strings.) To split a string we will use a random hash function H : Γ^2 → {0, 1} from a suitable hash family that we call a (D, O(log n))-iterated pair-wise independent family, for D = Θ(k log n). If the hash function is zero on a pair of consecutive symbols in a string, we start a new block of the decomposition on the first symbol of the pair, and this happens with probability roughly 1/D for our choice of H.
Then in each resulting block we replace stretches of repeated symbols by a special compression symbol from Γ representing the stretch, and we use a pair-wise independent hash function C : Γ^2 → (Γ \ Σ) to compress non-overlapping pairs of symbols into one symbol. This latter step requires some care, as we have to make sure that we select the non-overlapping pairs in the same way in x and y. For the selection of non-overlapping pairs we use the locally consistent coloring of Cole and Vishkin [CV86, Lin87, Lin92], where the selection of pairs depends only on a context of O(log* n) symbols. The compression reduces the size of each block to at most 2/3 of its size. We repeat the compress-and-split process for O(log n) iterations until each compressed block of x is of size at most 2. Decompression of each block then gives us the desired decomposition of x. (See Fig. 1 for an illustration.) It is natural and convenient to represent each of the blocks by a context-free grammar which corresponds to the compression process. We can argue that the grammars will be of size O(D log n) with high probability. So we can represent each string by a sequence of small grammars so that if x and y are at edit distance at most k then at most k pairs of their grammars will differ, and the sum of the edit distances of the differing pairs is the edit distance of x and y. Note that the edit distance of two strings represented by context-free grammars can be computed efficiently [GKLS22]. These are the main ideas behind our decomposition algorithm, and we provide more details in Section 3.

Building a sketch from the string decomposition is straightforward: We encode each grammar in binary using a fixed number of bits, and we use an off-the-shelf sketch for Hamming distance to sketch the sequence of grammars. As the Hamming distance sketch does not recover identical bits but only the mismatched bits, we make sure that if two grammars differ then their binary encodings differ in every bit. Over a binary alphabet this might be impossible, but over large alphabets one could use error-correcting codes to achieve the desired effect of recovering the differing grammars; for simplicity we use the Karp-Rabin fingerprint of the whole grammar to encode the binary 0 and 1 distinctly. See Section 3.3 for the details of our encoding and Section 3.4 for details of the sketch for edit distance.
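The compression step can be illustrated by a toy round in Python. The run-length stage follows the text; the pairing stage, however, merges pairs greedily from the left instead of using the Cole-Vishkin coloring, so it is not locally consistent and only illustrates the shrinkage (names and encodings are ours).

```python
def compress_once(block, C):
    """One toy round of compression: collapse each run a^r (r >= 2) into a
    single run-symbol, then merge disjoint adjacent pairs with the hash C.
    The paper selects the pairs via a locally consistent coloring; here we
    pair greedily left to right."""
    out, i = [], 0
    while i < len(block):                       # run-length stage
        j = i
        while j < len(block) and block[j] == block[i]:
            j += 1
        out.append(('run', block[i], j - i) if j - i >= 2 else block[i])
        i = j
    merged = []                                 # pairing stage
    for i in range(0, len(out), 2):
        if i + 1 < len(out):
            merged.append(C(out[i], out[i + 1]))
        else:
            merged.append(out[i])
    return merged
```

For example, `aaabbc` first becomes the three symbols run(a,3), run(b,2), c, and the pairing stage then compresses these to two symbols.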
To design a rolling sketch for edit distance, where we can extend the represented string by a new symbol or repeatedly remove the first symbol of the represented string, we will employ our decomposition technique together with the rolling sketch for Hamming distance of Clifford, Kociumaka, and Porat [CKP19]. We will argue that appending a new symbol to a string affects only some fixed number of grammars in the decomposition of the string. There is a certain threshold T such that, except for the last T grammars, the decomposition of a string stays the same regardless of how many other symbols are appended. Hence, we will keep a buffer of at most T active grammars corresponding to the most recently added symbols, and upon addition of a new symbol we will only update those grammars. We are guaranteed that the grammars before this threshold will stay the same forever, so we can commit them into the rolling Hamming sketch (in the form of their binary encoding). Similarly, we will keep a buffer of up to T active grammars that capture the symbols that were deleted from the sketch most recently. Once they become "mature" enough, we can commit them by removing their binary encoding from the rolling Hamming sketch. (See Fig. 3 for an illustration.) This allows us to maintain a rolling sketch for edit distance.
Evaluation of an edit distance query on two rolling sketches will use their Hamming sketches to recover the differing committed grammars. Together with the active grammars of inserted and deleted symbols, this provides enough information to evaluate the edit distance query. Technical details are explained in Section 4. In Section 6 we give a table of the parameters used throughout the paper.

Notations and preliminaries
For any string x = x_1 x_2 x_3 ... x_n and integers p, q: x[p] denotes x_p, x[p, q] represents the substring x′ = x_p ... x_q of x, and x[p, q) = x[p, q − 1]. If q < p, then x[p, q] is the empty string ε. x[p, ...] represents x[p, |x|], where |x| is the length of x. The "•" operator denotes concatenation, e.g., x • y is the concatenation of two strings x and y. Dict(x) = {x[i, i + 1] : i ∈ [n − 1]} is the dictionary of the string x, which stores all pairs of consecutive symbols that appear in x. For strings x and y, ED(x, y) is the minimum number of modifications (edit operations) required to change x into y, where a single modification can be adding a character, deleting a character or substituting a character in x. All logarithms are base 2 unless stated otherwise. For integers p > q, Σ_{i=p}^{q} a_i = 0 by definition, regardless of the a_i's.
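For instance, the dictionary Dict(x) from the definition above can be computed directly (a one-line Python illustration; the function name is ours):

```python
def dict_pairs(x):
    """Dict(x): the set of all pairs of consecutive symbols appearing in x."""
    return {x[i:i + 2] for i in range(len(x) - 1)}
```

So dict_pairs("abab") is {"ab", "ba"}, and a run of a single symbol contributes only one pair.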

Grammars
Let Σ ⊆ Γ be two alphabets and

Proposition 2.2 ([CKP19]). There is a randomized algorithm working in time O(k log^3 p) that, given sketches sk^Ham_{n,k,p}(x) and sk^Ham_{n,k,p}(y) of two strings x and y of length ℓ ≤ n constructed using the same randomness, decides whether Ham(x, y) ≤ k, and if so returns MIS(x, y), with probability of error at most 1/n over the randomness of the sketches and the internal randomness of the algorithm.
They also construct the following update procedures for their sketch. We will use them to construct a rolling sketch for edit distance. Corollary 2.5 of [CKP19] states that appending a character to a sketch of x can be done even faster, namely in amortized time O(log p).

Locally consistent coloring
The following color reduction procedure allows for locally consistent parsing of strings.The technique was originally proposed by Cole and Vishkin [CV86] and further studied by Linial [Lin87,Lin92].
4. Out of every three consecutive symbols of F_CVL(x), at least one of them is 1.
The first three items are standard for R = log* |Γ| + 10. The other two can be obtained by a simple modification of the output of the standard function. In the output, first replace in parallel each sequence 232 by 212, and then each sequence 323 by 313. This guarantees the fourth condition. To satisfy the fifth condition, if |x| > 4 then replace the sequence at the beginning of the output as follows: if it starts with a word from {2,3}{2,3}1, replace it by 121; if it starts with {2,3}1{2,3}{2,3}, replace it by 1212; if it starts with {2,3}1{2,3}1, replace it by 1231. Then at the end of the sequence, replace 1{2,3}1 by 123, and 1{2,3}{2,3}1 by 1212. This increases the local dependency to at most R = log* |Γ| + 20.
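The two parallel replacement passes can be sketched as follows; this is a toy Python check of the fourth condition on a proper 3-coloring, with the list encoding being our own choice:

```python
def fix_coloring(c):
    """Apply the two parallel passes described above to a proper 3-coloring c
    (a list over {1,2,3} with no two equal neighbours): first every 2,3,2
    becomes 2,1,2, then every 3,2,3 becomes 3,1,3.  Afterwards every window
    of three consecutive colors contains a 1."""
    c1 = list(c)
    for i in range(1, len(c) - 1):       # pass 1: 232 -> 212, in parallel
        if c[i - 1:i + 2] == [2, 3, 2]:
            c1[i] = 1
    c2 = list(c1)
    for i in range(1, len(c1) - 1):      # pass 2: 323 -> 313, in parallel
        if c1[i - 1:i + 2] == [3, 2, 3]:
            c2[i] = 1
    return c2
```

In a proper 3-coloring the only 1-free windows are 232 and 323; the first pass removes all 232 windows without creating new 323 windows, and the second pass removes the remaining 323 windows.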

Random hash functions
For sets U and V, we say that a family H of functions from U to V is pair-wise independent if for every pair of distinct elements x, y ∈ U and every u, v ∈ V, Pr[h(x) = u and h(y) = v] = 1/|V|^2, where h is chosen uniformly at random from H.

Proposition 2.5. Let H = {h : U → V} be a pair-wise independent hash system. Let U′ ⊆ U where

We will use the following class of randomly selected hash functions to choose the splitting points, instead of a fully random function H from Γ^2 to {0, ..., D − 1}. For integral parameters D and ℓ, we say that H : Γ^2 → {0, 1} is a (D, ℓ)-iterated pair-wise independent function if H is obtained by selecting functions h_1, ..., h_ℓ : Γ^2 → {0, ..., ℓD − 1} independently at random from a pair-wise independent hash system, and for each ab ∈ Γ^2, H(ab) is set to 0 if ∏_{i=1}^{ℓ} h_i(ab) = 0, and H(ab) is set to 1 otherwise. Such a hash function H has several properties useful for us: it can be described using O(ℓ · (log |Γ| + log D + log ℓ)) bits; it can be evaluated at any point in time polynomial in the bit length of the description of H (so for ℓ = O(log n) and D and Γ polynomial in n, in time O(1)); for any pair of symbols ab ∈ Γ^2, the probability that H(ab) = 0 is roughly 1/D; and for any sufficiently large set S ⊆ Γ^2, the image of S under H will contain 0 with high probability. In particular, we will use the following simple facts.

For any ab ∈ Γ^2,

For the other claim, by the previous proposition, for a pair-wise independent h_i, if for all ab ∈ S we have H(ab) ≠ 0, then for all i ∈ {1, ..., ℓ} and all ab ∈ S, h_i(ab) ≠ 0. Hence, by the independence of h_1, ..., h_ℓ,
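A minimal sketch of such an iterated hash, assuming symbol pairs are pre-encoded as integers and using the standard (az + b mod p) pair-wise independent family; the construction details here are illustrative, not the paper's exact family:

```python
import random

def make_iterated_hash(D, ell, seed=0, p=(1 << 61) - 1):
    """Sketch of a (D, ell)-iterated pair-wise independent hash
    H : Gamma^2 -> {0,1}.  Each h_i(z) = ((a_i*z + b_i) mod p) mod (ell*D)
    is drawn from a pair-wise independent family; H(z) = 0 iff h_i(z) = 0
    for some i, which happens with probability roughly ell/(ell*D) = 1/D.
    A symbol pair ab is assumed to be encoded as a single integer z < p."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(ell)]
    M = ell * D
    def H(z):
        return 0 if any(((a * z + b) % p) % M == 0 for a, b in coeffs) else 1
    return H
```

The description is just the 2·ell coefficients, matching the O(ℓ · (log |Γ| + log D + log ℓ))-bit bound, and fixing the seed makes the same function available for both strings.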

Decomposition algorithm
In this section we describe the main technical tool that we have developed. It is a randomized procedure that splits a string x into blocks B^x_1, B^x_2, ..., B^x_s and for each block produces a grammar of size at most S = O(k). If B^x_1, B^x_2, ..., B^x_s is the decomposition for a string x and B^y_1, B^y_2, ..., B^y_{s′} is the decomposition for a string y, obtained using the same randomness, where ED(x, y) ≤ k, then with good probability s = s′ and B^x_i = B^y_i for all but at most k indices i. The edit distance of x and y can be calculated as ED(x, y) = Σ_i ED(B^x_i, B^y_i), where i ranges over the differing blocks. First we provide an overview of the algorithm; specific details are given in the next sub-section. The decomposition procedure proceeds in O(log n) rounds. In each round, the algorithm maintains a decomposition of x into compressed blocks. In each round, each block of size at least two is first compressed and then split. The compression is done by compressing pairs of consecutive symbols into one using a randomly chosen pair-wise independent hash function C_ℓ : Γ^2 → Γ, where ℓ is the round number (level). Non-overlapping pairs of symbols are chosen for compression using a locally consistent coloring, so that every three symbols shrink to at most two. Prior to the compression of pairs, we replace each repeated sequence a^r of a symbol a, r ≥ 2, by a special character r_{a,r}.
The splitting procedure uses a (D, O(log n))-iterated pair-wise independent hash function H_ℓ : Γ^2 → {0, 1} to select the places where each block is subdivided into sub-blocks, where D = O(k) is a suitable parameter. We start a new block at each consecutive pair of symbols ab with H_ℓ(ab) = 0; H_ℓ is chosen so that for each ab ∈ Γ^2, H_ℓ(ab) = 0 happens with probability roughly 1/D.
After O(log n) rounds, each block is compressed into at most two symbols and we output a grammar that can generate the block.
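The overall round structure (compress, then split, until every block fits in at most two symbols) can be mocked up as follows. Run-symbols, the locally consistent pairing and the grammar output are omitted, and crc32 stands in for both hash families, so this only illustrates the control flow; all names are ours.

```python
import zlib

def decompose(x, D=8, seed="r", max_rounds=64):
    """Toy round structure: repeatedly (i) compress each block by merging
    adjacent pairs of symbols and (ii) split blocks at pairs whose hash is
    0 mod D, until every block has at most two symbols."""
    def C(a, b):                          # pair "compression" symbol
        return ('p', a, b)
    def H(a, b):                          # splitting hash on a symbol pair
        return zlib.crc32((seed + repr((a, b))).encode()) % D
    blocks = [list(x)]
    for _ in range(max_rounds):
        if all(len(b) <= 2 for b in blocks):
            break
        nxt = []
        for b in blocks:
            if len(b) <= 2:
                nxt.append(b)
                continue
            comp = [C(b[i], b[i + 1]) if i + 1 < len(b) else b[i]
                    for i in range(0, len(b), 2)]
            cur = [comp[0]]
            for s in comp[1:]:
                if H(cur[-1], s) == 0:    # start a new block at s
                    nxt.append(cur)
                    cur = [s]
                else:
                    cur.append(s)
            nxt.append(cur)
        blocks = nxt
    return blocks
```

Every block longer than two symbols shrinks by roughly half per round, so the loop terminates after O(log n) rounds, mirroring the bound in the text.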
For the correctness of the algorithm we will need to establish several of its properties. Some of these properties concern the behaviour of the procedure on a single string x; others analyze its behaviour on a pair of strings x and y of edit distance at most k.
The properties we want from the algorithm when it runs on x are the following: In each round, each block should be compressed by a factor of at least 2/3, while the size of the grammar capturing the compression should be O(k). The former is achieved by the design of the compression procedure. The latter is provided by a property of the splitting procedure, which makes sure that each block has a small dictionary; the grammar size will be proportional to the size of this dictionary.
For the compression procedure we require that it preserves information, so the function C_ℓ must be one-to-one on each Dict(B). Since the total size of all dictionaries is bounded by O(n), this can be easily achieved by picking C_ℓ at random, provided that its range is of size Ω(n^3).
Additionally, we need the following property to hold, with good probability, on a pair of strings x and y of edit distance at most k: The splitting procedure should never split x or y in a region which is affected by the edit operations that transform x into y (for some canonical choice of those operations). The total size of those regions will again be O(k), so we can satisfy this property if each pair of symbols has probability at most 1/O(k) of starting a new block. This constrains the choice of the parameters for the splitting function H_ℓ.
In the next section we describe the decomposition algorithm fully, and then we establish its properties.

Algorithm description
Let n be an upper bound on the length of the input string and let k ≤ n be given. Set L = ⌈log_{3/2} n⌉ + 3 to be an upper bound on the decomposition depth. Let Σ be an input alphabet of size at most n^3, and let Σ_c = {c_1, c_2, ..., c_{Ln^4}} and Σ_r = {r_{a,r} : a ∈ Σ ∪ Σ_c, r ∈ {2, 3, ..., n}} be auxiliary pair-wise disjoint alphabets. Let Γ = Σ ∪ Σ_c ∪ Σ_r be the working alphabet, and let # be a symbol not in Γ. Notice that |Γ| = O(n^5 log n + |Σ|).
The main building blocks of the algorithm are two functions, Compress and Split. The first compresses strings by a factor of 2/3, and the other splits strings at random points. Their pseudo-code is provided as Algorithms 1 and 2. We describe them next.
Compress. The function Compress(B, ℓ) takes as input a string B over the alphabet Γ of length at least two, and an integer ℓ ≥ 1, which denotes the level number. Divide B into the minimum number of blocks B_1, ..., B_m, B = B_1 B_2 B_3 ... B_m, so that in each B_i either all the characters are the same, i.e., B_i = a^r for some a ∈ Γ and r ≥ 2, or no two adjacent characters are the same. The first step is to compress the B_i's which contain repeated characters, by simply replacing the whole B_i with the symbol r_{a,|B_i|}, where a is the repeated character. Then for the remaining blocks, the following compression is applied: Let B_i be an uncompressed block. Each character of B_i is colored by applying F_CVL, which subdivides B_i into subwords B′_1 ... B′_s such that in each B′_j only the first character is colored 1. By Proposition 2.4, the length of each B′_j is either 2 or 3. If B′_j = ab, replace it with C_ℓ(ab); if B′_j = abc, replace it with C_ℓ(ab) • c, where a, b, c ∈ Γ. The actual pseudo-code given below performs the compression of blocks of repeats in two stages: in the first stage we replace the repeated sequence a^r by r_{a,r} • #, and in the next stage we remove the extra symbol #. This simplifies the analysis in Lemma 3.10. Assuming that C_ℓ can be evaluated in time O(1), the running time of Compress(B, ℓ) is dominated by the time needed to compute F_CVL.

Algorithm 1 Compress(B, ℓ)
Input: String B over alphabet Γ of length at least two, and level number ℓ.
Output: String B′′ over alphabet Γ.
Divide B = B_1 B_2 ... B_m into the minimum number of blocks so that each maximal subword a^r of B, for a ∈ Γ and r ≥ 2, is one of the blocks.
for each i ∈ {1, ..., m} do
  if B_i = a^r, where r ≥ 2, then set B′_i = r_{a,r} • # and color r_{a,r} by 1 and # by 2;

Split. The function takes as input a string B over the alphabet Γ of length at least two, and an integer ℓ ≥ 1. The function splits the string B into smaller blocks. The algorithm works as follows: The main recursive step of the algorithm is encompassed in the function Process. The function gets a block B ∈ Γ* as its input. The block might have already been compressed previously, so the function also gets partial grammars that allow decompression of the block. If the block is already of length at most two, then the function outputs the block. Otherwise it compresses the block B using Compress, then it subdivides the compressed block using Split, and invokes itself recursively on each sub-block. For the output, each block is represented by a grammar. The grammar is reconstructed from the compressed block and its partial grammars by a simple breadth-first search algorithm provided in the function Grammar. To decompose an input string x into blocks, we first apply the function Split(x, 0) to x and then invoke Process(B, (), 1) on each of the obtained blocks B. Breaking the string x into sub-blocks guarantees that each block passed to Process has a small dictionary, whereas the dictionary of x could have been arbitrarily large.

For each r_{a,r} appearing in any of the rules in G, add r_{a,r} → a^r to G. Return G.

Correctness of the decomposition algorithm
Our goal is to establish the following theorem, which is a stronger version of Theorem 1.1:

Theorem 3.1. Let k ≤ n be integers. Let x and y be a pair of strings of length at most n with ED(x, y) ≤ k. Let G^x_1, ..., G^x_s and G^y_1, ..., G^y_{s′} be the sequences of grammars output by the decomposition algorithm on inputs x and y respectively, using the same choice of random functions C_1, ..., C_L and H_0, ..., H_L. The following is true for n large enough:

1. With probability at least 1 − 2/n, x = eval(G^x_1) • ··· • eval(G^x_s) and y = eval(G^y_1) • ··· • eval(G^y_{s′}).
2. With probability at least 1 − 2/n, for all i ∈ {1, ..., s} and j ∈ {1, ..., s′}, |G^x_i|, |G^y_j| ≤ S.
3. With probability at least 9/10, s = s′, G^x_i = G^y_i for all i ∈ {1, ..., s} except for at most k indices i, and ED(x, y) = Σ_i ED(eval(G^x_i), eval(G^y_i)).

By the union bound, all three parts happen simultaneously with probability at least 9/10 − 4/n, which is ≥ 4/5 for n large enough.
To prove the theorem, we first make some simple observations about the algorithm.

Proof. Let B = B_1 B_2 B_3 ... B_m be as in the procedure. Every block B_i that equals a^r, for some a and r ≥ 2, is reduced to one symbol by the compression. The other blocks are colored using F_CVL(·) and compressed. Unless a block B_i is of size one, the coloring induces a division of the block B_i into subwords of size two or three, where the former are compressed into one symbol and the latter into two symbols. Hence, each such block is compressed to at most 2/3 of its size. So the only blocks B_i that do not shrink are of size one, and those are sandwiched between blocks of repeated symbols (which shrink by a factor of at least two). The worst-case situation is when m is odd, the blocks B_i are of size one for odd i, and of size two for even i. In that case the original string B shrinks to size ⌊(2/3)|B|⌋ + 1. This proves the first inequality. The second inequality is also clear from the analysis above: the only time the string does not shrink is when it is of size one.
Corollary 3.3.On a string B of length at most n, the depth of the recursive calls of Process is at most L.
Indeed, from the previous lemma it follows that each block after ℓ compressions and splits is of size at most (2/3)^ℓ |B| + 3. Hence, after L = ⌈log_{3/2} n⌉ + 3 recursive calls, Process must stop the recursion.

Proof. First we add the starting rule to G. Then in each iteration of the main loop we add a rule of the type c → ab to G from some D_j. Hence, the number of such rules in G is at most 1 + Σ_j |D_j|. Lastly, we add to G rules for the symbols from Σ_r that appear on the right-hand sides of rules in G. This increases the size of G by at most |B| + 2 Σ_j |D_j|. If the D_j's are stored using some efficient data structure, such as binary search trees or hash tables indexed by the left-hand sides of rules, finding and adding each new rule to G takes time O(1). The size of C is bounded by O(|B| + |G|), so the nested loops make at most O(ℓ(|B| + |G|)) iterations in total. Hence, the total running time is bounded as claimed.
During the processing of a string x, there are at most Ln calls to the function Split. (The actual number of calls is O(n) as the blocks shrink exponentially, but this simple upper bound suffices.) The probability that any one of them produces a block with a dictionary larger than 5D log n is at most Ln/n^3. If all dictionaries are of size at most 5D log n, then so are all the partial grammars produced by Process. We can conclude the next corollary, which implies the second item of Theorem 3.1.
Corollary 3.6. For n large enough, on a string x of length at most n, processing the string x produces a sequence of grammars, each of size at most S = 15DL log n + 3, with probability at least 1 − 1/n.
For the grammars produced by the algorithm to be deterministic, we need each C_ℓ to be one-to-one on Dict(B) for each block B on which Compress(B, ℓ) is invoked. That happens with high probability, by a standard argument:

Lemma 3.7. Let B ∈ Γ* be of length at most n and ℓ ∈ {1, ..., L}. Let C_ℓ : Γ^2 → {c_i : (ℓ − 1)n^4 < i ≤ ℓn^4} be chosen at random from a pair-wise independent family of hash functions. Then with probability at least 1 − |B|/n^3, C_ℓ is one-to-one on Dict(B).
Proof. For two distinct elements of Dict(B), the probability of a collision for a randomly chosen C_ℓ is at most 1/n^4. By the union bound, the probability that C_ℓ is not one-to-one on Dict(B) is at most |B|^2/n^4 ≤ |B|/n^3.

During the processing of a string x, there are at most Ln calls to the function Compress. For a fixed level ℓ ∈ {1, ..., L}, the total size of the blocks B for which Compress(B, ℓ) is invoked is at most n. By the previous lemma and the union bound, the probability that any of those calls Compress(B, ℓ) uses a function C_ℓ that is not one-to-one on Dict(B) is at most 1/n^2. If all the hash functions C_1, C_2, ..., C_L that are used to compress blocks of x are one-to-one on their respective blocks, then the grammars that Grammar produces will be deterministic, and they will evaluate to their respective blocks of x. (We can actually conclude a stronger statement: each C_ℓ will be one-to-one on the union of all blocks at level ℓ with high probability.) We can conclude the next corollary, which implies the first item of Theorem 3.1.
Corollary 3.8. For n large enough, on a string x of length at most n, with probability at least 1 − L/n^2, processing the string x produces a sequence of deterministic grammars that evaluate to the corresponding blocks of x.

At this point we can estimate the running time of the decomposition algorithm. We let the algorithm fail, and produce some trivial decomposition of x, whenever Split produces a block with dictionary larger than 5D log n. If it does not fail, then all grammars are of size at most S, which is O(k). There are at most |x| of them and their total size is at most 2|x|, as each of the grammars G produces a string of size at least |G|/2. So the time spent in Grammar(. . . ) is bounded by O(|x|). The total time spent in Compress(. . . ) is proportional to the sum of sizes of all non-trivial blocks over all levels of recursion, which is O(|x|L) = O(|x|). (A more accurate estimate on the total size of blocks is O(|x|) since the blocks are shrinking geometrically in each iteration.) This means that the time to execute all calls to Compress is O(|x|LR) = O(|x|). The time spent in Split(. . . ) is dominated by the time needed to evaluate H_ℓ. The number of evaluation points at a given level ℓ is proportional to the total size of all blocks at that level. Since H_ℓ can be evaluated at a single point in time O(1), we get an upper bound of O(|x|L) = O(|x|) on the time spent in Split. Hence, in total the decomposition procedure runs in time O(|x|).

Proposition 3.9. Given k ≤ n, the running time of the decomposition algorithm on a string x of length at most n is O(|x|) with probability at least 1 − 1/n.
It remains to address the properties of the algorithm run on a pair of strings x and y of edit distance at most k to establish Theorem 3.1. For the pair of strings x and y we fix a canonical decomposition of x and y to be a sequence of words w_0, w_1, . . ., w_k, u_1, . . ., u_k, v_1, . . ., v_k such that x = w_0 u_1 w_1 u_2 w_2 · · · u_k w_k and y = w_0 v_1 w_1 v_2 w_2 · · · v_k w_k, where each u_i and v_i is of length at most one.
By the definition of edit distance such a decomposition exists: each pair (u_i, v_i) represents one edit operation, and we fix one such decomposition to be canonical. Observe that if we now partition x into blocks B^x_1, . . ., B^x_s so that each B^x_i starts within one of the w_j's, and we partition y into blocks B^y_1, . . ., B^y_s so that each block B^y_i starts at the corresponding location in w_j as B^x_i, then ED(x, y) = Σ_i ED(B^x_i, B^y_i).
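The observation can be checked directly with the classic dynamic program; the concrete strings and block boundaries below are an assumed toy example, not the paper's decomposition.

```python
def edit_distance(x, y):
    """Classic O(|x|*|y|) dynamic program for edit distance."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,          # delete x[i-1]
                         cur[j - 1] + 1,       # insert y[j-1]
                         prev[j - 1] + (x[i - 1] != y[j - 1]))  # substitute
        prev = cur
    return prev[n]

# Canonical decomposition (assumed example): x = w0 u1 w1, y = w0 v1 w1,
# where the pair (u1, v1) = ("X", "Y") is one substitution.
w0, w1, u1, v1 = "abc", "def", "X", "Y"
x, y = w0 + u1 + w1, w0 + v1 + w1

# Blocks cut at corresponding positions inside the shared words w_j:
bx = [w0 + u1, w1]   # the second block starts inside w1
by = [w0 + v1, w1]
assert edit_distance(x, y) == sum(edit_distance(a, b) for a, b in zip(bx, by))
print(edit_distance(x, y))  # 1
```

Cutting a block inside a u_i instead of a w_j could split one edit operation across two block pairs, which is exactly what the "undesirable split" analysis later rules out.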
We need to understand what happens with the decomposition of x and y when we apply the Compress function. Let x = uwv and x′ = Compress(x, ℓ) = u′w′v′, for some u, w, v, u′, w′, v′ ∈ Γ^*. We say that a symbol c in w′ comes from the compression of w if either it is directly copied from w by Compress, or it is the image c = C_ℓ(ab) of a pair of symbols ab where a belongs to w, or c = r_{a,r} replaced a block a^r where the first symbol of a^r belongs to w. We say w′ is the compression of w if it consists precisely of the symbols that come from the compression of w. Furthermore, we say a symbol c in w′ comes weakly from the compression of w if either it is directly copied from w by Compress, or it is the image c = C_ℓ(ab) of a pair of symbols ab where a or b belongs to w, or c = r_{a,r} replaced a block a^r where some symbol of a^r belongs to w. We say w′ is the weak compression of w if it consists precisely of the symbols that come weakly from the compression of w. Notice that the weak compression of w might contain an extra symbol at the beginning compared to the compression of w.
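A much-simplified stand-in for one compression level may make the symbol bookkeeping concrete. The real Compress chooses which pairs to merge from the F_CVL coloring; the sketch below instead greedily pairs adjacent symbols left to right, which is enough to see how run symbols r_{a,r} and pair symbols C_ℓ(ab) arise and how a block shrinks by a constant factor. The tuple representation of symbols is an assumption.

```python
from itertools import groupby

def compress_level(block, C):
    """One much-simplified compression level (a sketch, not the paper's Compress):
    stage 1 replaces every maximal run a^r, r >= 2, by a single run symbol;
    stage 2 greedily pairs adjacent symbols via the injective pairing function C.
    The real algorithm instead picks the pairs from the F_CVL coloring, which is
    what makes the compression locally consistent."""
    rules = {}
    stage1 = []
    for a, group in groupby(block):
        r = len(list(group))
        if r >= 2:
            rep = ('rep', a, r)
            rules[rep] = a * r            # rule r_{a,r} -> a^r
            stage1.append(rep)
        else:
            stage1.append(a)
    out, i = [], 0
    while i < len(stage1):
        if i + 1 < len(stage1):
            c = C(stage1[i], stage1[i + 1])
            rules[c] = (stage1[i], stage1[i + 1])   # rule c -> ab
            out.append(c)
            i += 2
        else:
            out.append(stage1[i])
            i += 1
    return out, rules

def expand(sym, rules):
    """Undo the compression by rewriting with the collected rules."""
    if sym in rules:
        rhs = rules[sym]
        return rhs if isinstance(rhs, str) else ''.join(expand(s, rules) for s in rhs)
    return sym

C = lambda a, b: ('pair', a, b)           # stand-in for the random injective C_ell
out, rules = compress_level("mississippi", C)
print(''.join(expand(s, rules) for s in out))   # mississippi
print(len(out))                                  # 4 symbols instead of 11
```

Each output symbol "comes from the compression" of the input symbols it covers, which is the notion the surrounding definitions formalize.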
The following lemma captures what compression does to the canonical decomposition of x and y. (See Fig. 2 for illustration.)

Lemma 3.10. Let x and y be strings over Γ, and let x′ = Compress(x, ℓ) and y′ = Compress(y, ℓ). Let x = w_0 u_1 w_1 u_2 w_2 · · · u_q w_q and y = w_0 v_1 w_1 v_2 w_2 · · · v_q w_q for some strings w_i, u_i and v_i. Then x′ = w′_0 u′_1 w′_1 · · · u′_q w′_q and y′ = w′_0 v′_1 w′_1 · · · v′_q w′_q for some strings w′_i, u′_i and v′_i, where each w′_i is the compression of the same subword of w_i in both x and y, and for i ∈ {1, . . ., q}, u′_i and v′_i are of length at most 4R + 24.
For each such pair and each ℓ we fix one choice of w′_0, . . ., w′_q, u′_0, . . ., u′_q, v′_0, . . ., v′_q satisfying the lemma. We will refer to it as the canonical decomposition of x′ and y′ induced by the decomposition of x and y, as given by the lemma.
Proof. The first stage of Compress replaces maximal blocks of repeated symbols by shortcuts. To simplify our analysis, we first reassign blocks of repeated symbols among the neighboring blocks of w_i, u_i and v_i, resp., so that each maximal block of repeated symbols in x and y is fully contained in one of the words w_i, u_i or v_i.
For i = 1, . . ., q − 1 we define words w^{(1)}_i and parameters a_i, b_i ∈ Γ and k_i, k′_i ∈ N as follows: If w_i contains at least two distinct symbols, let w_i = a_i^{k_i} w^{(1)}_i b_i^{k′_i} so that k_i and k′_i are maximum possible; otherwise w_i = a_i^{k_i} for some a_i and k_i (k_i might be zero), and we set w^{(1)}_i to be empty. Similarly, let w_0 = w^{(1)}_0 b_0^{k′_0} for maximum possible k′_0 and some symbol b_0, and let w_q = a_q^{k_q} w^{(1)}_q for maximum possible k_q and some symbol a_q. For i = 1, . . ., q, we let u^{(1)}_i = b_{i−1}^{k′_{i−1}} u_i a_i^{k_i} and v^{(1)}_i = b_{i−1}^{k′_{i−1}} v_i a_i^{k_i}, so x = w^{(1)}_0 u^{(1)}_1 w^{(1)}_1 · · · u^{(1)}_q w^{(1)}_q and y = w^{(1)}_0 v^{(1)}_1 w^{(1)}_1 · · · v^{(1)}_q w^{(1)}_q. Next, if there is a maximal block of repeated symbols a^r starting in u^{(1)}_s and ending in u^{(1)}_t, s ≠ t, we add all the symbols of the a^r to the end of u^{(1)}_s and remove them from the other u^{(1)}_i, i = s + 1, . . ., t. (Notice, w^{(1)}_i = ε for s < i < t because of the definition of w^{(1)}_i, and u^{(1)}_i will become empty for s < i < t.) We do this for all maximal blocks of repeated symbols that span multiple u^{(1)}_i. We perform similar moves on the v^{(1)}_i's. After all of those moves we denote the resulting subwords by w^{(2)}_i, u^{(2)}_i and v^{(2)}_i. At this stage, each maximal block of repeated symbols in x or y is contained in one of the subwords w^{(2)}_i, u^{(2)}_i and v^{(2)}_i. The first stage of Compress replaces each maximal block a^r, r ≥ 2, by a sequence r_{a,r} #, and we apply this procedure on each subword w^{(2)}_i, u^{(2)}_i and v^{(2)}_i to obtain the corresponding subwords w^{(3)}_i, u^{(3)}_i and v^{(3)}_i. We have |u^{(3)}_i| ≤ 4R + 28. This is because every u_i is transformed into u^{(3)}_i by appending or prepending a possibly empty block of repeated symbols, i.e., u^{(3)}_i = a^r u_i b^{r′} for some a, b, r, r′, or by removing its content entirely. Each block of repeats is reduced to two symbols, so each u^{(3)}_i is longer than the original by at most 4 symbols. Similarly for v^{(3)}_i. Next, the coloring function F_CVL is used on the parts of x and y that are not obtained from repeated symbols; the two symbols replacing each repeated block are colored by 1 and 2, resp. We refer to this as the {1, 2, 3}-coloring. At most the R first and last symbols of each w^{(3)}_i might be colored differently in x and y, as the color of each symbol depends on the context of at most R symbols on either side of the symbol, and that context might differ in x and y. Hence, only symbols near the border of w^{(3)}_i in x and y, resp., might get different colors. All the other symbols of w^{(3)}_i are colored the same in both x and y. The coloring is then used to make decisions on which pairs of symbols are compressed into one.
We will let u′_i be the symbols that come from the compression of the symbols in u^{(3)}_i, the first up-to R + 2 symbols of w^{(3)}_i, and the last up-to R + 3 symbols of w^{(3)}_{i−1}. Next we specify precisely which symbols of w^{(3)}_i and w^{(3)}_{i−1} are considered to be compressed into symbols belonging to u′_i. For i = 0, . . ., q, if |w^{(3)}_i| ≥ 2R + 3, set s^x_i to be the first position colored 1 among the symbols of w^{(3)}_i[R + 1, . . . ], and t^x_i to be the last position colored 1 among the symbols of w^{(3)}_i[1, |w^{(3)}_i| − R]; otherwise set both s^x_i and t^x_i to be the position of the first symbol in w^{(3)}_i. If |w^{(3)}_q| < R + 3 then redefine s^x_q to t^x_q. Similarly, define s^y_i and t^y_i based on the {1, 2, 3}-coloring of y. Notice, for |w^{(3)}_i| ≥ 2R + 3 we get s^x_i = s^y_i and t^x_i = t^y_i, as the symbols R-away from either end of w^{(3)}_i are colored the same in x and y. We let u′_i be the compression of w^{(3)}_{i−1}[t^x_{i−1}, . . . ] u^{(3)}_i w^{(3)}_i[1, s^x_i), and similarly v′_i the compression of w^{(3)}_{i−1}[t^y_{i−1}, . . . ] v^{(3)}_i w^{(3)}_i[1, s^y_i). We let w′_i be the compression of w^{(3)}_i[s^x_i, t^x_i) = w^{(3)}_i[s^y_i, t^y_i). Hence, u′_i comes from the compression of at most |u^{(3)}_i| + 2R + 5 ≤ 6R + 33 symbols. Since each symbol after a symbol colored 1 is removed by the compression, and each consecutive triple of symbols contains at least one symbol colored by 1, the at most 6R + 33 symbols are compressed into at most (6R + 33) · 2/3 + 2 = 4R + 24 symbols. So u′_i is of length at most 4R + 24. Similarly for v′_i.
The following generalization of the previous lemma will be useful for designing a rolling sketch. It considers the situation where x and y are prefixed by some strings u and v, resp., that we want to exclude from the analysis. The proof of the lemma is a straightforward modification of the above proof.
Lemma 3.11. Let u, v ∈ Γ^*, let x = u_0 w_0 u_1 w_1 · · · u_q w_q and y = v_0 w_0 v_1 w_1 · · · v_q w_q for some strings w_i, u_i and v_i, and let x′ = Compress(ux, ℓ) and y′ = Compress(vy, ℓ). Then the parts of x′ and y′ that come weakly from the compression of x and y can be written as u′_0 w′_0 u′_1 w′_1 · · · u′_q w′_q and v′_0 w′_0 v′_1 w′_1 · · · v′_q w′_q for some strings w′_i, u′_i and v′_i, where for i ∈ {0, . . ., q}, w′_i is the compression of the same subword of w_i in both x and y.
For ℓ ≥ 0 and j ∈ {1, . . ., s^x_ℓ}, let i and m be such that B^x(ℓ, j) is the m-th block of Split(A^x(ℓ − 1, i), ℓ). If A^x(ℓ − 1, i) has a canonical decomposition w_0 u_1 w_1 · · · u_q w_q, for some u_i, w_i ∈ Γ^*, then the decomposition of B^x(ℓ, j) is the restriction of the decomposition of A^x(ℓ − 1, i) to the symbols of the m-th block of Split(A^x(ℓ − 1, i), ℓ). Otherwise the decomposition of B^x(ℓ, j) is undefined. Similarly for B^y(ℓ, j). (See Fig. 2 for illustration.)
Let w′_0 u′_1 w′_1 · · · u′_q w′_q and w′_0 v′_1 w′_1 · · · v′_q w′_q be their canonical decompositions induced by B^x(ℓ, j) and B^y(ℓ, j) as given by Lemma 3.10.
To conclude item 3 of Theorem 3.1 we want to argue that x and y are recursively split into sub-blocks that respect their canonical decomposition. So we want all splits of blocks to occur in matching parts of x and y. For A^x(ℓ − 1, i) with canonical decomposition w_0 u_1 w_1 u_2 w_2 · · · u_q w_q we say that Split(A^x(ℓ − 1, i), ℓ) makes an undesirable split if it starts a new block at a position j that either belongs to one of the u_1, u_2, . . ., u_q or is the first or last symbol of one of the w_0, w_1, . . ., w_q. Recall, Split(A^x(ℓ − 1, i), ℓ) starts a new block at each position j such that H_ℓ(A^x(ℓ − 1, i)[j, j + 1]) = 0. Since H_ℓ is chosen at random, a given position starts a new block with probability 1/D.
Similarly, for A^y(ℓ − 1, i) with canonical decomposition w′_0 v_1 w′_1 · · · v_{q′} w′_{q′} we say that Split(A^y(ℓ − 1, i), ℓ) makes an undesirable split if it starts a new block at a position j that either belongs to one of the v_1, v_2, . . ., v_{q′} or is the first or last symbol of one of the w′_0, w′_1, . . ., w′_{q′}. If A^x(ℓ − 1, i) and A^y(ℓ − 1, i) have matching canonical decompositions (that is, q = q′ and each w_j = w′_j) and both Split(A^x(ℓ − 1, i), ℓ) and Split(A^y(ℓ − 1, i), ℓ) make no undesirable split, then A^x(ℓ − 1, i) and A^y(ℓ − 1, i) are split into the same number of blocks with matching canonical decompositions, as they are split at the same positions in the corresponding w_j's.
For a given ℓ ∈ {0, . . ., L}, if no undesirable split happens during Split(A^x(ℓ′ − 1, i), ℓ′) and Split(A^y(ℓ′ − 1, i), ℓ′), for any ℓ′ < ℓ and i, then for each ℓ′ < ℓ, the number of blocks B^x(ℓ′, i) and B^y(ℓ′, i) will be the same, i.e., s^x_{ℓ′} = s^y_{ℓ′}, and the blocks B^x(ℓ′, i) and B^y(ℓ′, i) will have matching canonical decompositions. The total number of u_j's in the canonical decompositions of all B^x(ℓ′, i), i = 1, . . ., s^x_{ℓ′}, will be at most k, and similarly for the v_j's. Thus, there will be at most (4R + 24 + 2)k + 2 positions where an undesirable split can happen in Split(A^x(ℓ − 1, i), ℓ) for any i. Similarly, there are at most (4R + 26)k + 2 positions where an undesirable split can happen in Split(A^y(ℓ − 1, i), ℓ). By the union bound, the probability that an undesirable split happens in some Split(A^x(ℓ − 1, i), ℓ) or Split(A^y(ℓ − 1, i), ℓ), for some ℓ and i, is at most 2(4R + 28)k(L + 1)/D ≤ 11Rk(L + 1)/D ≤ 1/10. Thus, if no undesirable split happens, there are at most k indices i for which the canonical decomposition of B^x(ℓ, i) contains some u_j. All other blocks B^x(ℓ, i) have a canonical decomposition consisting of a single block w_0, for various w_0 depending on ℓ and i. Similarly, the canonical decomposition of B^y(ℓ, i) contains v_j if and only if that of B^x(ℓ, i) contains u_j. The blocks B^y(ℓ, i) that do not contain any v_j are identical to B^x(ℓ, i), so they have the same grammar.
Hence, if no undesirable split happens, item 3 of Theorem 3.1 will be satisfied.
The following theorem generalizes item 3 of Theorem 3.1 and it will be useful to construct the rolling sketch in Section 4.

Theorem 3.12. Let u, v, x, y ∈ Σ^* be strings such that |ux|, |vy| ≤ n and ED(x, y) ≤ k. Let G^x_1, . . ., G^x_s and G^y_1, . . ., G^y_{s′} be the sequences of grammars output by the decomposition algorithm on inputs ux and vy respectively, using the same choice of random functions C_1, . . ., C_L and H_0, . . ., H_L. With probability at least 1 − 1/5 the following is true: there exist integers r, r′, t, t′ for which the conclusion of item 3 of Theorem 3.1 holds for the grammars G^x_{r+1}, . . ., G^x_s and G^y_{r′+1}, . . ., G^y_{s′} corresponding to x and y.

Its proof is a minor modification of the proof above. We start with the canonical decomposition of ux and vy and follow the compression and split procedures. We want to argue that during each split operation, all splits occur either in the w_j's and are the same in ux and vy, or they occur in u or v where we do not care about them. Again we define a split to be undesirable if it starts a new block at a position j that belongs to one of the u_0, u_1, . . ., u_k, v_0, v_1, . . ., v_k or is the position of the first or last symbol of one of the w_0, w_1, . . ., w_k. Inductively we maintain that whenever a block B^{ux}(ℓ, i) contains a descendant of the compression of u_j, its corresponding block B^{vy}(ℓ, i′) contains a descendant of the compression of v_j. (Here, the correspondence is counting from the highest index i to the lowest, and similarly for i′, so B^{ux}(ℓ, i) corresponds to B^{vy}(ℓ, i′) if i − i′ = s^{ux}_ℓ − s^{vy}_ℓ.) If the blocks contain a descendant of u_0 and v_0, resp., then we apply Lemma 3.11 to construct a descendant decomposition after their compression. For all other blocks that contain some w_j, u_j or v_j we use Lemma 3.10 to construct their descendant decomposition. We do not care about the decomposition of blocks B^{ux}(ℓ, i) that are descendants of u but do not contain u_0, and similarly we do not care about the decomposition of blocks B^{vy}(ℓ, i) that are descendants of v but do not contain v_0. (They might be decomposed arbitrarily, so the number of blocks that are descendants of u might differ from the number of blocks that are descendants of v.)
Inductively, there are at most 2(4R + 28)(k + 1) positions where an undesirable split can happen in the blocks B^{ux}(ℓ, i) and B^{vy}(ℓ, i) at a given level ℓ. In total there are at most 2(4R + 28)(k + 1)(L + 1) positions where an undesirable split can happen. Thus, the probability of making an undesirable split during a run of the algorithm is bounded by 2(4R + 28)(k + 1)(L + 1)/D ≤ 22Rk(L + 1)/D ≤ 1/5. If no undesirable split ever happens, then the symbols that are weak compressions of symbols from x and y are contained within the corresponding blocks B^{ux}(ℓ, i) and B^{vy}(ℓ, i′). For the blocks B^{ux}(ℓ, i) and B^{vy}(ℓ, i′) that contain descendants of u_0 and v_0, it is fine if their prefixes that descend from u and v, resp., which are to the left of the descendants of u_0 and v_0, are split differently in B^{ux}(ℓ, i) and B^{vy}(ℓ, i′). This does not affect the correspondence between the blocks B^{ux}(ℓ, i) and B^{vy}(ℓ, i′) that weakly come from x and y. This concludes the proof of Theorem 3.12.

Encoding a grammar
We will set a parameter N ≥ n^3 to be a suitable integer, and set M = 3S · ⌈1 + log |Γ|⌉. Let F_KR : {0, 1}^* → {1, . . ., N} be a hash function picked at random, such as a Karp-Rabin fingerprint [KR87], so that for any two distinct strings u, v ∈ {0, 1}^* of length at most M, F_KR(u) = F_KR(v) with probability at most 2M/N. We will encode a grammar G over Γ of size at most S given by our decomposition algorithm by a string Enc(G) over the alphabet {1, . . ., 2N} of length M. The encoding is obtained as follows: First, order the rules of the grammar G lexicographically. Then encode the rules in binary one by one, using 3 · ⌈1 + log |Γ|⌉ bits for each rule. (The extra bit allows us to mark unused symbols.) This gives a binary string of length at most M, which we pad by zeros to length precisely M. We call the resulting binary string Bin(G). Compute h_G = F_KR(Bin(G)). We replace each 0 in Bin(G) by h_G, and each 1 in Bin(G) by N + h_G, to obtain the string Enc(G). Clearly, Enc(G) is a string over the alphabet {1, . . ., 2N} of length exactly M. The encoding can be computed in time O(M). For completeness, any grammar G of size more than S, or one that uses rules with more than two symbols on the right-hand side, is encoded by a fixed default string. By the property of F_KR the following holds.
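A small sketch of the encoding may help; the bit widths, the integer-triple rule representation, and the modular hash standing in for the random fingerprint F_KR are all assumptions of this toy.

```python
def encode_grammar(G, S, gamma_bits, N):
    """Sketch of Enc(G): serialize the lexicographically ordered rules c -> ab
    (integer triples) using 1 + gamma_bits bits per symbol, pad Bin(G) to
    exactly M = 3*S*(1 + gamma_bits) bits, fingerprint it, and replace each
    bit b by h_G + b*N.  The modular hash below is a toy deterministic
    stand-in for the random Karp-Rabin fingerprint F_KR."""
    M = 3 * S * (1 + gamma_bits)
    bits = []
    for rule in sorted(G):                            # rules as sorted triples
        for sym in rule:
            bits += [(sym >> i) & 1 for i in range(gamma_bits, -1, -1)]
    bits += [0] * (M - len(bits))                     # pad to length exactly M
    h = int(''.join(map(str, bits)), 2) % N + 1       # stand-in for F_KR(Bin(G))
    return [h + b * N for b in bits]                  # symbols over {1, ..., 2N}

enc = encode_grammar({(5, 1, 2), (6, 5, 5)}, S=4, gamma_bits=7, N=10**9)
print(len(enc))   # M = 3 * 4 * (1 + 7) = 96
```

Because every output symbol carries the fingerprint h_G, two grammars whose fingerprints differ produce encodings that differ in every position, which is the property Lemma 3.13 relies on.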
Lemma 3.13. Let G, G′ be two grammars of size at most S output by our decomposition algorithm, and let F_KR be chosen at random. If G = G′ then Enc(G) = Enc(G′). If G ≠ G′ then, with probability at least 1 − 2M/N, the strings Enc(G) and Enc(G′) differ in every symbol.

Edit distance sketch
Let n and k ≤ n be two parameters, and let p ≥ 2N + 1 be a prime such that p ≥ (nM)^3. For a string x ∈ Σ^* of length at most n, we compute its sketch by first running the decomposition algorithm of Theorem 3.1 to get grammars G_1, G_2, . . ., G_s. Encode each grammar G_i by the encoding Enc(G_i) from Section 3.3, using the same F_KR picked at random. Concatenate the encodings to get a string w = Enc(G_1) Enc(G_2) · · · Enc(G_s). Calculate the Hamming sketch sk^Ham_{n′,k′,p}(w) on w for strings of length n′ = nM and Hamming distance at most k′ = kM from Section 2.2. Set the sketch sk^ED_{n,k}(x) = sk^Ham_{n′,k′,p}(w). The calculation of sk^ED_{n,k}(x) can be done in time O(nk), as the number of grammars is at most n and each grammar requires O(k) time to be encoded into binary. The Hamming sketch can be constructed in time O(nk). (We believe that on average we expect only O(n/k) grammars to be produced for a given string x, so the actual running time should be O(n) on average.)

Theorem 3.14. Let x, y ∈ Σ^* be strings of length at most n such that ED(x, y) ≤ k. Let sk^ED_{n,k}(x) and sk^ED_{n,k}(y) be obtained using the same randomness for the decomposition algorithm and the same choice of F_KR. With probability at least 2/3, we can calculate ED(x, y) from sk^ED_{n,k}(x) and sk^ED_{n,k}(y).
Assume that the output of the decomposition algorithm on x and y satisfies all the conclusions of Theorem 3.1. In particular, we get x = eval(G^x_1) · · · eval(G^x_s) and y = eval(G^y_1) · · · eval(G^y_s), for some s ≤ n, each of the grammars is of size at most S, ED(x, y) = Σ_i ED(eval(G^x_i), eval(G^y_i)), and the number of pairs G^x_i and G^y_i that differ is at most k. In order to determine ED(x, y), we recover the (Hamming) mismatch information between Enc(G^x_1) · · · Enc(G^x_s) and Enc(G^y_1) · · · Enc(G^y_s) from sk^ED_{n,k}(x) and sk^ED_{n,k}(y). That gives the grammars G^x_i and G^y_i for all i where G^x_i ≠ G^y_i. (Whenever the two grammars differ, their encodings differ in every symbol by Lemma 3.13, so we can recover them from the Hamming mismatch information.) Calculating the edit distance of each pair of differing grammars using the algorithm from Proposition 2.1, we recover ED(x, y) as the sum of their edit distances.
The sum is correct unless some of the assumptions fail: The probability that the grammar decomposition fails (does not have the properties from Theorem 3.1) for the pair x and y is at most 1/5 for n large enough. The probability that the choice of F_KR fails (two distinct grammars have the same encoding) is at most 2kM/N < 1/n by the choice of N. The probability that the Hamming distance sketch fails to recover the mismatch information between all the grammars is at most 1/n. So in total, the probability that the output of the algorithm is incorrect is at most 1/3.

To decide whether ED(x, y) > k, we note that on input x and y, the Hamming sketch either outputs the correct mismatched places if their number is at most k′, or it outputs ∞ if there are more mismatches than that or the sequences sketched by the Hamming sketch are of different lengths. (We assume that the Hamming sketch knows the number of symbols it is sketching.) In the ∞-case we know that there are more than k different pairs of grammars or the decomposition of x and y failed, and we can report ED(x, y) > k. In the other case we try to calculate the edit distance of the differing pairs of grammars. If we spend more than O(k^2) time on it, or we get a number larger than k, then we report ED(x, y) > k. This correctly decides whether ED(x, y) > k with probability at least 2/3.
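The recovery step can be mimicked without the actual Hamming sketch. In the toy below, a direct scan for mismatched block indices stands in for the mismatch-information recovery, and the block lists are an assumed example decomposition.

```python
def edit_distance(x, y):
    """Classic dynamic program for edit distance (row-by-row)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]

def ed_from_block_lists(blocks_x, blocks_y, k):
    """Toy stand-in for the sketch comparison: find the mismatched block
    indices, sum the edit distances of the differing pairs, and answer None
    (mirroring the sketch's infinity answer, i.e. ED(x, y) > k) when the
    counts differ, too many pairs differ, or the sum exceeds k."""
    if len(blocks_x) != len(blocks_y):
        return None
    mismatched = [i for i, (a, b) in enumerate(zip(blocks_x, blocks_y)) if a != b]
    if len(mismatched) > k:
        return None
    total = sum(edit_distance(blocks_x[i], blocks_y[i]) for i in mismatched)
    return total if total <= k else None

bx = ["abra", "cad", "abra"]
by = ["abra", "cut", "abra"]
print(ed_from_block_lists(bx, by, k=3))   # 2
```

Only the differing pairs are touched, which is why the comparison time depends on k rather than on n.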
To prove Theorem 1.2 we build a more robust sketch by taking c log n independent copies of the sketch sk^ED_{n,k}. To calculate the edit distance of two sketched strings, we run the edit distance calculation on each of the corresponding pairs of copies and output the majority answer. A standard application of the Chernoff bound shows that the probability of a correct answer is at least 1 − 1/n for a suitable constant c > 0.
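The amplification is the standard majority trick; the base procedure below is a toy stand-in that answers correctly with probability 2/3.

```python
import random
from collections import Counter

def amplified_answer(single_run, copies, rng):
    """Majority vote over independent copies of a procedure that answers
    correctly with probability bounded away from 1/2; by a Chernoff bound the
    majority is wrong with probability exponentially small in `copies`."""
    answers = [single_run(rng) for _ in range(copies)]
    return Counter(answers).most_common(1)[0][0]

# Toy base procedure: the true answer is 7, returned with probability 2/3.
base = lambda rng: 7 if rng.random() < 2 / 3 else rng.randrange(100)
print(amplified_answer(base, copies=1001, rng=random.Random(42)))
```

With copies = c log n the failure probability drops below 1/n, which is the guarantee used for Theorem 1.2.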

Rolling sketch for edit distance
In this section we will construct the rolling sketch of Theorem 1.3. We will use two claims that will be proved in Section 4.1. The first one addresses how much the compression of a string w might change depending on what is appended to it.

Lemma 4.1. Let ℓ ∈ {0, . . ., L} and v, u, w ∈ Γ^*. Let w′u′ = Compress(wu, ℓ) and let w′′v′ = Compress(wv, ℓ), where w′ is the compression of w when compressing wu and w′′ is the compression of w when compressing wv. Then w′ and w′′ agree except possibly in their last 3R + 6 symbols.
The next lemma addresses how much the overall decomposition of a string x might change if we append a suffix z to it.

Lemma 4.2. Let x, z ∈ Σ^*, |xz| ≤ n. Let H_0, . . ., H_L, C_1, . . ., C_L be given. Let G^x_1, G^x_2, . . ., G^x_s be the output of the decomposition algorithm on input x, and G^{xz}_1, G^{xz}_2, . . ., G^{xz}_{s′} be the output of the decomposition algorithm on input xz using the given hash functions. Let T = L(3R + 6). Then (1) G^{xz}_i = G^x_i for every i ≤ s − T, and (2) s′ ≤ s + T + |z|.

By the second part of Lemma 4.2, t′ ≤ t + T ≤ 2T, so we will commit at most T = O(1) grammars. It takes time O(MT) = O(k) to prepare the binary encoding of each of the committed grammars, and O(k^2) to insert it into the Hamming sketch. The update of the active grammars takes O(k) time, as described below. So in total this step takes O(k^2) time.

Removing a symbol. The deletion buffer works in a manner similar to the insertion buffer: we add the removed symbol a to the active grammars, but when committing the grammar G_{r+1}, we use the F_KR-fingerprint of all the grammars G_{r−4T+1}, . . ., G_{r+1} to encode the grammar G_{r−2T+1}, which is then removed from the beginning of the sequence of grammars represented by the Hamming sketch (if r − 2T + 1 > 0), i.e., we update the Hamming sketch to reflect this removal. Similarly to appending a symbol, this step takes time O(k^2).

Active grammar update. The update of the active grammars G_{i_1}, . . ., G_{i_t} when appending a is done as follows. G_1, . . ., G_s, G_{i_1}, . . ., G_{i_t} represents ux, so we need to calculate the grammars for uxa. We claim that only the active grammars might change: At some point, G_s became committed, so at that time there were T active grammars following it. If at that point the grammars together represented a string z, then by appending more symbols to z we cannot change the grammars G_1, G_2, . . ., G_s, according to the first part of Lemma 4.2. So appending a to ux will affect only the active grammars.
From the analysis in the proof of Lemma 4.2 it follows that for ℓ ∈ {0, . . ., L}, if B^{ux}(ℓ, 1), . . ., B^{ux}(ℓ, s^{ux}_ℓ) is the trace of the decomposition algorithm on ux at level ℓ, and B^{uxa}(ℓ, 1), . . ., B^{uxa}(ℓ, s^{uxa}_ℓ) is the trace on uxa, then their difference spans at most the ℓ(3R + 6) last symbols of B^{ux}(ℓ, 1) · · · B^{ux}(ℓ, s^{ux}_ℓ). So instead of decompressing the active grammars completely, adding a and recompressing them back, we only decompress the necessary part of each trace: starting from the top level, we iteratively rewrite all level-ℓ symbols in the string using the appropriate grammars while only maintaining at most the T last symbols of the resulting string. (Care has to be taken to maintain information about any sequence a^r stretching from those T last symbols to the left.) We add a to the resulting string and re-apply the compress and split procedures for levels 0, 1, . . ., ℓ − 1 to recompress only the part of the trace affected by modifications. As we perform the compression of symbols, we maintain a set G of all grammar rules needed for decompression. (We initialize G with the union of all rules from the active grammars G_{i_1}, . . ., G_{i_t} minus the starting rules, and we iteratively add new rules coming from the recompression.) For the recompression we need to know the context of up-to R + 1 symbols preceding the modified part of the trace. On the other hand, the modification can affect the recompression of up-to R + 1 symbols to the left of the left-most modified symbol in the trace. Those R + 1 symbols all happen to be within the decompressed suffix of the trace of size at most T.
Eventually, we get a new level-L trace B^{uxa}(L, s^{uxa}_L − t′ + 1), . . ., B^{uxa}(L, s^{uxa}_L), for some t′. Each new grammar G′_{i_j} is obtained by taking the grammar G ∪ {# → B^{uxa}(L, s^{uxa}_L − t′ + j)} and removing from it all useless rules. This can be done in time O(|G|). (See Section 2.1.)
Overall, the update of the active grammars on insertion of a single symbol requires O(LT) = O(1) evaluations of the split hash functions H_0, . . ., H_L, O(LT) = O(1) evaluations of the compress hash functions C_1, . . ., C_L, and O(T(LT + Σ_{j=1}^t |G_{i_j}|)) time to produce the new grammars. As the total size of the grammars is O(k) and the time to evaluate H_ℓ at a single point is O(1), the overall time for the update of the active grammars is O(k). We provide a more detailed description of the update procedure in Section 5.

Edit distance evaluation. Consider strings x and y of length at most m and edit distance at most k. Consider the rolling sketch sk^Rolling_{m,k}(x) for x obtained by inserting the symbols of ux and removing the symbols of u, for some u ∈ Σ^* where |ux| ≤ m. Consider also the rolling sketch for y obtained by inserting the symbols of vy and removing the symbols of v, for some v ∈ Σ^* where |vy| ≤ m. Both sketches should use the same randomness, that is, start from the same sketch for the empty string.

The update procedure needs access also to the previous at most T committed grammars (to have the proper context for re-compression). Our rolling sketch algorithm has those committed grammars available in appropriate buffers. Thus we will assume that the update function is always invoked with exactly T + 1 grammars, unless x is decomposed into fewer than T + 1 grammars. Some of the first few grammars from the output of the update procedure should be discarded, as they correspond to grammars that should stay the same. In particular, if there are t active grammars and s committed grammars, then we should discard the first min(s, T + 1 − t) grammars from its output. The following statement encapsulates the properties of our update procedure UpdateActiveGrammars().
Here we assume that the decomposition algorithm fails on neither x nor x • a with respect to producing correct deterministic grammars, so the first two parts of Theorem 3.1 are satisfied for x and y = x • a and the choice of functions C_1, . . ., C_L and H_0, . . ., H_L. For simplicity of our implementation, we assume a stronger property of C_1, . . ., C_L: each C_ℓ is one-to-one on the union of the dictionaries of all blocks of x and x • a at level ℓ. (See the remark after Lemma 3.7.)

Auxiliary functions
Our update algorithm uses several simple and straightforward auxiliary functions, which we describe next. Function DecompressSymbol(c, G, ℓ, t) takes a symbol c ∈ Γ, and if it is a level-ℓ symbol compressed by the grammar G, then it returns its decompression truncated to a length of at most t symbols. Otherwise it returns the original symbol c.
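A sketch under an assumed symbol representation (the paper leaves the concrete one open): tuples ('c', ℓ, id) model pairing symbols and ('r', ℓ, a, r) model run symbols r_{a,r}.

```python
def decompress_symbol(c, G, level, t):
    """Sketch of DecompressSymbol(c, G, ell, t): expand a level-`level` pairing
    symbol via its rule c -> ab, expand a run symbol to at most t copies, and
    leave every other symbol unchanged.  Tuple encoding is an assumption."""
    if isinstance(c, tuple) and c[0] == 'c' and c[1] == level:
        a, b = G[c]                  # rule c -> ab
        return [a, b]
    if isinstance(c, tuple) and c[0] == 'r' and c[1] == level:
        a, r = c[2], c[3]
        return [a] * min(t, r)       # truncated expansion of a^r
    return [c]

G = {('c', 1, 0): ('x', 'y')}
print(decompress_symbol(('c', 1, 0), G, 1, 10))      # ['x', 'y']
print(decompress_symbol(('r', 1, 'a', 5), G, 1, 3))  # ['a', 'a', 'a']
```

The truncation parameter t is what lets the update procedure decompress only a bounded suffix of a trace.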
Function CompressWithGrammar(B, ℓ) is an extension of Compress(B, ℓ) that, in addition to the compressed block B at level ℓ, returns the set of grammar rules used for the compression of B at this level.
Finally, function FindCompressedPrefix(Z, p, ℓ) returns the length of the smallest prefix of a string Z that decompresses into at least p symbols at level ℓ.
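FindCompressedPrefix only needs per-symbol decompressed lengths, so it can be sketched as a prefix-sum scan, under an assumed tuple representation where pairing symbols ('c', ℓ, id) expand to 2 symbols and run symbols ('r', ℓ, a, r) expand to r.

```python
def symbol_length(c, level):
    """Length of the level-`level` decompression of one symbol (cf.
    DecompressSymbolLength): pairing symbols expand to 2, run symbols to r,
    everything else stays a single symbol.  Tuple encoding is an assumption."""
    if isinstance(c, tuple) and c[1] == level:
        return c[3] if c[0] == 'r' else 2
    return 1

def find_compressed_prefix(Z, p, level):
    """Length of the shortest prefix of Z whose level-`level` decompression
    has at least p symbols; len(Z) + 1 signals that no prefix is long enough."""
    total = 0
    for i, c in enumerate(Z, 1):
        total += symbol_length(c, level)
        if total >= p:
            return i
    return len(Z) + 1

Z = [('c', 1, 0), ('r', 1, 'a', 5), 'x']
print(find_compressed_prefix(Z, 4, 1))   # 2: the first two symbols expand to 2 + 5 >= 4
```

Because run symbols store r implicitly, the scan runs in time proportional to the compressed prefix, not to its decompression.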
Remove from G′ unnecessary rules to get G′_i (as in Section 2.1).
Append G′_i to AG′. end Return AG′.

Function Recompress(B, Z, F, u, r, ℓ) gets a sequence B = (B_0, . . ., B_s) of blocks that represent the compression of the updated Z_{ℓ−1} (after adding a) up-to level ℓ − 1. It also gets the original Z_ℓ, the splitting depth string F_ℓ, the number of symbols u_ℓ that were decompressed from Z_ℓ to get the original Z_{ℓ−1}, and the parameter r_ℓ that indicates that the first r_ℓ symbols of Z_{ℓ−1} are a partial decompression of the repeat symbol Z_ℓ[u_ℓ]. It outputs a sequence of blocks B′ that represent the updated block Z_ℓ compressed up-to level ℓ, and a set of rules G′ that were used for the compression at level ℓ.
Blocks B_1, . . ., B_s can be independently compressed and split at level ℓ. The block B_0 needs special treatment, though, as it needs to be combined with its possible remainder in Z_ℓ. This is done in the function RecompressFirstBlock(B_0, Z, F, u, r, ℓ). The remaining blocks for the output of Recompress() are obtained from Z_ℓ by splitting it into blocks according to F_ℓ.
Split(B, ℓ) starts a new block at each position i such that H_ℓ(B[i, i + 1]) = 0. The running time of Split(B, ℓ) is dominated by the time to evaluate H_ℓ at |B| − 2 points.

Algorithm 2 Split(B, ℓ)
Input: String B over alphabet Γ of length at least two, and level number ℓ.
Output: A sequence of strings (B_0, B_1, . . ., B_s) over alphabet Γ.
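A minimal sketch of Split follows; the memoized hash h below is a stand-in for the random H_ℓ, which returns 0 with probability 1/D.

```python
import random

def split(B, h):
    """Sketch of Split(B, ell): start a new block at every position j with
    h(B[j], B[j+1]) == 0; the blocks concatenate back to B.  Positions
    j = 1, ..., |B|-2 are tested, matching the |B| - 2 hash evaluations."""
    if len(B) < 2:
        return [B]
    blocks, start = [], 0
    for j in range(1, len(B) - 1):
        if h(B[j], B[j + 1]) == 0:
            blocks.append(B[start:j])
            start = j
    blocks.append(B[start:])
    return blocks

D = 4
rng = random.Random(3)
table = {}
h = lambda a, b: table.setdefault((a, b), rng.randrange(D))  # lazy random H_ell
parts = split("abracadabra", h)
print(parts)
assert ''.join(parts) == "abracadabra"
```

Because h depends only on the local pair of symbols, two strings that agree on a window are split at the same positions inside it, which is the local consistency the analysis exploits.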

Algorithm 4
Grammar(B, (D_1, D_2, . . ., D_ℓ), ℓ)
Input: String B ∈ Γ^*, a sequence of partial grammars D_i over Γ for decompressing B.
Output: The smallest grammar G for B based on the grammars D_i.
Let C = {c ∈ Σ_c : c appears in B or r_{c,r} appears in B for some r}. // Symbols needed to decompress B
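The rule-collection step can be sketched as a reachability scan over the partial grammars; the rule representation below (pairs as tuples, runs as ('run', a, r), plain characters as terminals) is an assumption.

```python
def grammar_for_block(B, dictionaries):
    """Sketch of Grammar(B, (D_1, ..., D_ell), ell): collect exactly the rules
    reachable from the symbols of B, so the resulting grammar contains no
    useless rules, and add the starting rule '#' -> B."""
    rules = {}
    for D in dictionaries:
        rules.update(D)
    needed, stack = {}, list(B)
    while stack:
        c = stack.pop()
        if c in rules and c not in needed:
            needed[c] = rules[c]
            rhs = rules[c]
            # A pair rule pushes both right-hand-side symbols; a run rule
            # only needs its repeated symbol.
            stack += list(rhs) if rhs[0] != 'run' else [rhs[1]]
    needed['#'] = tuple(B)
    return needed

D1 = {'P': ('a', 'b'), 'Q': ('run', 'c', 4), 'unused': ('z', 'z')}
G = grammar_for_block(['P', 'Q'], [D1])
print(sorted(G))   # ['#', 'P', 'Q'] -- the unused rule is dropped
```

Pruning to reachable rules is what keeps each block's grammar at size O(k) even though the shared dictionaries may be larger.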


The running time of the comparison algorithm is O(k^2): The Hamming mismatch information can be recovered in time O(kM) = O(k^2) (Proposition 2.2), then we build the at most k mismatched grammars in time O(k^2), and run the edit distance computation on the pairs of grammars in time Σ_{i≤k} O(k + k_i^2) ≤ O(k^2), where k_i is the edit distance of the i-th pair of mismatched grammars. (We interrupt the edit distance computation if it takes more time than O(k^2), which would indicate ED(x, y) > k.)

Algorithm 5
DecompressSymbol(c, G, ℓ, t)
Input: A symbol c, a grammar G, a level ℓ, maximum output size t ≥ 2.
Output: Decompresses c if it was compressed at level ℓ. Returns at most t symbols of the decompression.
if c ∈ Σ^ℓ_c then let a, b ∈ Γ be such that c → ab ∈ G. Return ab.
if c ∈ Σ^ℓ_r then let a ∈ Γ, r ∈ N be such that c = r_{a,r}. Return a^{min(t,r)}.
Return c.

Function DecompressString(Z, G, ℓ) decompresses all level-ℓ compression symbols in a string Z ∈ Γ^* using the grammar G, and returns the resulting decompressed string.

Algorithm 6 DecompressString(Z, G, ℓ)
Input: A string Z, a grammar G, and level ℓ.
Output: Decompresses Z at level ℓ.
Y = ε.
for i = 1 to |Z| do Y = Y • DecompressSymbol(Z[i], G, ℓ, ∞).
Return Y.

Function DecompressSymbolLength(c, ℓ) returns the length of the decompression of a symbol c at level ℓ.

Algorithm 7 DecompressSymbolLength(c, ℓ)
Input: A symbol c, a level ℓ.
Output: Returns the length of the decompression of c at level ℓ.
if c ∈ Σ^ℓ_c then return 2.
if c ∈ Σ^ℓ_r then let a ∈ Γ, r ∈ N be such that c = r_{a,r}. Return r.
Return 1.

Algorithm 8 CompressWithGrammar(B, ℓ)
Input: String B over alphabet Γ, and level number ℓ.
Output: String B′′ over alphabet Γ, and set of applied rules G′.
if |B| ≤ 1 then return B, ∅.
Set G′ = ∅.
Divide B = B_1 B_2 B_3 . . . B_m into the minimum number of blocks so that each maximal subword a^r of B, for a ∈ Γ and r ≥ 2, is one of the blocks.
for each i ∈ {1, . . ., m} do
if B_i = a^r, where r ≥ 2, then Set B′_i = r_{a,r} • # and color r_{a,r} by 1 and # by 2. G′ = G′ ∪ {r_{a,r} → a^r};
end
else Set B′_i = B_i and color each symbol of B′_i according to F_CVL(B_i).
end
Set B′ = B′_1 B′_2 · · · B′_m, B′′ = ε, and i = 1.
while i < |B′| do
if B′[i + 1] = # then B′′ = B′′ • B′[i] else B′′ = B′′ • C_ℓ(B′[i, i + 1]);

Algorithm 12
UpdateActiveGrammars(AG, a)
Input: List of grammars AG = (G_1, . . ., G_t) representing a string x, and a symbol a.
Output: Updated list of grammars AG′ representing the string x • a.
// Construct a set of rules G, the initial compressed string Z_L and the splitting depth string F_L.
For i = 1, . . ., t, let # → v_i be the starting rule in G_i.
Set G = ∪_{i=1}^t G_i \ {# → v_i}.
Set Z_L = v_1 and F_L = 0 • (L + 1)^{|v_1|−1}.
For i = 2, . . ., t, set Z_L = Z_L • v_i and F_L = F_L • SplittingDepth(G_i) • (L + 1)^{|v_i|−1}.
// Perform partial decompression
for ℓ = L to 1 do Z_{ℓ−1}, F_{ℓ−1}, u_ℓ, r_ℓ = PartiallyDecompress(Z_ℓ, F_ℓ, ℓ). end
// Perform re-compression
Z_0 = Z_0 • a; B = Split(Z_0, 0);
for ℓ = 1 to L do
B′, G′ = Recompress(B, Z_ℓ, F_ℓ, u_ℓ, r_ℓ, |Z_{ℓ−1}|, ℓ)
G = G ∪ G′
B = B′
end
Let B = (B_1, . . ., B_{t′}). AG′ = ().

A grammar G over Γ consists of rules of the form c → ab or c → a^r, where the symbol c is the left-hand side of the rule, and ab or a^r is the right-hand side of the rule. # is the starting symbol. The size |G| of the grammar is the number of rules in G. We only consider grammars where each a ∈ Γ ∪ {#} appears on the left-hand side of at most one rule of G; we call such grammars deterministic. (We assume that rules of the form c → a^r are stored in implicit (compressed) form.) The string eval(G) is the string from Σ^* obtained from # by iterative rewriting of the intermediate results by the rules from G.
If the rewriting process never stops, or stops with a string not from Σ^*, eval(G) is undefined. Observe that we can replace each rule of the type c → a^r by a collection of at most 2⌈log r⌉ new rules of the other type using some auxiliary symbols. Hence, for each grammar G there is another grammar G′ using only the first type of rules such that eval(G) = eval(G′) and |G′| ≤ |G| • 2⌈log |eval(G)|⌉. Using a depth-first traversal of a deterministic grammar G we can calculate its evaluation size |eval(G)| in time O(|G|). Given a deterministic grammar G and an integer m less than or equal to its evaluation size, we can construct in time O(|G|) another grammar G′ of size O(|G|) such that eval(G′) = eval(G)[m, ...]; G′ will use some new auxiliary symbols.

Proposition 2.1. There is an algorithm that on input of two grammars G_x and G_y of size at most m computes the edit distance k of eval(G_x) and eval(G_y) in time O((m + k^2) • poly(log m + n)), where n = |eval(G_x)| + |eval(G_y)|.

The Hamming distance of x and y is Ham(x, y) = |MIS(x, y)|. There exist various sketches for Hamming distance, which allow one to compute the Hamming distance with low error probability [KOR98, FIM+06]. Moreover, [PL07, CKP19] also allow one to retrieve the mismatch information. For our purposes we will use the sketch given by Clifford, Kociumaka, and Porat [CKP19]. Let k ≤ n be integers and p ≥ n^3 be a prime. [CKP19] give a randomized sketch for Hamming distance sk^Ham_{n,k,p} : {1, . . ., p − 1}^* → {0, . . ., p − 1}^{k+4} computable in time O(n) with the following properties.
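The evaluation and the O(|G|) size computation can be sketched as follows; the preliminaries define rules c → ab and c → a^r with starting symbol #, and the tuple representation below is an assumption.

```python
# A deterministic grammar with rules c -> ab (tuples of symbols) and
# c -> ('run', a, r); '#' is the starting symbol.  Representation is an
# assumption of this sketch.
G = {'#': ('A', 'B'), 'A': ('a', 'B'), 'B': ('run', 'b', 3)}

def evaluate(G, sym='#'):
    """eval(G): rewrite from '#' until only terminal symbols remain."""
    rhs = G.get(sym)
    if rhs is None:
        return sym                       # terminal symbol
    if rhs[0] == 'run':
        return rhs[1] * rhs[2]           # rule c -> a^r
    return ''.join(evaluate(G, s) for s in rhs)

def eval_size(G, sym='#', memo=None):
    """|eval(G)| without expanding: depth-first traversal with memoization,
    O(|G|) time since every rule is solved once."""
    memo = {} if memo is None else memo
    if sym in memo:
        return memo[sym]
    rhs = G.get(sym)
    if rhs is None:
        size = 1
    elif rhs[0] == 'run':
        size = rhs[2]
    else:
        size = sum(eval_size(G, s, memo) for s in rhs)
    memo[sym] = size
    return size

print(evaluate(G))    # abbbbbb
print(eval_size(G))   # 7
```

The memoized size computation is what makes operations such as cutting eval(G)[m, ...] possible in time O(|G|) rather than |eval(G)|.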