Generic Non-recursive Suffix Array Construction

The suffix array is arguably one of the most important data structures in sequence analysis, and consequently there is a multitude of suffix sorting algorithms. However, to date, the GSACA algorithm introduced in 2015 is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its interesting theoretical properties, there has been little effort in improving GSACA's non-competitive real-world performance. There is, however, the super-linear algorithm DSH, which relies on the same sorting principle and is faster than DivSufSort, the fastest SACA for over a decade. The purpose of this article is twofold: We analyse the sorting principle used in GSACA and DSH and exploit its properties to give an optimised linear-time algorithm, and we show that it can be used very elegantly to compute both the original extended Burrows-Wheeler transform (eBWT) and a bijective version of the Burrows-Wheeler transform (BBWT) in linear time. We call the algorithm "generic," since it can be used to compute the regular suffix array as well as the variants used for the BBWT and eBWT. Our suffix array construction algorithm is not only significantly faster than GSACA but also outperforms DivSufSort and DSH. Our BBWT-algorithm is faster than or competitive with all other tested BBWT construction implementations on large or repetitive data, and our eBWT-algorithm is faster than all other programs on data that is not extremely repetitive.


INTRODUCTION
The suffix array contains the indices of all suffixes of a string arranged in lexicographical order. It is arguably one of the most important data structures in stringology, the study of algorithms on strings and sequences. It was introduced in 1990 by Manber and Myers [1990] for on-line string searches and has since been adopted in a wide range of applications, including text indexing and compression [Ohlebusch 2013]. Although the suffix array is conceptually very simple, constructing it efficiently is not a trivial task.
When n is the length of the input text, the suffix array can be constructed in O(n) time and O(1) additional words of working space when the alphabet is linearly sortable (i.e., the symbols in the string can be sorted in O(n) time) [Goto 2019; Li et al. 2022; Nong 2013]. However, algorithms with these bounds have historically not always been the fastest in practice. For instance, DivSufSort has been the fastest suffix array construction algorithm (SACA) for over a decade despite having a super-linear worst-case time complexity [Bertram et al. 2021; Fischer and Kurpicz 2017].
To the best of our knowledge, the currently fastest suffix sorter in practice is libsais, which appeared as source code in February 2021 on GitHub and has not been subject to peer review in any academic context. The author claims that libsais is an improved implementation of the SA-IS algorithm and hence has linear time complexity [Nong et al. 2009].
The only non-recursive linear-time suffix sorting algorithm, GSACA, was introduced in 2015 by Baier [2015] and is not competitive, neither in terms of speed nor in the amount of memory consumed [Baier 2015, 2016]. Generally, GSACA employs a kind of grouping principle, i.e., suffixes are assigned to groups that are refined until the suffix array emerges. Despite the algorithm's entirely novel approach and interesting theoretical properties [Franek et al. 2017], there has been little effort in optimising it. In 2021, Bertram et al. [2021] provided DSH, a much faster SACA using the same sorting principle as GSACA. Their algorithm beats DivSufSort in terms of speed, but has a super-linear time complexity.
A data structure closely linked to the suffix array is the Burrows-Wheeler transform (BWT), introduced by Burrows and Wheeler in 1994 [Burrows and Wheeler 1994]. The ith symbol of the BWT of a text S is the last character of the ith lexicographically smallest conjugate of S. Notably, S can be restored from its BWT [Burrows and Wheeler 1994], but the BWT is usually easier to compress, and to date some of the best compression algorithms make use of the BWT or one of its variants [Baier 2021]. From a theoretical point of view, the BWT is slightly unsatisfactory, since it is not a bijective transformation; that is, for the BWT of S there might be several other strings that have the same BWT. Consequently, one has to have additional information (e.g., which position in the BWT corresponds to the first/last position in S) or make assumptions about S (e.g., that S is null-terminated) to reverse the transformation. In 2007, Scott discovered a bijective variant of the BWT (the bijective Burrows-Wheeler transform (BBWT), sometimes also called "BWT Scottified" or BWTS for short) [Gil and Scott 2012]. The ith symbol of the BBWT is the last character of the ith string in the list of all conjugates of all Lyndon factors of S sorted in infinite periodic order (see Definition 2.3) [Gil and Scott 2012; Kufleitner 2009].

Mantaci et al. [2005] introduced the extended BWT (eBWT), an extension of the BWT in the sense that it is a BWT for a set M of primitive (i.e., non-periodic) strings, for which Hon et al. [2012] gave an O(n log n) construction algorithm. Similar to the BBWT, the eBWT consists of the last characters of the conjugates of the strings in M arranged in infinite periodic order. The approach of Hon et al. [2012] is to compute what we call the generalised circular suffix array (GSA•) such that the ith entry of GSA• represents the ith smallest such conjugate. Bonomo et al. [2014] showed that it is possible to reduce the problem of computing the extended BWT to computing the BBWT (in linear time) and gave an O(n log n/log log n) algorithm for constructing the BBWT. Similar to Hon et al. [2012]'s eBWT-algorithm, they compute the circular suffix array (SA•), where the ith entry represents the ith smallest conjugate [Bonomo et al. 2014].
More recently, Bannai et al. [2021] showed that the SA-IS algorithm can be modified to compute SA• in linear time. Boucher et al. [2021] then extended the notion of the eBWT to multisets of general strings (i.e., without the restriction to primitive strings made in Mantaci et al. [2005]) and simplified the algorithm of Bannai et al. [2021]. They provide two implementations using their algorithm, cais and PFP-eBWT. The former computes GSA• of the string collection and derives the eBWT from it, while the latter first applies a variation of the preprocessing technique prefix-free parsing (PFP) [Boucher et al. 2019], applies cais to the parse, and then derives the eBWT from the result and the lexicographically sorted dictionary [Boucher et al. 2021]. Note that their implementations only work correctly on multisets of primitive strings. Also note that the term "extended Burrows-Wheeler Transform" has not exclusively been used to refer to the "original" eBWT as defined by Mantaci et al. [2005], but generally to BWT-variants for collections of strings. Those other variants all (implicitly or explicitly) append terminator symbols to the input strings and thus their output differs from the original eBWT. Moreover, for most variants, the order in which the input strings are given influences the BWT [Cenzato and Lipták 2022]. We only consider the original eBWT as defined by Mantaci et al. [2005].
Our Contributions. This article extends our previous work on optimising the GSACA suffix array construction algorithm, published in Olbrich et al. [2022]. Specifically, we show that small changes to our algorithm are sufficient to compute the bijective Burrows-Wheeler transform or the extended Burrows-Wheeler transform instead of the suffix array.

An important intermediate state of the GSACA algorithm is the Lyndon grouping, in which the suffixes are sorted and grouped according to their respective Lyndon prefixes. Our contributions are threefold.
First, we show that the BBWT can be derived from the Lyndon grouping in the same way as the suffix array, and thereby obtain a linear-time BBWT construction algorithm.
Second, we show that a slight change in the initialisation of our BBWT algorithm is sufficient to compute the eBWT instead (even for non-primitive input strings). Notably, this is possible without explicitly sorting the input strings (as would be required for transforming an arbitrary BBWT-algorithm into an algorithm for the eBWT [Bonomo et al. 2014]).
Third, we provide several techniques that significantly improve the performance of Baier's algorithms for computing the Lyndon grouping and deriving the suffix array, BBWT or eBWT from it. Our resulting linear-time SACA is faster than GSACA and DSH, which employ the same sorting principle but do not exploit certain properties of Lyndon words. Specifically, on real-world text, our SACA implementation is more than 25% faster than DSH and more than 65% faster than Baier's GSACA implementation. Although it is not on par with libsais on real-world data, it significantly improves Baier's sorting principle and positively answers the question, posed in Bille et al. [2020], of whether the precomputed Lyndon array can be used to accelerate GSACA. Our BBWT construction program is significantly faster than the other linear-time BBWT construction programs we are aware of, and faster on some of our test data than the previously fastest program (which has quadratic worst-case time complexity). Our eBWT-algorithm is also significantly faster than "the only tool up to date that computes the eBWT according to the original definition" [Cenzato and Lipták 2022] on data that is not extremely repetitive.
The rest of this article is structured as follows: Section 2 introduces the definitions and notations used throughout this article. In Section 3, the grouping principle is investigated and a description of our algorithms is provided. In Section 4, our algorithms are evaluated experimentally and compared to other relevant suffix array, BBWT and eBWT construction algorithms. Finally, Section 5 concludes this article and provides an outlook on possible future research.

PRELIMINARIES
For i, j ∈ N_0, we denote the set {k ∈ N_0 : i ≤ k ≤ j} by the interval notations [i .. j] = [i .. j + 1) = (i − 1 .. j] = (i − 1 .. j + 1). For an array A, we analogously denote the subarray from i to j by A[i .. j]. We use 0-based indexing, i.e., the first entry of the array A is A[0].
A string S of length n over an alphabet Σ is a sequence of n characters from Σ. We denote the length n of S by |S| and the ith symbol of S by S[i − 1], i.e., strings are zero-indexed. In this article, we assume any string S of length n to be over a totally ordered and linearly sortable alphabet (i.e., we can sort the characters in S in O(n) time). Analogous to arrays, we denote the substring from i to j by S[i .. j]; for i > j, we let S[i .. j] be the empty string ε. For two strings u and v and an integer k ≥ 0, we let uv be the concatenation of u and v and denote the k-times concatenation of u by u^k.
A string S is primitive if it is non-periodic, i.e., S = w^k implies w = S and k = 1. For any string S there is a unique primitive string w and a unique integer k such that S = w^k. We call w and k the root and period of S and denote them by root(S) and period(S), respectively.
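Root and period can be computed in linear time with a standard border-array (failure-function) computation; the following is a minimal sketch (the function name is ours, not from the article):

    #include <string>
    #include <utility>
    #include <vector>

    // Returns (root(S), period(S)) via the classic border-array trick: the
    // shortest period candidate is n - b, where b is the length of the longest
    // proper border of S; it is a true period if and only if it divides n.
    std::pair<std::string, int> root_and_period(const std::string& S) {
        const int n = S.size();
        std::vector<int> border(n, 0); // border[i]: longest proper border of S[0..i]
        for (int i = 1, b = 0; i < n; ++i) {
            while (b > 0 && S[i] != S[b]) b = border[b - 1];
            if (S[i] == S[b]) ++b;
            border[i] = b;
        }
        int p = n - border[n - 1];
        if (n % p != 0) p = n;          // S is primitive in this case
        return {S.substr(0, p), n / p}; // S = root^period
    }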
The suffix i of a string S of length n is the substring S[i .. n) and is denoted by S_i. In the rest of this article, we use S = decedacebceece$ as our running example. We have, for instance, S_1 = ecedacebceece$ = S[1]S_2.
We assume totally ordered alphabets. This induces a total order on strings. Specifically, we say a string S of length n is lexicographically smaller than another string S′ of length m if and only if there is some ℓ ≤ min{n, m} such that S[0 .. ℓ) = S′[0 .. ℓ) and either n = ℓ < m or S[ℓ] < S′[ℓ]. If S is lexicographically smaller than S′, then we write S <_lex S′.
The suffix array SA of S is an array of length n that contains the indices of the suffixes of S in increasing lexicographical order. That is, SA forms a permutation of [0 .. n) and S_{SA[0]} <_lex S_{SA[1]} <_lex · · · <_lex S_{SA[n−1]}.

Definition 2.1 (pss-tree [Bille et al. 2020]). Let pss be the array such that pss[i] is the index of the previous smaller suffix for each i ∈ [0 .. n), i.e., pss[i] := max{j ∈ [−1 .. i) : S_j <_lex S_i}, where S_{−1} is regarded as smaller than every other suffix.

Note that pss forms a tree with −1 as the root, in which each i ∈ [−1 .. n) is represented by a node and pss[i] is the parent of node i. We call this tree the pss-tree. Further, we impose an order on the nodes that corresponds to the order of the indices represented by the nodes. In particular, node i precedes node j among siblings if and only if i < j.

Analogous to pss[i], we define nss[i] := min{j ∈ (i .. n] : S_j <_lex S_i} as the next smaller suffix of i. Note that S_n = ε is smaller than any non-empty suffix of S, hence nss is well-defined. Figure 1 shows the suffix array, nss and the pss-tree of our running example.

Definition 2.2. Let P_i be the set of suffixes with i as next smaller suffix, that is, P_i := {j ∈ [0 .. n) : nss[j] = i}.

Fig. 1. Lyndon prefixes of all suffixes of S = decedacebceece$ and the corresponding suffix array, nss-array, pss-array, and pss-tree. Each box indicates a Lyndon prefix. For instance, the Lyndon prefix of S_9 = ceece$ is L_9 = cee. Note that L_i is exactly S[i] concatenated with the Lyndon prefixes of i's children in the pss-tree (see Lemma 3.20). For example, L_8 = S[8]L_9L_12 = bceece.
For instance, in our running example, we have P_14 = {5, 8, 12, 13} and P_12 = {9, 11}.

Definition 2.3 (Infinite Periodic Order). For the infinite periodic order, we compare the infinite concatenations of strings lexicographically. That is, for strings S and S′, we write S <_ω S′ if and only if the infinite concatenation S^∞ = SSS . . . is lexicographically smaller than the infinite concatenation S′^∞ = S′S′S′ . . .. For instance, ab <_lex aba <_lex abb and abb >_ω ab >_ω aba (since abbabb . . . >_lex abab . . . >_lex abaaba . . .). Note that S^∞ = S′^∞ holds if and only if root(S) = root(S′). Thus, <_ω is not an antisymmetric relation on strings in general (e.g., a ≤_ω aa and aa ≤_ω a but a ≠ aa). However, we will only use the infinite periodic order for comparing primitive strings (where ≤_ω is antisymmetric). Also note that the infinite periodic order is equivalent to the lexicographical order if neither of the strings in question is a prefix of the other.
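The infinite periodic order can be decided without materialising infinite strings: for non-empty strings u and v, u^∞ <_lex v^∞ holds if and only if uv <_lex vu. A one-line sketch of this standard trick (helper name ours):

    #include <string>

    // u^inf <lex v^inf  <=>  uv <lex vu (for non-empty u, v).
    bool less_omega(const std::string& u, const std::string& v) {
        return u + v < v + u;
    }
    // E.g., less_omega("ab", "abb") is true, while less_omega("abb", "ab") and
    // less_omega("a", "aa") are both false (a and aa have equal infinite powers).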
For a string S of length n, the ith conjugate of S is defined as S[i .. n) S[0 .. i) and denoted by conj_i(S).
A non-empty string S is in its canonical form if and only if it is lexicographically minimal among its conjugates. If S is additionally strictly smaller than all of its other conjugates, then S is a Lyndon word. Equivalently, S is a Lyndon word if and only if S is lexicographically smaller than all its proper suffixes [Duval 1983].
The Lyndon prefix of S is the longest prefix of S that is a Lyndon word. We let L_i denote the Lyndon prefix of S_i. Note that a string of length one is always a Lyndon word, hence the Lyndon prefix of a non-empty string is also non-empty.

Lemma 2.4 (Lemma 15, Franek et al. [2016]). For each non-empty string S, we have L_i = S[i .. nss[i]) for all i ∈ [0 .. n); in particular, S_{nss[i]} = S_{i+|L_i|} <_lex S_i.

Theorem 2.5 (Chen-Fox-Lyndon Theorem, Chen et al. [1958]). Any non-empty string S has a unique Lyndon factorisation, that is, there is a unique sequence of lexicographically non-increasing Lyndon words (Lyndon factors) v_1 ≥_lex v_2 ≥_lex · · · ≥_lex v_k such that S = v_1 v_2 · · · v_k [Chen et al. 1958].

The Lyndon factorisation of our running example is S = de · ced · acebceece · $ (cf. Figure 1, where the outermost boxes exactly correspond to the Lyndon factors).

Definition 2.6 (Bijective Burrows-Wheeler Transform (BBWT)). The bijective Burrows-Wheeler transform (BBWT) of a string S is the string obtained by taking the last characters of the conjugates of the Lyndon factors of S arranged in infinite periodic order.

Figure 2 shows how the BBWT of our running example can be obtained.

Definition 2.7 (Extended Burrows-Wheeler Transform (eBWT)). The extended Burrows-Wheeler transform (eBWT) of a multiset M of strings is the string obtained by taking the last characters of the conjugates of the strings in M arranged in infinite periodic order.

Analogous to the original BWT, for reconstructing M from the eBWT one needs the set of indices of the strings in M in the sorted list of conjugates. Figure 3 shows how the eBWT of M = {b, bcbc, b, abcbc, bc} can be obtained.

We assume the RAM model of computation, that is, basic arithmetic operations can be performed in O(1) time on words of length O(log n) bits, where n is the size of the input. Reading and writing an entry A[i] of an array A can also be performed in constant time.
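To make Definition 2.6 and the Lyndon factorisation concrete, the following sketch computes the factorisation with Duval's algorithm and then derives the BBWT naively by sorting all conjugates of all factors in infinite periodic order. This is an O(n² log n) illustration of the definition, not the construction algorithm developed in this article:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        const std::string S = "decedacebceece$"; // running example
        const int n = S.size();

        // Duval's algorithm: start positions of the Lyndon factors.
        std::vector<int> starts;
        for (int i = 0; i < n;) {
            int j = i + 1, k = i;
            while (j < n && S[k] <= S[j]) { k = (S[k] < S[j]) ? i : k + 1; ++j; }
            while (i <= k) { starts.push_back(i); i += j - k; }
        }
        // For this S, the factors are de, ced, acebceece, $.

        // All conjugates of all factors, sorted in infinite periodic order
        // (u^inf <lex v^inf iff uv <lex vu); ties do not affect the BBWT.
        std::vector<std::string> rots;
        for (std::size_t f = 0; f < starts.size(); ++f) {
            const int b = starts[f];
            const int e = (f + 1 < starts.size()) ? starts[f + 1] : n;
            const std::string w = S.substr(b, e - b);
            for (std::size_t r = 0; r < w.size(); ++r)
                rots.push_back(w.substr(r) + w.substr(0, r));
        }
        std::sort(rots.begin(), rots.end(),
                  [](const std::string& u, const std::string& v) { return u + v < v + u; });

        std::string bbwt;
        for (const auto& r : rots) bbwt += r.back(); // last characters (Definition 2.6)
        std::cout << bbwt << '\n';
    }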

GSACA
In the following, we fix a string S of length n over a linearly sortable alphabet.
We start by giving a high-level description of the sorting principle based on grouping by Baier [2015, 2016]. Very basically, the suffixes are first assigned to ordered groups, which are then refined until the suffix array emerges. The algorithm consists of the following steps.
-Initialisation: Group the suffixes according to their first character.
-Phase I: Refine the groups until the elements in each group have the same Lyndon prefix.
-Phase II: Sort elements within groups lexicographically. We will later show that Phase II can also be used to derive the BBWT instead of SA, and that a minor change in the preconditions to this BBWT-algorithm suffices to turn it into an algorithm for the eBWT.
Definition 3.1 (Suffix Grouping, Adapted from Bertram et al. [2021]). Let S be a string of length n and SA the corresponding suffix array. A group G with group context α is a tuple ⟨g_s, g_e, |α|⟩ with group start g_s ∈ [0 .. n) and group end g_e ∈ [g_s .. n) such that the following properties hold:
(1) All suffixes in SA[g_s .. g_e] share the prefix α, i.e., for all i ∈ SA[g_s .. g_e] it holds that S_i = αS_{i+|α|}.
(2) α is a Lyndon word.
We say i is in G or i is an element of G and write i ∈ G if and only if i ∈ SA[g_s .. g_e]. A suffix grouping for S is a set of groups G_1, . . ., G_m, where the groups are pairwise disjoint and cover the entire suffix array. G_i is a lower (higher) group than G_j if and only if i < j (i > j). If all elements in a group G with context α have α as their Lyndon prefix, then G is a Lyndon group. If G is not a Lyndon group, then it is called preliminary. Furthermore, a suffix grouping is Lyndon if all its groups are Lyndon groups, and preliminary otherwise.
Note that, by definition, any suffix grouping constitutes a partial order consistent with the lexicographical order. That is, for groups G_i, G_j from a suffix grouping with i < j, we have S_p <_lex S_q for all p ∈ G_i and q ∈ G_j.
To see why the notion of Lyndon groups is useful, consider the following two lemmata:

Lemma 3.2. For strings wu and wv over Σ with u <_lex wu and v >_lex wv, we have wu <_lex wv.
Proof. Note that there is no j ≥ 1 such that wv = w^j, since otherwise v would be a prefix of wv and thus v <_lex wv. Let k ≥ 1 and ℓ ∈ [0 .. |w|) be maximal such that w^k w[0 .. ℓ) is a prefix of wv, and let p = k|w| + ℓ. Then p < |wv| (otherwise v = w^{k−1}w[0 .. ℓ) would be a proper prefix of wv and thus v <_lex wv), and we have (wv)[p] > w[ℓ], since (wv)[p] < w[ℓ] would imply wv <_lex w(wv) and thus v <_lex wv. We distinguish two cases:
-There is some j ≥ 1 such that wu = w^j. If j|w| ≤ p, then wu is a proper prefix of wv and hence wu <_lex wv. Otherwise, the first differing symbol of wu and wv is at index p, and we have (wu)[p] = w[ℓ] < (wv)[p].
-There is no such j. Then, by the argument above applied to u <_lex wu, either wu is a prefix of w^∞, or wu has its first mismatch with w^∞ at some index q with (wu)[q] < w^∞[q]. In both cases, either wu is a proper prefix of wv, or the first differing symbol of wu and wv is at index min{p, q}, where the symbol of wu is smaller.
In all cases, the claim follows.
Lemma 3.3. For any i, j, L_i <_lex L_j implies S_i <_lex S_j.
Proof. Assume L_i is a proper prefix of L_j; otherwise there is a mismatching position and the claim follows immediately. By Lemma 2.4, we have nss[i] = i + |L_i| and thus S_{i+|L_i|} <_lex S_i. Moreover, since |L_i| < |L_j| and nss[j] = j + |L_j|, the definition of nss implies S_{j+|L_i|} >_lex S_j. Hence, applying Lemma 3.2 with w = L_i, u = S_{i+|L_i|}, and v = S_{j+|L_i|} yields S_i = wu <_lex wv = S_j.

That is, sorting the suffixes according to their Lyndon prefixes results in a valid partial order and thus a suffix grouping. Intuitively, we can derive the suffix array from a Lyndon grouping using a kind of induced copying, because we have S_{i+|L_i|} <_lex S_i for each i (Lemma 2.4).
With these notions, a suffix grouping is created in the initialisation, which is then refined in Phase I until it is a Lyndon grouping, and further refined in Phase II until the suffix array emerges. Figure 4 shows a Lyndon grouping with contexts of our running example.
We first deal with Phase II, since it is much less technical than Phase I. In Section 3.1, we recapitulate how Baier [2015] derives the suffix array from a Lyndon grouping, and in Section 3.2, we show how this algorithm can be modified to produce the BBWT instead. Section 3.3 then shows how these two almost identical algorithms can be optimised. In Section 3.4, we explain how a Lyndon grouping can be computed and describe our improvements over Baier's Phase I. Section 3.5 describes how the data structures needed for Phase I are set up. Finally, Section 3.6 shows that only a slight change in these initial data structures for our BBWT-algorithm is sufficient for it to compute the eBWT instead.

As an example for the refinement in Phase II, consider the Lyndon group G_4 = ⟨3, 4, |ce|⟩ of Figure 4, which contains the suffixes 6 and 12. We consider 6 + |ce| = 8 and 12 + |ce| = 14. The group containing 14 is lower than the group containing 8, hence S_12 is lexicographically smaller than S_6. Thus, we know that SA[3] = 12, remove 12 from G_4, and repeat the same process with the emerging group G′_4 = ⟨4, 4, |ce|⟩. As 6 is the only element of G′_4, we know that SA[4] = 6.

Phase II
If we refine the groups in lexicographically increasing order (lower to higher) as just described, then each time a group G is processed, all groups lower than G are singletons. However, sorting groups in such a way leads to a super-linear time complexity. Bertram et al. [2021] provide a fast-in-practice O(n log n) algorithm for this, broadly following the described approach.
To get a linear time complexity, Baier turns this approach on its head [Baier 2015, 2016]: Instead of repeatedly finding the next smaller suffix in a group, we consider the suffixes in lexicographically increasing order and, for each encountered suffix i, we move all suffixes that have i as the next smaller suffix (i.e., those in P_i) to new singleton groups immediately preceding their respective old groups as described above.
However, for this to work, we first need to find the smallest suffix. This is simply solved by assuming w.l.o.g. that S is null-terminated (i.e., the last character of S is smaller than all other characters in S) and thus that the lexicographically smallest group is known to be the singleton group containing SA[0] = n − 1.
For the correctness of this algorithm, we need two more properties: First, before iteration i (0-based), SA[i] must be known, and second, the procedure of inserting the elements in P_i must be well-defined.
For the former, assume that we want to process SA[j]. Because S_{nss[SA[j]]} occurs by definition before S_{SA[j]} in SA, we must have inserted P_{nss[SA[j]]} already (with SA[j] ∈ P_{nss[SA[j]]} by definition). The second property is implied by the following Corollary 3.5. (Intuitively, all suffixes in P_i have different Lyndon prefixes, because those Lyndon prefixes start at different indices but end at the same index i, hence they must be in different Lyndon groups.)

Lemma 3.4. For any j, j′ ∈ P_i, we have L_j ≠ L_{j′} if and only if j ≠ j′.
Proof. Let j, j′ ∈ P_i and j ≠ j′. By definition of P_i, we have nss[j] = nss[j′] = i. Since L_j = S[j .. i) and L_{j′} = S[j′ .. i) (Lemma 2.4), L_j and L_{j′} have different lengths, implying the claim.

Corollary 3.5. In a Lyndon grouping, the elements of P_i are in different groups.
Accordingly, Algorithm 1 correctly computes the suffix array from a Lyndon grouping. A formal proof of correctness is given in Baier [2015, 2016]. Figure 5 shows Algorithm 1 applied to our running example. Note that Corollary 3.5 also implies that the order in which we consider the j ∈ P_{A[i]} in Algorithm 1 has no influence on the correctness. A concrete implementation of Algorithm 1 is provided in Section 3.3.
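For concreteness, the following self-contained sketch executes the loop of Algorithm 1 on the running example. The Lyndon grouping is built naively here (in quadratic time), since its efficient construction is the subject of Phase I (Section 3.4); variable names are ours:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        const std::string S = "decedacebceece$"; // running example, null-terminated
        const int n = S.size();
        auto suffix_less = [&](int a, int b) {
            return S.compare(a, n - a, S, b, n - b) < 0; // S_a <lex S_b
        };

        // nss[i] = min j > i with S_j <lex S_i (S_n is the empty string).
        std::vector<int> nss(n, n);
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j <= n; ++j)
                if (j == n || suffix_less(j, i)) { nss[i] = j; break; }

        // Lyndon grouping: group suffixes by L_i = S[i..nss[i]) (Lemma 2.4),
        // groups ordered lexicographically by their contexts.
        auto lyn = [&](int i) { return S.substr(i, nss[i] - i); };
        std::vector<int> order(n);
        for (int i = 0; i < n; ++i) order[i] = i;
        std::stable_sort(order.begin(), order.end(),
                         [&](int a, int b) { return lyn(a) < lyn(b); });
        std::vector<int> group(n), gstart(n);
        for (int k = 0, g = -1; k < n; ++k) {
            if (k == 0 || lyn(order[k]) != lyn(order[k - 1])) gstart[++g] = k;
            group[order[k]] = g;
        }

        // P[i] = { j : nss[j] == i }.
        std::vector<std::vector<int>> P(n + 1);
        for (int j = 0; j < n; ++j) P[nss[j]].push_back(j);

        // Algorithm 1: scan A left to right; upon reaching i, move each j in
        // P_i into a new singleton group at the start of j's old group.
        std::vector<int> A(n, -1);
        A[0] = n - 1; // the null terminator is the smallest suffix
        for (int k = 0; k < n; ++k)
            for (int j : P[A[k]])
                A[gstart[group[j]]++] = j;

        for (int x : A) std::cout << x << ' '; // prints the suffix array of S
        std::cout << '\n';
    }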

Deriving the Bijective Burrows-Wheeler Transform
In this section, we show how the algorithm from the previous section can be altered to derive the BBWT instead of SA from the final Lyndon grouping.
Unlike in the previous section, we do not assume that S is necessarily null-terminated. Furthermore, here we assume that the Lyndon grouping we are starting from has the minimum number of groups. That is, we require that suffixes are in the same group if and only if they have the same Lyndon prefix. (For deriving the suffix array as described in the previous section it suffices that the grouping is Lyndon, but there may be several Lyndon groups with the same context.) Fortunately, we obtain exactly such a minimum Lyndon grouping from Phase I as described in Section 3.4 (or Baier's implementation of Phase I [Baier 2015, 2016]).

ALGORITHM 1: Phase II of GSACA [Baier 2015, 2016]. After execution, the array A is the suffix array.
    A[0] ← n − 1;
    for i ← 0 to n − 1 do
        foreach j ∈ P_{A[i]} do
            let k be the start of the group containing j;
            remove j from its current group and put it into a new group ⟨k, k, |L_j|⟩ immediately preceding j's old group;
            A[k] ← j;
        end
    end
Instead of following the procedure implied by Definition 2.6 literally, we follow the approach from Bannai et al. [2021] in that we compute the circular suffix array from which the BBWT can be derived similarly to how the ordinary BWT can be derived from the suffix array.
Let v_1, . . ., v_k be the Lyndon factorisation of S and let L be the set of positions where a Lyndon factor starts in S, i.e., L := {|v_1 · · · v_{m−1}| : m ∈ [1 .. k]}. The following definition provides a natural bijection between the (multi)set of infinite concatenations of conjugates of Lyndon factors and the indices in [0 .. n).

Definition 3.6. For i ∈ [0 .. n), let C_i := S[i .. nss[p]) L_p^∞, where p = max{q ∈ L : q ≤ i}, L_p is the Lyndon prefix of S_p and L_p^∞ is its infinite concatenation.

Note that for i with p ∈ L and p ≤ i < nss[p], we have C_i = conj_{i−p}(L_p)^∞. For instance, in our running example (cf. Figures 1 and 2), we have L_12 = ce and 5 ∈ L with 5 ≤ 12 < nss[5] = 14. Thus, C_12 = ceacebceece . . . = conj_{12−5}(acebceece)^∞. The following definitions introduce the order that induces the aforementioned permutation.

Definition 3.7. We write i <_inf j if and only if C_i <_lex C_j, or C_i = C_j and i > j.

Definition 3.8 (Circular Suffix Array SA•). The circular suffix array SA• of S is the permutation of [0 .. n) satisfying SA•[0] <_inf SA•[1] <_inf · · · <_inf SA•[n − 1].

Reading the numbers in the second column in Figure 2 gives SA• for our running example. For i ∈ [0 .. n), we have

    BBWT[i] = S[nss[SA•[i]] − 1]   if SA•[i] ∈ L, and
    BBWT[i] = S[SA•[i] − 1]        otherwise.

That is, character i of the BBWT is the character preceding the character at index SA•[i] when we consider the Lyndon factors to be independently circular strings, i.e., the character preceding the first character of a Lyndon factor is the last character of that Lyndon factor (cf. Figure 2).
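A sketch of this derivation of the BBWT from SA• (helper and array names are ours; starts holds the set L in increasing order):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Derive the BBWT from SA* (Definition 3.8): each output character is the
    // cyclic predecessor of position SA*[i] within its Lyndon factor.
    std::string bbwt_from_sa_circ(const std::string& S,
                                  const std::vector<int>& sa_circ,
                                  const std::vector<int>& starts) {
        const int n = S.size();
        std::vector<int> fac_end(n);        // end of the factor containing i
        std::vector<bool> is_start(n, false);
        for (std::size_t f = 0; f < starts.size(); ++f) {
            const int b = starts[f];
            const int e = (f + 1 < starts.size()) ? starts[f + 1] : n;
            is_start[b] = true;
            for (int i = b; i < e; ++i) fac_end[i] = e;
        }
        std::string bbwt(n, '\0');
        for (int i = 0; i < n; ++i) {
            const int p = sa_circ[i];
            // Last character of the factor if p is a factor start,
            // otherwise simply the preceding character.
            bbwt[i] = is_start[p] ? S[fac_end[p] - 1] : S[p - 1];
        }
        return bbwt;
    }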
To establish the connection between GSACA and the BBWT, we now define the corresponding analogue to the nss-pointer.

Definition 3.9 (nsc). For i ∈ [0 .. n) with p = max{q ∈ L : q ≤ i}, define nsc(i) := nss[i] if nss[i] < nss[p], and nsc(i) := p otherwise.

That is, for i ∉ L, nsc(i) points to the next smaller conjugate of the Lyndon factor i belongs to (cf. Lemma 3.12).
In Phase II of GSACA the indices are sorted according to the corresponding suffix, which consists of Lyndon prefixes arranged in (lexicographically) non-increasing order. That is, for i ∈ [0 .. n), we sort according to S_i = L_i L_{nss[i]} L_{nss[nss[i]]} · · ·. To construct SA• instead of SA, we sort according to C_i = L_i L_{nsc(i)} L_{nsc(nsc(i))} · · ·. For employing Phase II of GSACA, we need to establish that

-the final Lyndon grouping provides a valid partial order, and
-nsc(i) ≤_inf i holds for all i.

Note that the latter point and Definition 3.9 imply that each nsc-chain eventually reaches some p with nsc(p) = p. From Definition 3.9, we can furthermore deduce that this p must be in L, i.e., the index of a Lyndon factor. Therefore, to adapt Phase II of GSACA, we need to be able to determine the positions of the Lyndon factors in SA• beforehand, analogously to how (in Phase II of GSACA) we need to know the position of n − 1 in SA, since any nss-chain eventually reaches n − 1 (recall that we assumed the text to be null-terminated).
Lemma 3.10. For all i ∈ [0 .. n), we have L_{nsc(i)} ≤_lex L_i.

Proof. Follows immediately from Definition 3.9 and Lemma 3.2.
The following lemma implies that a Lyndon grouping actually constitutes a valid partial order according to <_inf.

Lemma 3.11. For any i, j, L_i <_lex L_j implies i <_inf j.

Proof. First note that C_i = L_i C_{nsc(i)} by Definitions 3.6 and 3.9, and that L_{nsc(i)} ≤_lex L_i by Lemma 3.10. By induction on the minimum number of applications of nsc to i that gives p_i = max{q ∈ L : q ≤ i}, i.e., min{k ≥ 0 : nsc^k(i) = p_i}, it follows that C_i is a concatenation of lexicographically non-increasing Lyndon words starting with L_i, and hence C_i ≤_lex L_i^∞. If L_i is not a prefix of L_j, then the first mismatch between L_i and L_j is also a mismatch between C_i and C_j, and the claim follows immediately. Hence, assume L_j = L_i v for some non-empty string v, and consider the smallest index k where there is a mismatch between L_j and L_i^∞, i.e., L_j[k] ≠ L_i^∞[k]. (L_j cannot be a prefix of L_i^∞ and therefore k must exist: the longest common prefix of L_j and L_i^∞ can be factored into m ≥ 0 repetitions of L_i followed by a proper (possibly empty) prefix of L_i; if this covered all of L_j, then the proper suffix of L_j starting at position |L_i| would be a prefix of L_j and thus lexicographically smaller than L_j, contradicting L_j being a Lyndon word.) If we had L_j[k] < L_i^∞[k], then the proper suffix of L_j starting at position m|L_i| would be lexicographically smaller than L_j, again a contradiction. Hence L_j[k] > L_i^∞[k], which implies C_j >_lex L_i^∞ ≥_lex C_i and thus the claim.

With the help of the previous lemma, we are now in a position to show that nsc(i) actually is the next smaller conjugate of i.

Lemma 3.12. For all i ∈ [0 .. n) \ L, we have nsc(i) <_inf i.
Proof. Note that we have i ≠ nsc(i) and S_{nsc(i)} <_lex S_i. Let p = max{q ∈ L : q ≤ i} and let ℓ be the minimum number of times nsc has to be applied to nsc(i) to obtain p, i.e., ℓ = min{k ≥ 0 : nsc^k(nsc(i)) = p}. If L_{nsc^k(i)} = L_i for all k ∈ [0 .. ℓ + 1], then C_i = C_{nsc(i)} = L_p^∞; moreover, in this case nsc(i) = nss[i] > i (nsc(i) = p together with L_i = L_p would imply conj_{i−p}(L_p) = L_p and thus i = p ∈ L), so the claim follows from the tie-breaking rule of Definition 3.7. Otherwise, let k be minimal with L_{nsc^{k+1}(i)} <_lex L_{nsc^k(i)} (by Lemma 3.10, the Lyndon prefixes are non-increasing along the nsc-chain). With w := L_i and X := C_{nsc^{k+1}(i)}, we then have C_i = w^{k+1} X and C_{nsc(i)} = w^k X. By the argument from the proof of Lemma 3.11, X <_lex w^∞, which implies X <_lex wX and therefore C_{nsc(i)} <_lex C_i, and thus the claim.

Now all except one of the aforementioned requirements for applying the sorting principle from Phase II of GSACA to SA• are shown to be satisfied. The missing one regards the positions of the Lyndon factors in SA• and is given by the following lemma.

Lemma 3.13. For i ∈ L and j ∉ L with L_i = L_j, we have j <_inf i.
Proof. Note that j <_inf i holds if and only if nsc(j) <_inf nsc(i), and that nsc(i) = i by Definition 3.9. If L_{nsc(j)} ≠ L_{nsc(i)}, then L_{nsc(j)} <_lex L_{nsc(i)} by Lemma 3.10, and hence the claim follows by Lemma 3.11. Otherwise, the claim follows by induction on the number of times nsc has to be applied to j to reach an element in L (whose Lyndon prefix is then guaranteed to be not equal to L_j).

Lemma 3.13 immediately implies that the Lyndon factors are last in their respective Lyndon groups. If there are multiple equal Lyndon factors, then Definition 3.7 also gives the relative order of the corresponding conjugates within a Lyndon group. Note that this part of the definition is arbitrary, as the relative order of equal conjugates of Lyndon factors has no effect on the BBWT (cf. Definition 2.6).
To find the positions of all other elements in linear time, we proceed almost exactly as in Section 3.1. That is, we iterate from left to right over SA• and, upon encountering some i, we insert all j that have i as next smaller conjugate (i.e., nsc(j) = i) at the current start of their respective groups. Intuitively, this is correct because nsc(i) comes before i in SA• for any i whose index in SA• is not yet known (i.e., those not in L, cf. Lemma 3.12), analogously to how nss[i] comes before i in SA for any i ≠ n − 1.
Analogously to how P_i contains all elements that have i as next smaller suffix, we now define P̂_i as the set of elements that have i as their next smaller conjugate (excluding L, since the positions of Lyndon factors in SA• are already known).

Definition 3.14 (P̂_i). For i ∈ [0 .. n), define P̂_i := {j ∈ [0 .. n) \ L : nsc(j) = i}.

Note that P̂_i = P_i \ L for all i ∉ L by Definitions 2.2, 3.9, and 3.14. Furthermore, for all i ∈ L, we have P_{nss[i]} ∩ L = {i} and therefore P̂_i = P_{nss[i]} \ L. Definition 3.14 finally enables us to formulate a concrete algorithm; Algorithm 2 is Algorithm 1 adapted to compute SA• instead of SA, and Figure 6 shows Algorithm 2 applied to our running example.
Theorem 3.15. Algorithm 2 correctly computes SA• from a minimum Lyndon grouping.

Proof. First note that by Lemma 3.11 it suffices to correctly sort elements within Lyndon groups.
In the first for-loop, the positions of Lyndon factors of S are written to A. Note that Lyndon factors are larger (according to <_inf) than other elements within the same Lyndon group (Lemma 3.13). For i, j ∈ L with i > j, we have L_i ≤_lex L_j by definition, and thus i <_inf j by Definition 3.7 (if L_i = L_j) and Lemma 3.11 (if L_i ≠ L_j). We iterate over L in increasing order (by index) and insert the positions of Lyndon factors at the current end of their group. Hence, after the first for-loop, the Lyndon factors are correctly placed in A.
We now proceed by induction on the number of iterations of the second for-loop. Let ISA• be the inverse permutation of SA•.
After the kth iteration, the following invariants hold true:
(1) A[0 .. k] = SA•[0 .. k],
(2) every entry written to A so far equals the corresponding entry of SA•, and
(3) for all j < k and p ∈ P̂_{SA•[j]}, we have A[ISA•[p]] = p.
Note that the second invariant immediately follows from the first and the fact that entries already written to A do not change.
ALGORITHM 2: Phase II of GSACA, modified to produce SA• instead of SA. After execution, the array A is SA•.

Induction base. The lowest Lyndon group only contains Lyndon factors: assume there is some j ∉ L in the lowest Lyndon group. Then there is p ∈ L with p < j < nss[p]. By definition, we then have L_p <_lex L_j, which is a contradiction.
Therefore, after the first for-loop, we have A[0 .. k] = SA•[0 .. k], where k is the last position of the lowest Lyndon group, and the invariants hold before the first iteration of the second for-loop.
Induction step. We now insert P̂_{SA•[k]}. First note that the elements in P̂_{SA•[k]} are in different Lyndon groups (cf. Corollary 3.5), hence the order in which we process them is irrelevant. Now consider some j ∈ P̂_{SA•[k]}. To prove that j is inserted at index ISA•[j], it suffices to show that from j's Lyndon group exactly the <_inf-smaller elements were inserted during earlier iterations. Consider some j′ with L_{j′} = L_j and j′ <_inf j. This implies nsc(j′) <_inf nsc(j) = SA•[k], hence nsc(j′) = SA•[k′] for some k′ < k and, by Invariant 3, j′ has already been inserted; conversely, by the same argument, no <_inf-larger element from j's group has been inserted yet. For Invariant 1, observe that either A[k + 1] was already written (and is then correct by Invariant 2), or SA•[k + 1] ∈ P̂_{SA•[k′]} for some k′ ≤ k and is inserted at position k + 1 in the current iteration at the latest.

We will now modify Algorithm 2 such that the second for-loop is exactly the same as in Phase II of GSACA as shown in Algorithm 1.
Note that the only difference lies in the definition of the sets P_{A[i]} and P̂_{A[i]}. Recall that we have P̂_{A[i]} = P_{A[i]} \ L if A[i] ∉ L, and P̂_{A[i]} = P_{nss[A[i]]} \ L if A[i] ∈ L. As the elements in L are inserted before the second for-loop, we can drop the "set-minus L," because already inserted elements are never changed. The now only remaining difference is the processing of elements in L, where we insert P_{nss[s]} instead of P_s. By inserting nss[s] instead of s (at the end of s's group) for each s ∈ L in the initialisation, this last difference vanishes. The array computed by this modified Phase II is called the shifted circular suffix array.

Definition 3.16 (Shifted Circular Suffix Array). The shifted circular suffix array SA•′ is derived from SA• by changing each element in L to its next smaller suffix. Formally, SA•′[i] := nss[SA•[i]] if SA•[i] ∈ L, and SA•′[i] := SA•[i] otherwise.

This is the same substitution of entries by their nss-values that appears in the relation P̂_i = P_{nss[i]} \ L above. Hence, SA•′ also has an advantage compared to SA• when deriving the BBWT: there is no need to check whether an entry is in L; we simply have BBWT[i] = S[SA•′[i] − 1]. We can find nss[i] for each i ∈ L in a single left-to-right scan of pss, and finding their positions in A takes O(n) time in total (specifically, how the current group ends are maintained and queried in total O(n) time is explained in Section 3.3).

Optimising Phase II
In this section, we describe our optimisation of Phase II of Baier's sorting principle. We use Algorithm 1 as a starting point, refine it into a more concrete algorithm, and then alter the order in which elements are inserted (to improve cache performance). Because we essentially optimise the for-loop from Algorithm 1, all optimisations apply equally to the construction of SA and BBWT as discussed at the end of the previous section.
Note that each element i ∈ [0 .. n − 1) has exactly one next smaller suffix, hence there is exactly one j with i ∈ P_j and thus i is inserted exactly once into a new singleton group in Algorithm 1. Therefore, for each group from the Lyndon grouping obtained in Phase I, it suffices to maintain a single pointer to the current start of this group. In Baier [2015], these pointers are stored at the end of each group in A. This leads to them being scattered in memory, potentially harming cache performance. Instead, we store them contiguously in a separate array C, which improves cache locality, especially when there are few groups.
Besides this minor point, there are two major differences between our Phase II and Baier's, both of which are concerned with the iteration over the P_i-sets.

The first difference is the way in which we determine the elements of P_i for some i. The following observations immediately enable us to iterate over P_i.

Lemma 3.17. For i ∈ (0 .. n), P_i is non-empty if and only if S_{i−1} >_lex S_i.

Proof. If S_{i−1} >_lex S_i, then nss[i − 1] = i and thus i − 1 ∈ P_i. For the other direction, assume S_{i−1} <_lex S_i and assume there is some j < i − 1 such that nss[j] = i. By definition, S_j >_lex S_i and S_j <_lex S_k for each k ∈ (j .. i). But by transitivity, we also have S_j >_lex S_{i−1}, which is a contradiction; hence P_i must be empty.

Lemma 3.18. For some j ∈ [0 .. i), we have j ∈ P_i if and only if j's last child is in P_i, or j = i − 1 and S_j >_lex S_i.
Proof. By Lemma 3.17, we may assume P_i ≠ ∅ and j + 1 < i, otherwise the claim is trivially true. If j is a leaf, then we have nss[j] = j + 1 < i, and thus j ∉ P_i by definition. Hence, assume j is not a leaf and has j′ > j as last child, i.e., pss[j′] = j and there is no k > j′ with pss[k] = j. It suffices to show that j ∈ P_i if and only if j′ ∈ P_i. Note that pss[j′] = j implies nss[j] > j′.
⟹: From nss[j′] = i and thus S_k >_lex S_{j′} >_lex S_j (for all k ∈ (j′ .. i)), we have nss[j] ≥ i. Assume nss[j] > i. Then S_i >_lex S_j and thus pss[i] = j, which contradicts j′ being the last child of j.
⟸: From S_i <_lex S_j <_lex S_{j′}, we have nss[j′] ≤ i. Assume nss[j′] < i for a contradiction. For all k ∈ (j .. j′), pss[j′] = j implies S_k >_lex S_{j′}. Furthermore, for all k ∈ [j′ .. nss[j′]), we have S_k ≥_lex S_{j′} >_lex S_{nss[j′]}. Since nss[j] = i > nss[j′] also gives S_{nss[j′]} >_lex S_j, it follows that pss[nss[j′]] = j, which contradicts j′ being the last child of j.

Specifically, (if P_i is not empty) we can iterate over P_i by walking up the pss-tree starting from i − 1 and halting when we encounter a node that is not the last child of its parent. Baier [2015] tests whether i − 1 (respectively pss[j]) is in P_i by explicitly checking whether i − 1 (respectively pss[j]) has already been written to A. This is done by having an explicit marker for each suffix [Baier 2015, 2016]. Reading and writing those markers leads to bad cache performance, because the accessed memory locations are hard to predict (for the CPU/compiler). Lemmata 3.17 and 3.18 enable us to avoid reading and writing those markers. In fact, in our implementation of Phase II, the array A is the only memory written to that is not always in the cache. Lemma 3.17 tells us whether we need to follow the pss-chain starting at i − 1 or not. Namely, this is the case if and only if S_{i−1} >_lex S_i, i.e., i − 1 is a leaf in the pss-tree. This information is required when we encounter i in A during the outer for-loop in Algorithm 1, thus we mark an entry i in A if and only if P_i ≠ ∅. Implementation-wise, we use the most significant bit (MSB) of an entry to indicate whether it is marked or not. By definition, we have S_{i−1} >_lex S_i if and only if pss[i] + 1 < i. Since pss[i] must be accessed anyway when i is inserted into A (for traversing the pss-chain), we can insert i marked or unmarked into A accordingly. Further, Lemma 3.18 implies that we must stop traversing a pss-chain when the current element is not the last child of its parent. We mark the entries in pss accordingly, also using the MSB of each entry. In the rest of this article, we assume the pss-array to be marked in this way.
These optimisations unfortunately come at the cost of 2n additional bits of working memory for the markings. However, since they are integrated into pss and A, they incur no additional cache misses.
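The markings amount to simple bit operations on the array entries; a minimal sketch (helper names ours):

    #include <cstdint>

    constexpr std::uint64_t MSB = std::uint64_t(1) << 63;

    inline std::uint64_t set_mark(std::uint64_t v) { return v | MSB; }
    inline bool is_marked(std::uint64_t v) { return (v & MSB) != 0; }
    inline std::uint64_t value_of(std::uint64_t v) { return v & ~MSB; }

    // For example, when i is inserted into A (cf. Lemma 3.17):
    //     A[k] = (value_of(pss[i]) + 1 < i) ? set_mark(i) : i;
    // A marked A[k] later signals that P_{A[k]} is non-empty, without any
    // additional memory access.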
Let G[i] be the index of the group start pointer of i's group in C. Phase II with our first major improvement compared to Baier's algorithm is shown in Algorithm 3.
The second major change concerns the cache-unfriendliness of traversing and inducing the P_i-sets (i.e., the do-while loop in Algorithm 3). This bad cache performance results from the fact that we have chains of memory accesses where the location of one entry depends on the entry preceding it in the chain (this is also known as pointer-chasing). For instance, we first have to know pss[j] before G[pss[j]] and the corresponding group start pointer can be fetched. As each such location is essentially random, each access is likely to be a cache-miss. One obvious mitigation is to interleave the arrays pss and G such that G[p] and pss[p] are next to each other in memory and we can fetch both with only one cache-miss instead of two. Note that we have a prime example of pointer-chasing here, namely, the traversal of the pss-tree: the next pss-value (and the corresponding group start pointer) cannot be fetched until the current one is in memory.

ALGORITHM 3: Concrete implementation of Phase II of GSACA. After execution, the array A is SA. The array G maps each suffix to its Lyndon group and C maps the Lyndon groups that resulted from Phase I to their current start. The correctness immediately follows from the correctness of Algorithm 1 and Lemmata 3.17 and 3.18.
We can almost entirely eliminate the cache-misses caused by pointer-chasing here: Instead of traversing the P_i-sets one after another, we opt to traverse multiple such sets simultaneously in a sort of breadth-first-search manner. Specifically, we maintain a small (≤ 2^10 elements) queue Q of elements (nodes in the pss-tree) that can currently be processed. Then, we iterate over Q and process the entries one after another. Parents of last children are inserted into Q in the same order as the respective children. After each iteration, we continue to scan over the suffix array and for each encountered marked entry i insert i − 1 into Q until we either encounter an empty entry in A or Q reaches its maximum capacity. This is repeated until the suffix array emerges. The queue size could be unlimited, but limiting it ensures that it fits into the CPU's cache. Figure 7 shows our Phase II on the running example and Algorithm 4 describes it formally in pseudocode. The benefit of this breadth-first search comes from the fact that we can start to fetch required data a few iterations of the repeat-loop in Algorithm 4 before it is required, and thereby ensure that it is in the CPU's cache when needed. Note that this optimisation is only useful when the queue contains many elements (i.e., s is large), otherwise there are not enough iterations of the repeat-loop between inserting an element into the queue and inserting it into A, and we effectively have Algorithm 1 with some additional overhead. Fortunately, in real-world data this is usually the case and the small overhead for maintaining the queue is more than compensated by the better cache performance (cf. Section 4).

Theorem 3.19. Algorithm 4 correctly computes the suffix array from a Lyndon grouping.
ALGORITHM 4: Breadth-first approach to Phase II. The constant w is the maximum queue size. When s is large, it is possible to improve the cache-performance of the repeat-loop by prefetching the used data a few loops ahead. This is not possible in Algorithm 3, because there the address of data accessed in one iteration of the do-while-loop depends on the data accessed in the previous iteration ("pointer-chasing").

Proof. By Lemmata 3.17 and 3.18, Algorithms 1 and 4 are equivalent for a maximum queue size of 1. Therefore, it suffices to show that the result of Algorithm 4 is independent of the queue size. Assume for a contradiction that the algorithm inserts two elements i and j with S_i <_lex S_j belonging to the same Lyndon group with context α, but in a different order than Algorithm 1 would. This can only happen if j is inserted earlier than i. Note that, since i and j have the same Lyndon prefix α, the pss-subtrees T_i and T_j rooted at i and j, respectively, are isomorphic (see Bille et al. [2020]). In particular, the path from the rightmost leaf in T_i to i has the same length as the path from the rightmost leaf in T_j to j. Thus, i and j are inserted in the same order as S_{i+|α|} and S_{j+|α|} occur in the suffix array. Now the claim follows inductively.
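The gain of the breadth-first queue can be realised with explicit software prefetching, as the following sketch of the access pattern illustrates (the prefetch distance and helper names are ours; __builtin_prefetch is a GCC/Clang intrinsic):

    #include <cstdint>
    #include <vector>

    // While Q[h] is processed, the pss-entry of Q[h + DIST] is requested, so
    // it is likely cached once needed. With pointer-chasing (Algorithm 3),
    // the next address is unknown until the current load has completed.
    template <typename Process>
    void process_queue(const std::vector<std::uint32_t>& Q,
                       const std::vector<std::uint32_t>& pss,
                       Process process) {
        constexpr std::size_t DIST = 8; // prefetch distance (tuning parameter)
        for (std::size_t h = 0; h < Q.size(); ++h) {
            if (h + DIST < Q.size())
                __builtin_prefetch(&pss[Q[h + DIST]]); // request cache-line early
            process(Q[h], pss[Q[h]]);
        }
    }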
It is clear that Algorithm 4 has linear time complexity: Each index in [0 .. n) is inserted exactly once into Q, so the repeat-loop runs exactly n times in total.The inner while-loop iterates from i = 1 to n and thus runs exactly n − 1 times.In each iteration of the outer while-loop, at least one element is removed from Q (because s > 0) and thus there can be no more than n iterations.
The amount of working memory (i.e., without input and output) is constant besides the memory for C, G and pss (because the size of the queue Q is bounded by a constant). Both pss and G occupy n words of memory, while the size of C depends on the number of groups in the Lyndon grouping. We apply an optimisation where, for each element i of a singleton group, the current group start is encoded directly in G[i] (suitably marked), and thus no entry for i's group in C is required. Therefore, C can have at most n/2 entries, and in total we have at most 2.5n words of working memory. Note that in most cases, the number of groups in a Lyndon grouping is small compared to the text size (cf. Section 4).

Phase I
In Phase I, a Lyndon grouping is derived from a suffix grouping in which the group contexts have length (at least) one. That is, the suffixes are sorted and grouped by their Lyndon prefixes. Lemma 3.20 describes the relationship between the Lyndon prefixes and the pss-tree that is essential to Phase I of the grouping principle.

Lemma 3.20. Let i ∈ [0 .. n) and let c_1 < c_2 < · · · < c_k be the children of i in the pss-tree. Then L_i = S[i] L_{c_1} L_{c_2} · · · L_{c_k}.

Proof. If k = 0, then i is a leaf and L_i = S[i] (since nss[i] = i + 1). Hence, assume k > 0; then c_1 = i + 1, and it suffices to show that nss[c_j] = c_{j+1} for each j ∈ [1 .. k) and nss[c_k] = nss[i]; the claim then follows from Lemma 2.4 by splitting S[i .. nss[i]) at the children. We have nss[c_j] ≤ c_{j+1}, since otherwise c_{j+1} would be a descendant of c_j rather than a child of i. Now assume m := nss[c_j] < c_{j+1} for a contradiction. We have S_m <_lex S_{c_j} and S_{k′} >_lex S_m for each k′ ∈ (i .. m): for k′ ∈ (c_j .. m) this holds by the definition of nss, and each k′ ∈ (i .. c_j] lies in the subtree of some child c_{j′} with j′ ≤ j, whence S_{k′} ≥_lex S_{c_{j′}} ≥_lex S_{c_j} >_lex S_m (note that S_{c_1} >_lex S_{c_2} >_lex · · · >_lex S_{c_k}, since pss[c_{j+1}] = i < c_j). Together with S_i <_lex S_m, this means that m must be a child of i in the pss-tree with c_j < m < c_{j+1}, which is a contradiction. The equality nss[c_k] = nss[i] follows analogously.
We start from the initial suffix grouping in which the suffixes are grouped according to their first characters. From the relationship between the Lyndon prefixes and the pss-tree in Lemma 3.20, one can get the general idea of extending the context of a node's group with the Lyndon prefixes of its children (in correct order) while maintaining the sorting [Baier 2015]. Note that any node is by definition in a higher group than its parent. Also, by Lemma 3.20, the leaves of the pss-tree are already in Lyndon groups in the initial suffix grouping. Therefore, if we consider the groups in lexicographically decreasing order (i.e., higher to lower) and append the context of the current group to the context of each parent (and insert the parents into new groups accordingly), then each encountered group is guaranteed to be Lyndon [Baier 2015]. Consequently, we obtain a Lyndon grouping. Figure 8 shows this principle applied to our running example.
Formally, the suffix grouping satisfies the following invariant during Phase I before and after processing a group:

Invariant 1. For each suffix i with children c_1 < · · · < c_k in the pss-tree, there is some j ∈ [0 .. k] such that
-c_1, . . ., c_j are in groups that have already been processed,
-c_{j+1}, . . ., c_k are in groups that have not yet been processed, and
-the context of the group containing i is S[i] L_{c_1} · · · L_{c_j}.
Furthermore, each processed group is Lyndon.
Additionally, and unlike in Baier's original approach, all groups created during our Phase I are either Lyndon or only contain elements whose Lyndon prefix is different from the group's context. This has several advantages, which are discussed below.

Definition 3.21 (Strongly Preliminary Group). We call a preliminary group G = ⟨g_s, g_e, |α|⟩ strongly preliminary if and only if G contains only elements whose Lyndon prefix is not α. A preliminary group that is not strongly preliminary is called weakly preliminary.
The following lemma shows that a weakly preliminary group can always be split into a (lower) Lyndon group and a (higher) strongly preliminary group.

Lemma 3.22. For any weakly preliminary group G = ⟨g_s, g_e, |α|⟩ there is some g′ ∈ [g_s .. g_e) such that G′ = ⟨g_s, g′, |α|⟩ is a Lyndon group and G′′ = ⟨g′ + 1, g_e, |α|⟩ is a strongly preliminary group. Splitting G into G′ and G′′ results in a valid suffix grouping.
Proof. Let G = ⟨g_s, g_e, |α|⟩ be a weakly preliminary group and let F ⊂ G be the set of elements from G whose Lyndon prefix is α. For each f ∈ F, we have S_{f+|α|} <_lex S_f (Lemma 2.4), and for each e ∈ G \ F, we have S_{e+|α|} >_lex S_e. By Lemma 3.2, we thus have S_f <_lex S_e for all f ∈ F and e ∈ G \ F. Hence, placing the elements of F in the lower group ⟨g_s, g_s + |F| − 1, |α|⟩ and those of G \ F in the higher group ⟨g_s + |F|, g_e, |α|⟩ results in a valid suffix grouping. Note that, by construction, the former is a Lyndon group and the latter is strongly preliminary.
For instance, in Figure 8 there is a group containing 2, 6, and 12 with context ce. However, 6 and 12 have this context as Lyndon prefix, while 2 has ced. Consequently, 2 will later be moved to a new group. Hence, when Baier [2015] (and Bertram et al. [2021]) create a weakly preliminary group (in Figure 8 this happens while processing the Lyndon group with context e), we instead create two groups, the lower (Lyndon) group containing 6 and 12 and the higher (strongly preliminary) group containing 2.
Fig. 8. Elements in Lyndon groups are marked light gray or green, depending on whether they have been processed already. Note that the applied procedure does not entirely correspond to our algorithm for Phase I; it only serves to illustrate the general sorting principle. The Lyndon prefixes are shown at the top for clarity.

During Phase I, we maintain the suffix grouping using the following data structures:
-An array A of length n containing the unprocessed Lyndon groups and the sizes of the strongly preliminary groups.
-An array I of length n mapping each element s to the start of the group containing it. We call I[s] the group pointer of s.
-A list C storing the starts of the already processed Lyndon groups.
Note that C is an input to Phase II. Transforming the array I into G as required for Phase II is trivial with the help of C after Phase I. In combination, C, G, and pss make up the entire input to Phase II, and the contents of A can be discarded after Phase I.
These data structures are organised as follows. Let G = ⟨g_s, g_e, |α|⟩ be a group. For each s ∈ G, we have I[s] = g_s. If G is Lyndon and has not yet been processed, then we also have s ∈ A[g_s .. g_e] for all s ∈ G, and the elements in A[g_s .. g_e] are sorted in increasing order. If G is Lyndon and has been processed already, then there is some j such that C[j] = g_s. Otherwise, if G is (strongly) preliminary, then A stores the size of G instead of its elements. In contrast to Baier, we have the Lyndon groups in A sorted and store the sizes of the strongly preliminary groups in A as well [Baier 2015, 2016]. The former makes finding the number of children a parent has in the currently processed group easier and faster. The latter makes the separate array of length n used by Baier for the group sizes obsolete [Baier 2015, 2016] and is made possible by the fact that we only write Lyndon groups to A.
As alluded to above, we follow Baier's approach and consider the Lyndon groups in lexicographically decreasing order while updating the groups containing the parents of elements in the current group.
ALGORITHM 5: Phase I: Traversing the groups [Baier 2015, 2016].
    g_e ← n − 1;
    while g_e ≥ 0 do
        g_s ← I[A[g_e]];
        process group ⟨g_s, g_e, ⊥⟩;
        g_e ← g_s − 1;
    end

Note that in Algorithm 5, g_e is always the end of a Lyndon group. This is due to the fact that a child is by definition lexicographically greater than its parent. Hence, when a group ends at g_e and all suffixes in SA(g_e .. n) have been processed, the children of all elements in that group have been processed and it consequently must be Lyndon. Thus, Algorithm 5 actually results in a Lyndon grouping. For a formal proof see Baier [2015].
Of course, we have to explain how to actually process a Lyndon group. This is done in the rest of this section.
Let G = ⟨g_s, g_e, |α|⟩ be the currently processed group and w.l.o.g. assume that no element in G has the root −1 as parent (we do not have the root in the suffix grouping, thus nodes with the root as parent can be ignored here). Furthermore, let 𝒜 be the set of parents of elements in G, i.e., 𝒜 := {pss[s] : s ∈ G}. As noted in Figure 8, we have to consider the number of children an element in 𝒜 has in G. Namely, if a node has multiple children with the same Lyndon prefix, then of course all of them contribute to its Lyndon prefix. This means that we need to move two parents in 𝒜, which are currently in the same group, to different new groups if they have differing numbers of children in G. For example, while processing the group with context e in Figure 8, node 9 has two children in this group, while node 6 only has one. Both are currently in the same group with context c. As dictated by Invariant 1, after processing the group with context e, 9 must be in a group with context cee and 6 must be in a (lower) group with context ce (because ce <_lex cee).
Let 𝒜_ℓ contain those elements from 𝒜 with exactly ℓ children in G. Maintaining Invariant 1 requires that, after processing G, the elements in 𝒜_ℓ ∩ G_g (for each group G_g with context α_g) are in groups with context α_g α^ℓ. Note that, for any ℓ < ℓ′, we have α_g α^ℓ <_lex α_g α^{ℓ′}. Consequently, the elements in 𝒜_ℓ ∩ G_g must form a lower group than those in 𝒜_{ℓ′} ∩ G_g after G has been processed [Baier 2015, 2016]. To achieve this, first the parents in 𝒜_{|G|} are moved to new groups, then those in 𝒜_{|G|−1}, and so on [Baier 2015, 2016].
We proceed as follows. First, determine 𝒜 and count how many children each parent has in G. Then, sort the parents according to these counts using a bucket sort. Because the elements of yet unprocessed Lyndon groups must be sorted in A, this sort must be stable. Further, partition each bucket into two sub-buckets, one containing the elements that should be inserted into Lyndon groups and the other containing those that should be inserted into strongly preliminary groups. Then, for the sub-buckets (in the order of decreasing count; for equal counts: first strongly preliminary, then Lyndon sub-buckets), move the parents into new groups. These steps will now be described in detail.
For brevity, we refer to those elements in 𝒜 that have their last child in G as finalists. Partition 𝒜 into F and N, such that the former contains the finalists and the latter the non-finalists.
To determine the aforementioned sub-buckets, we associate a key with each element in 𝒜 such that (stably) sorting according to these keys yields the desired partitioning. Specifically, for a fixed ℓ, let key(s) = 2ℓ for each s ∈ F ∩ 𝒜_ℓ and key(s) = 2ℓ + 1 for each s ∈ N ∩ 𝒜_ℓ.
As we need to sort stably, the bucket sort requires an additional array B of length |G|, and another array for the bucket counters.
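The stable bucket sort is a standard counting sort; a sketch with the keys as defined above (function signature ours):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Stable counting sort of the parents by key (key(s) = 2l or 2l + 1 as
    // described above); 'out' plays the role of the temporary array B.
    void stable_bucket_sort(const std::vector<int>& parents,
                            const std::vector<int>& key, // key[i] belongs to parents[i]
                            std::vector<int>& out) {
        int max_key = 0;
        for (std::size_t i = 0; i < key.size(); ++i)
            max_key = std::max(max_key, key[i]);
        std::vector<int> bucket(max_key + 2, 0);
        for (std::size_t i = 0; i < key.size(); ++i) ++bucket[key[i] + 1];
        for (int k = 1; k <= max_key + 1; ++k) bucket[k] += bucket[k - 1]; // starts
        out.assign(parents.size(), 0);
        for (std::size_t i = 0; i < parents.size(); ++i) // left to right: stable
            out[bucket[key[i]]++] = parents[i];
    }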
Finding parents is done using the same pss array as in Phase II. Since A[g_s .. g_e] is sorted, the parents can be determined in sorted order as well. We then copy the parents together with their keys into the temporary array B and sort 𝒜 \ 𝒜_1 by the keys. That is, we count the frequency of each key, determine the end of each key's bucket, and insert the elements into the correct bucket in A[g_s + |𝒜_1| .. g_s + |𝒜|). Figure 9 shows how the data is organised during the sorting.

Reordering parents into Lyndon groups.
Let A′ be a sub-bucket of length k containing only parents that will now be moved to a Lyndon group and whose context is extended by α^q for some fixed q. Note that the bucket sort ensures that A′ is sorted. Within each current preliminary group, the elements of A′ belonging to it form a new Lyndon group immediately preceding the remainder of the old group: in a first scan over A′, we decrement the sizes of the old groups (thereby determining the boundaries of the new groups), and in a second scan, we write the elements of A′ to A in sorted order and set their new group pointers.

Reordering parents into strongly preliminary groups. Let A′ be a sub-bucket of length k containing only parents that are now moved to a strongly preliminary group. The general procedure is similar to the reordering into Lyndon groups, but simpler. First, we decrement the sizes of the old groups. In a second scan over A′, we set the new group pointer as above, and in a third scan, we increment the sizes of the new groups.
Note that in the reordering step, we iterate two and three times, respectively, over the elements in a sub-bucket, and in each scan the group pointers are required. Furthermore, the group pointers are updated only in the last scan. As the group pointers are generally scattered in memory, it would be inefficient to fetch them individually in each scan for two reasons. First, a single group pointer could occupy an entire cache-line (i.e., we mostly keep and transfer irrelevant data in the cache). Second, the memory accesses are unnecessarily unpredictable. To mitigate these problems, we pull the group pointers into the temporary array B; that is, in a preliminary scan, we set B[i] to the group pointer of the ith element of the sub-bucket. Of course, in this fetching of group pointers, we have the same problems as before, but during the actual reordering the group pointers can be accessed much more efficiently.
Note that in contrast to Baier [2015], we do not compute the parents on the fly during Phase I but instead construct the pss-array in advance using the linear-time pss-construction algorithm of Bille et al. [2020]. There are two reasons for this: first, determining the parents on the fly as done in Baier [2015] requires a kind of pointer jumping that is very cache-unfriendly and hence slow; and second, it is not clear how to efficiently determine on the fly whether a node is the last child of its parent.
Another difference that speeds up the algorithm is that we only write Lyndon groups to A. This way, we do not have to rearrange elements in weakly preliminary groups when some of their elements are moved to new groups. Furthermore, it is possible to have the elements in Lyndon groups sorted in A, which makes determining the parents and their corresponding keys easier and faster.
The time complexity of Phase I is clearly linear, because each index in [0 .. n) is in exactly one group and each group is processed in time linear to its size (it is easily verified that each step involved in processing a group, namely determining the parents, sorting them, and moving them to new groups, takes linear time).
In terms of working memory, we require pss, I and C for maintaining the suffix grouping, where pss and I each require n words of memory. As explained at the end of Section 3.3, C requires at most m/2 words of memory when the already processed groups contain m elements in total. For processing a group with ℓ distinct parents, the only additional memory needed is the array B of length ℓ. It is clear that ℓ can be at most (n − m)/2, since those parents cannot be in already processed groups. Consequently, we require at most 2n + m/2 + (n − m)/2 = 2.5n words of working memory in total. Note that, because both the number of groups and the maximum size of a group are usually relatively small in practice, we can often use the memory in A previously occupied by already processed groups to store B and thus require less than the 2.5n words of working memory.

Initialisation
In the initialisation, the pss-array with markings must be computed. We use a variant of the linear-time pss-construction algorithm of Bille et al. [2020], which we adapted to mark each last child i using the most significant bit of pss[i]. (This modification is straightforward and will thus not be explained here.) Further, the initial suffix grouping needs to be constructed. That is, we must create two groups (buckets) for each character, one Lyndon and one strongly preliminary, where the Lyndon groups contain exactly the leaves. Note that i is a leaf of the pss-tree if and only if S_i >_lex S_{i+1}. During a right-to-left scan over S, it is possible in constant time to decide for each i < n − 1 whether S_i >_lex S_{i+1} holds [Ko and Aluru 2003; Nong et al. 2009]. Thus, we can determine the size and start position of each group in O(n + σ) time, where σ is the size of the alphabet. In a second right-to-left scan over S, we can then set the references to the group starts in the array I and write the leaves to SA.
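Deciding S_i >_lex S_{i+1} for all i reduces to a single right-to-left scan, analogous to the L/S-type classification used by SA-IS [Nong et al. 2009]; a minimal sketch (function name ours):

    #include <string>
    #include <vector>

    // greater[i] == true  <=>  S_i >lex S_{i+1}, i.e., i is a leaf of the
    // pss-tree. If S[i] != S[i+1], the first characters decide; otherwise the
    // answer equals that of position i + 1. Total time O(n).
    std::vector<bool> leaf_flags(const std::string& S) {
        const int n = S.size();
        std::vector<bool> greater(n, true); // S_{n-1} >lex S_n = empty string
        for (int i = n - 2; i >= 0; --i)
            greater[i] = (S[i] != S[i + 1]) ? S[i] > S[i + 1] : greater[i + 1];
        return greater;
    }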

Computing the Extended Burrows-Wheeler Transform
The original eBWT of a multiset M of strings is derived by taking the last characters of the conjugates of the strings in M arranged in infinite periodic order [Mantaci et al. 2005]. The original definition of the eBWT assumed that M only contains primitive strings [Mantaci et al. 2005], but as in Boucher et al. [2021] we do not have this restriction.
For convenience, we assume that M is given in concatenated form as a single string T and a list of lengths l_0, ..., l_{|M|−1}, and we let s_j = l_0 + ... + l_{j−1} denote the start position of the j-th string. (At the end of this section, we explain that we do not actually need M to be given in concatenated form.) For each i ∈ [0 .. n) there is some j ∈ [0 .. |M|) such that s_j ≤ i < s_{j+1}. We call T[s_j .. s_{j+1}) the source string of i. With a slight abuse of notation, in this section we denote by T_i the (i − s_j)-th conjugate of the source string T[s_j .. s_{j+1}), i.e., T_i = T[i .. s_j + l_j) T[s_j .. i). As with the BBWT, we first define a total order on the indices [0 .. n) and then use that to define the eBWT.

Definition 3.23. We write i <^g_inf j if and only if T_i <_ω T_j, or root(T_i) = root(T_j) and i > j.

Note the similarity to Definition 3.7. Like for the BBWT, the relative order in the case of equal roots is irrelevant for the eBWT, but it was chosen to conform with the lexicographical order of the Lyndon factors so that it is easier to adapt our algorithm to compute the eBWT.
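For intuition, the ω-order on conjugates can be evaluated directly: comparing u^∞ with w^∞ character-wise is conclusive within the first |u| + |w| positions (if they agree that long, Fine and Wilf's theorem yields a common root). A hedged sketch, not part of our algorithm, assuming non-empty inputs:

    #include <string>

    // Compare u^infinity with w^infinity (omega-order); returns -1, 0, 1.
    // If the first |u| + |w| characters agree, the infinite powers are
    // equal (u and w then share a common root by Fine and Wilf).
    int omegaCompare(const std::string& u, const std::string& w) {
        size_t bound = u.size() + w.size();
        for (size_t k = 0; k < bound; ++k) {
            char a = u[k % u.size()], b = w[k % w.size()];
            if (a != b) return a < b ? -1 : 1;
        }
        return 0;
    }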
We can then define the permutation GSA_• sorting the indices according to <^g_inf (Definition 3.24), analogous to the suffix array (lexicographic order of suffixes) and the circular suffix array (infinite periodic order of the conjugates of the Lyndon factors); Lemma 3.25 states that the eBWT of M can be derived from GSA_•.

Proof. This immediately follows from the fact that the Lyndon factors of π(T) directly correspond to the roots of the strings in M (Lemma 3.28 below), Definitions 3.24 and 3.8, and Lemma 3.11.
As a result of Lemma 3.25, Bannai et al. [2021] obtain their (linear-time) eBWT construction algorithm from their BBWT algorithm [Bannai et al. 2021]. However, computing the eBWT via the BBWT requires two preprocessing steps: First, the minimum conjugates of the strings in M must be found, and then these minimum conjugates must be sorted lexicographically. Boucher et al. [2021] showed that the algorithm of Bannai et al. [2021] can be adapted such that these preprocessing steps are not necessary [Boucher et al. 2021]. We now demonstrate that our algorithm can be adapted such that it computes the eBWT of M without the preprocessing step of sorting the minimum conjugates. Although we still require the canonicalisation of the input strings, our implementation is much faster than Boucher et al. [2021]'s (cf. Section 4).

Fig. 10. Shown are the generalised circular suffix array, the lpss-tree, and the pss_{π(T)}-tree for M = {b, bcbc, b, abcbc, bc} (cf. Figure 3), where the indices refer to the concatenation T = bbcbcbabcbcbc. Note that the structures of the lpss-tree and the pss_{π(T)}-tree are very similar, with the only difference being the order of the children of the artificial root (and, of course, the indices).
Adapting our algorithm to compute GSA_•. In the following, we assume that the strings in M are in canonical form.
After canonicalising the input strings, the only algorithmic change compared to our SA_•-algorithm concerns the initialisation. Namely, it suffices to compute the pss-tree for each canonicalised input string separately.
Specifically, we construct an array lpss where lpss[i] points to the previous smaller suffix within the source string of i (i.e., each lpss-value is restricted to the respective input string). Formally, for each i ∈ [0 .. n) with s_j ≤ i < s_{j+1}, we define lpss[i] = max({ k ∈ [s_j .. i) : T[k .. s_{j+1}) <_lex T[i .. s_{j+1}) } ∪ {−1}). Figure 10 shows the lpss-array for M = {b, bcbc, b, abcbc, bc}. Intuitively, the correctness of our GSA_•-algorithm (which is the same as our SA_•-algorithm except that it operates on the lpss-tree instead of the pss-tree) follows from the fact that it behaves exactly as our SA_•-algorithm would on the concatenation of the strings in M arranged in lexicographically decreasing order. This is because pss for this concatenation has the same structure as lpss (Lemma 3.32 below) and both Phase I and Phase II operate only on the given pss-tree.
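For illustration, lpss can be computed per source string with the classical previous-smaller-value stack, using naive suffix comparisons; this sketch realises the definition only and is not the linear-time algorithm of Bille et al. [2020] that we actually use:

    #include <string>
    #include <vector>

    // Naive lpss for one source string str (local indices): maintain a
    // stack of positions with lexicographically increasing suffixes; the
    // top after popping all non-smaller suffixes is the previous smaller
    // suffix. Worst-case quadratic due to the naive comparisons.
    std::vector<long> lpssNaive(const std::string& str) {
        long n = (long)str.size();
        std::vector<long> lpss(n);
        std::vector<long> stack;
        for (long i = 0; i < n; ++i) {
            while (!stack.empty() &&
                   str.compare(stack.back(), n - stack.back(), str, i, n - i) >= 0)
                stack.pop_back(); // suffix at stack.back() is not smaller
            lpss[i] = stack.empty() ? -1 : stack.back();
            stack.push_back(i);
        }
        return lpss;
    }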
To actually prove this, we show that each created grouping is order-isomorphic to a grouping created when running our SA_•-algorithm on the concatenation of the strings in M sorted in decreasing lexicographic order. The following definition provides that isomorphism (the proof that it actually is this order isomorphism is given in Theorem 3.35).
Definition 3.26. Let π be the (unique) permutation of [0 .. n) whose application to T results in the concatenation of the strings in M sorted in decreasing lexicographic order, specified by three properties: The first two properties sort the start positions of the strings in M (with ties broken using the text positions), and the third property dictates that (the indices within) the strings in M remain contiguous.
Figure 10 shows the permutation π, the permuted string π(T), and the tree for pss_{π(T)} for an example.
Note that we do not need to compute π; it only serves to prove that our approach is correct. We claim that the permutation π is an order isomorphism in the sense that i <^g_inf j holds if and only if π(i) <^{π(T)}_inf π(j) (where <^{π(T)}_inf compares the conjugates of the Lyndon factors of π(T)). This claim is proven in Lemma 3.29. We say two groupings G_1, ..., G_k and G′_1, ..., G′_k are isomorphic if and only if i ∈ G_j implies π(i) ∈ G′_j and vice versa, for each j ∈ [1 .. k]. We will show later that our SA_•-algorithm invoked with π(T) and pss_{π(T)} treats each π(i) exactly as the same algorithm invoked with T and lpss treats i. That is, applying our SA_•-algorithm to T and lpss results in GSA_• by Lemma 3.25.

Lemma 3.27. For two different Lyndon words u and w, we have u <_lex w if and only if u^n <_lex w^m for all n, m ∈ N^+.
Proof. Let u and w be Lyndon words with u <_lex w and consider any n, m ∈ N^+. We show that u <_lex w implies u^n <_lex w^m; the other direction then follows from symmetry (if u^n ≤_lex w^m, then u >_lex w cannot hold, because it would imply u^n >_lex w^m).
If u is not a prefix of w, then there is a mismatching character and u^n <_lex w^m follows trivially. Thus, assume w = uv for a non-empty v. By definition of the lexicographic order, we have u^{n−1} <_lex u^n, and since w is a Lyndon word, we also have v >_lex w = uv. Lemma 3.2 then implies u^n <_lex w^m.

Lemma 3.28. The Lyndon factors of π(T) are exactly the roots of the strings in M. Formally, for each j ∈ [0 .. |M|) and each k ∈ [0 .. l_j/p), the substring π(T)[π(s_j) + kp .. π(s_j) + (k + 1)p) is a Lyndon factor of π(T) equal to the root of T[s_j .. s_j + l_j), where p is the period of T[s_j .. s_j + l_j) (i.e., the length of its root).
Proof. Since the strings in M are in canonical form, their roots are Lyndon words. Now consider any two strings T[s_j .. s_j + l_j) and T[s_{j′} .. s_{j′} + l_{j′}) in M with T[s_j .. s_j + l_j) <_lex T[s_{j′} .. s_{j′} + l_{j′}). By Lemma 3.27, this implies root(T[s_j .. s_j + l_j)) <_lex root(T[s_{j′} .. s_{j′} + l_{j′})). By Definition 3.26, we also have π(s_j) > π(s_{j′}). Therefore, π(T) consists of the roots of the strings in M arranged in lexicographically non-increasing order. By the Chen-Fox-Lyndon theorem (Theorem 2.5), those roots are the Lyndon factors of π(T).
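As a quick sanity check of Lemma 3.28 on the multiset from Figure 10 (a worked instance we add here for illustration):

    % M = {b, bcbc, b, abcbc, bc} with T = bbcbcbabcbcbc.
    % Sorting the strings of M in decreasing lexicographic order gives
    %   bcbc >_lex bc >_lex b = b >_lex abcbc, hence
    \pi(T) = bcbc \cdot bc \cdot b \cdot b \cdot abcbc = bcbcbcbbabcbc,
    % and the Lyndon factorisation of \pi(T) is
    \pi(T) = (bc)(bc)(bc)(b)(b)(abcbc),
    % i.e., exactly the roots of the strings in M (with multiplicities
    % l_j/p) in non-increasing order.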
Note that we have now proven Lemma 3.25. The following lemma is slightly stronger than Lemma 3.25 and shows that π is actually an order isomorphism.

Lemma 3.29. For any i, j with i <^g_inf j, we have π(i) <^{π(T)}_inf π(j), where <^{π(T)}_inf refers to π(T).
Proof. Consider any i and j with i <^g_inf j. By Definition 3.23, we have either T_i <_ω T_j, or root(T_i) = root(T_j) and i > j. By Lemma 3.28, the conjugates of the Lyndon factors of π(T) starting at π(i) and π(j) are root(T_i) and root(T_j), respectively. According to Definition 3.7, this implies π(i) <^{π(T)}_inf π(j) in both aforementioned cases, and thus the claim.
Corollary 3.30. When SA_• is the circular suffix array of π(T), we have SA_•[k] = π(GSA_•[k]) for all k ∈ [0 .. n).

The following simple lemma shows that the relative lexicographic order of two suffixes originating from the same Lyndon factor can be decided using only that Lyndon factor. We will use it to prove that the structural similarity between pss_{π(T)} and lpss observed in Figure 10 is not coincidental.
Lemma 3.31. Let T[s .. e) be a Lyndon factor of T. For any i, j ∈ [s .. e), we have T[i .. n) <_lex T[j .. n) if and only if T[i .. e) <_lex T[j .. e).

Proof. Consider some i, j with T[i .. e) <_lex T[j .. e). We show that this implies T[i .. n) <_lex T[j .. n). Assume that T[i .. e) is a proper prefix of T[j .. e) (otherwise the claim holds trivially as in the previous case). Then j < i and T[j .. n) = T[i .. e) T[j + (e − i) .. n), so it suffices to show T[e .. n) <_lex T[j + (e − i) .. n). Since T[s .. e) is a Lyndon word, we have T[j + (e − i) .. e) >_lex T[s .. e), and since T[s .. e) is also a Lyndon factor, we have T[s .. n) >_lex T[e .. n). In combination, this gives T[j + (e − i) .. n) >_lex T[s .. n) >_lex T[e .. n) and thus the claim.

Lemma 3.32. Let pss_{π(T)} be the pss-array of π(T). Then, for each i ∈ [0 .. n), we have pss_{π(T)}[i] = π(lpss[π^{−1}(i)]), where we set π(−1) = −1.

Proof. By Lemma 3.28, pss_{π(T)} is solely determined by the roots of the strings in M and their frequencies. Therefore, we may assume for ease of notation that all strings in M are primitive (and are thus Lyndon words). Now, consider some i ∈ [0 .. n) with π(s_j) ≤ i < π(s_j) + l_j. Note that π(T)[π(s_j) .. π(s_j) + l_j) is a Lyndon factor of π(T) by Lemma 3.28. There are two cases:
- If i = π(s_j), then we have lpss[π^{−1}(i)] = −1 by definition of lpss. By simple properties of the Lyndon factorisation, π(T)_i has no previous smaller suffix and the claim holds.
- Otherwise (i.e., π(s_j) < i < π(s_j) + l_j), we have p := lpss[π^{−1}(i)] ≥ s_j, since T[s_j .. s_j + l_j) = π(T)[π(s_j) .. π(s_j) + l_j) is a Lyndon word. Lemma 3.31 then implies pss_{π(T)}[i] = π(p), and thus the claim.
As with the BBWT, we now formally define the indices of the next smaller conjugates, and later (Lemma 3.34) show that they correspond to nsc of π(T) as defined for the BBWT in Section 3.2.

Definition 3.33 (nsc′). Let p_1 < ... < p_k be exactly the indices where lpss is −1, and set p_{k+1} = n for convenience. For i ∈ [0 .. n) with p_j ≤ i < p_{j+1}, we define nsc′(i) as the next smaller conjugate of T_i: nsc′(i) = p_j if no k ∈ (i .. p_{j+1}) satisfies T[k .. p_{j+1}) <_lex T[i .. p_{j+1}), and nsc′(i) = min{ k ∈ (i .. p_{j+1}) : T[k .. p_{j+1}) <_lex T[i .. p_{j+1}) } otherwise. In the first case, the root of the string that i belongs to (the Lyndon word T[p_j .. p_{j+1})) has no nonempty smaller suffix after i, and thus the next smaller conjugate is the root itself. In the second case, this next smaller suffix coincides with the next smaller conjugate (as implied by Lemma 3.31). Note that we define the next smaller conjugate of a Lyndon word to be the Lyndon word itself, i.e., nsc′(p_j) = p_j for each j.
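A direct (quadratic, purely illustrative) realisation of Definition 3.33, assuming an lpss array as above:

    #include <string>
    #include <vector>

    // nsc'(i): the next smaller suffix within i's Lyndon word
    // T[p_j .. p_{j+1}), or p_j itself if no such suffix exists.
    // The factor boundaries are exactly the positions where lpss is -1.
    std::vector<long> nscPrimeNaive(const std::string& T,
                                    const std::vector<long>& lpss) {
        long n = (long)T.size();
        std::vector<long> start(n), end(n), nsc(n);
        for (long i = 0; i < n; ++i)       // factor start of position i
            start[i] = (lpss[i] == -1) ? i : start[i - 1];
        for (long i = n - 1; i >= 0; --i)  // factor end of position i
            end[i] = (i == n - 1 || lpss[i + 1] == -1) ? i + 1 : end[i + 1];
        for (long i = 0; i < n; ++i) {
            long s = start[i], e = end[i];
            nsc[i] = s;                    // first case: the root itself
            for (long k = i + 1; k < e; ++k) // second case: next smaller suffix
                if (T.compare(k, e - k, T, i, e - i) < 0) { nsc[i] = k; break; }
        }
        return nsc;
    }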
Lemma 3.34. For each i ∈ [0 .. n), we have nsc_{π(T)}(π(i)) = π(nsc′(i)).

Proof. Distinguish the two cases of Definition 3.33. In either case, the claim immediately follows from Definition 3.33.

Now we have shown that π relates lpss to pss_{π(T)}, nsc′ to nsc_{π(T)}, and GSA_• to SA_• of π(T), and we are thus in a position to show that our proposed GSA_•-algorithm is correct.

Theorem 3.35. Using our algorithm for SA_• on T with lpss instead of pss results in GSA_•.
Proof. We will show that π provides an isomorphism between the m-th grouping G_{m,1}, ..., G_{m,k_m} created during our proposed algorithm for GSA_• and the m-th grouping G′_{m,1}, ..., G′_{m,k_m} created during our algorithm for SA_•, where G′_{m,i} is defined as the group containing {π(j) | j ∈ G_{m,i}}. Corollary 3.30 implies that the groups defined this way are valid according to Definition 3.1 (with SA of course swapped for SA_• or GSA_•) and that this is sufficient for the correctness of our algorithm for GSA_•.
For Phase I, we show this inductively. In the initial suffix grouping, the group in which an index i is placed depends purely on the i-th character and on whether i is a leaf in the lpss-tree. Therefore, Lemma 3.32 and Corollary 3.30 imply that the claim holds for the initial suffix groupings (m = 0). Now consider the i-th (1-based) iteration of Algorithm 5, that is, we have a grouping {..., [g_s .. g_e]} beforehand. By Lemma 3.32, π provides a bijection between the multisets of parents, i.e., between {lpss[GSA_•[j]] : ...} and {pss_{π(T)}[SA_•[j]] : ...}. Since the changes to the grouping depend only on those parents and on the number of children in G_{i,k_i+1−i} (or G′_{i,k_i+1−i}), the groupings are still isomorphic after the i-th iteration.

Now consider Phase II, specifically the first for-loop in Algorithm 2, and let i ∈ G_j be such that lpss[i] = −1, where G_j = ⟨g_s, g_e, |α|⟩ is a group in the grouping resulting from Phase I. By Corollary 3.30, it suffices to show that i is inserted at index SA_•^{−1}[π(i)]. By Lemma 3.32 and Corollary 3.30, we have pss_{π(T)}[π(i)] = −1 and g_s ≤ SA_•^{−1}[π(i)] ≤ g_e, respectively. Thus, it suffices to show that for each i′ ∈ G_j with lpss[i′] = −1, we have i < i′ if and only if π(i) < π(i′) (the for-loop iterates from left to right). Since G_j is a Lyndon group and lpss[i] = lpss[i′] = −1, we have root(T_i) = root(T_{i′}), which, in combination with Definition 3.26, implies that i < i′ holds if and only if π(i) < π(i′) holds.

Now consider the second for-loop in Algorithm 2. By Definitions 3.14 and 3.33 and Lemmata 3.34, 3.17, and 3.18, the P_i-sets correspond under π for each i ∈ [0 .. n), where q is maximal such that lpss^k[i] is the last child of lpss^{k+1}[i] for each k ∈ [0 .. q). It inductively follows that applying an implementation of the second for-loop in Algorithm 2 that relies on Lemmata 3.17 and 3.18 to compute the P_i-sets (such as Algorithms 3 and 4) with lpss instead of pss results in GSA_•, according to Lemma 3.29.
Similar to our BBWT algorithm, we compute an array GSA′_• that can be derived from GSA_• by replacing s_j with s_{j+1}. Formally, GSA′_•[k] = s_{j+1} if GSA_•[k] = s_j for some j, and GSA′_•[k] = GSA_•[k] otherwise. This is done for the same reasons noted at the end of Section 3.2, namely, first, that deriving the eBWT from GSA′_• is simpler than deriving it from GSA_• (cf. Equation (2)), and second, that it simplifies Phase II.
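Under this replacement convention, deriving the eBWT becomes a single scan; a minimal sketch (array names are assumptions, cf. Equation (2)):

    #include <string>
    #include <vector>

    // With GSA'_• (gsa), every entry is >= 1, since each start position
    // s_j has been replaced by s_{j+1}; the last character of the
    // conjugate ranked i is then simply the character preceding gsa[i].
    std::string ebwtFromGsaPrime(const std::string& T,
                                 const std::vector<long>& gsa) {
        std::string ebwt(gsa.size(), '\0');
        for (size_t i = 0; i < gsa.size(); ++i)
            ebwt[i] = T[gsa[i] - 1];
        return ebwt;
    }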
Implementation notes. Finding the minimum conjugate of a string can be done in linear time, specifically using at most n + d/2 character comparisons, where d is the length of the string's root [Shiloach 1981]. The algorithm of Shiloach [1981] is a two-stage algorithm. The first stage rules out indices that cannot (uniquely) correspond to minimum conjugates at a rate of one index per comparison, but it can exclude at most half of the indices. The second stage then finds the answer among the remaining indices and takes up to two comparisons per remaining index [Shiloach 1981]. We only use the second stage of Shiloach [1981]'s algorithm (which requires at most 2n character comparisons), since it is already quite fast (cf. Section 4) and much simpler to implement.
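For comparison, canonicalisation only needs some linear-time minimum-rotation routine; a compact classic alternative (not Shiloach's algorithm and not the one we implement) is the well-known two-pointer least-rotation scan:

    #include <string>

    // Classic O(n) least-rotation algorithm: returns the start index of
    // the lexicographically minimal rotation of s.
    size_t leastRotation(const std::string& s) {
        std::string t = s + s;  // the doubled string covers all rotations
        size_t n = s.size(), i = 0, j = 1, k = 0;
        while (i < n && j < n && k < n) {
            char a = t[i + k], b = t[j + k];
            if (a == b) { ++k; continue; }
            if (a > b) i += k + 1; else j += k + 1; // the loser cannot win
            if (i == j) ++j;
            k = 0;
        }
        return i < j ? i : j;
    }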
The index set I, i.e., the set of indices where GSA_• contains the start index of a string in M (before canonicalisation), can be computed trivially from GSA_• and the indices of the smallest rotations of the input strings in O(n) time, using, e.g., the memory already allocated for G (cf. Section 3.3).
It is not necessary to use additional working space for the string T composed of the canonicalised strings in M. Note that T is only needed in three places: computing lpss, setting up the initial group structure, and deriving the eBWT from GSA′_•. Setting up the initial group structure simply involves two scans over each input string (cf. Section 3.5) and can thus be executed trivially on the unmodified string. Since the array containing the references to the current groups (called I in Phase I/Section 3.4 and G in Phase II/Section 3.3) is not required before computing lpss and after computing GSA_•, we can use it to temporarily store T (the concatenated canonicalised strings from M). Note that this is also the reason why we do not need M to be given in concatenated form; we do not immediately operate on M as given anyway. While they are needed, the indices of the smallest rotations of the input strings can be stored in the memory designated for returning the index set I. Therefore, our eBWT and BBWT construction algorithms require exactly the same amount of working memory.
Since canonicalising the input strings and computing lpss can be done in linear time, the linear time complexity of our eBWT algorithm immediately follows from the linear time complexity of our BBWT algorithm.

EXPERIMENTS
Our implementation FGSACA of the optimised GSACA as well as our BBWT and eBWT construction algorithms are publicly available. We compare our SACA with the GSACA implementation by Baier [2015, 2016] and the double sort algorithms DS1 and DSH by Bertram et al. [2021]. The latter two also use the grouping principle but employ integer sorting and have super-linear time complexity. DSH differs from DS1 only in the initialisation: in DS1, the suffixes are sorted by their first character, while in DSH, up to eight characters are considered [Bertram et al. 2021]. We further include DivSufSort 2.0.2, since it is used by Bertram et al. [2021] as a reference, as well as libsais 2.7.3 and sais-lite 2.4.1. Both libsais and sais-lite are implementations of the SA-IS algorithm [Nong et al. 2009], but the former uses cache-prefetching techniques to outperform sais-lite (and all other SACAs known to us) on real-world data.
We compare our eBWT construction algorithm with PFP-eBWT by Boucher et al. [2021], since PFP-eBWT is "the only tool up to date that computes the eBWT according to the original definition" [Cenzato and Lipták 2022], and with cais by the same authors, because we believe it to be a fairer comparison: Like our algorithm, cais computes GSA_• and derives the eBWT from that, while PFP-eBWT uses prefix-free parsing, applies cais to the parse, and derives the eBWT from the result and the lexicographically sorted dictionary [Boucher et al. 2021]. For extremely repetitive input (such as genomes from individuals of the same species), the total length of the parse and the dictionary is expected to be significantly smaller than the text itself, and thus PFP-eBWT is expected to be faster on such data. Note that this also means that it is possible to use our algorithm in PFP-eBWT instead of cais.
Besides cais, we are only aware of two non-trivial BBWT construction algorithms, namely, Yuta Mori's algorithm in OpenBWT 2.0.0 and mk-bwts by Neal Burns. The former is claimed by its author to have linear time complexity and seems to be an adaptation of the SA-IS suffix sorting algorithm, while the latter modifies an already computed suffix array until the BBWT can be derived from it in the same way as we derive it from SA_• [Bannai et al. 2021]. Originally, mk-bwts uses DivSufSort for suffix sorting, but since libsais is much faster, we use libsais instead in our comparison.
All experiments were conducted on a Linux 5.4.0 machine with an AMD EPYC 7742 processor and 256 GB of RAM. All algorithms were compiled with GCC 10.3.0 with the flags -O3 -funroll-loops -march=native -DNDEBUG or, if applicable, the flags that they were distributed with. Since PFP-eBWT is a multi-stage program that reads data from and writes data to disk (an SSD in our case) in between stages, we report for it both the wall clock time (as with the other algorithms) and the system time (the CPU time spent in the kernel, e.g., for file access). We report the maximum resident memory as working memory (measured via the GNU Time utility). The data written to (or read from) disk is not counted as working memory. For all other algorithms, we first load the text into RAM and allocate the memory for the output (either SA, eBWT, or BBWT) and then start measuring the wall clock time. For these, the working memory is the maximum amount of allocated memory excluding the memory for input and output. The memory for the input is read-only.
Each algorithm was executed five times on each test case, and we use the mean as the final result. We evaluated the SA- and BBWT-algorithms on data from the Pizza & Chili corpus and Manzini's corpus. We also include strings from the artificial skyline string family, for which the SA-IS algorithm reaches maximum recursion depth [Bingmann et al. 2016]. Moreover, we include the strings a^{n−1}b and b^{n−1}a with n = 10^8 (aab.8, bba.8), which maximise the number of unique Lyndon prefixes (and thus the number of groups in the Lyndon grouping) and the number of Lyndon factors, respectively. To test the algorithms on large inputs for which 32-bit integers are not sufficient, we also use a dataset containing larger texts, namely, the first 10^10 bytes of the English Wikipedia dump from June 1, 2022, and the human DNA concatenated with itself. The eBWT-algorithms were evaluated on 10^4 SARS-CoV-2 genomes (covid4), a set of 10^7 real reads with 101 bp each (reads), the first 10^3, 10^4, and 10^5 pages of the above Wikipedia dump (articles3, articles4, and articles5), and three sets of random strings random2_6, random4_4, and random6_2, where randomi_j consists of 10^j strings of length 10^i each. No string collection contained duplicate strings.
An overview of the test data is provided in Tables 1 and 2.
For the suffix array construction algorithms, the times are shown in Tables 3 and 4. In general, all linear-time algorithms were faster on the more repetitive datasets, on which the differences between those algorithms were also smaller.
DS1 is in most cases slower than DSH; the biggest outliers are the random texts, on which the initialisation of DSH is several times slower (per input symbol) than on other texts.
Especially notable is the difference in the time required for Phase II: Our Phase II is up to 77% faster (tm29) than Phase II of DSH, with a mean of 56% and a median of 60%. (The only case where our Phase II is slower than DSH's is aab.8.) Our Phase I is also significantly faster than Phase I of DS1 (median: 44%). Conversely, Phase I of DSH is much faster than our Phase I (median: 25%). However, this is only due to the more elaborate construction of the initial suffix grouping, as demonstrated by the much slower Phase I of DS1 (which is otherwise the same as Phase I of DSH). Note that, unlike GSACA, DS1, and DSH, we need to compute the pss-array before Phase I, which takes between 6% (skyline28.txt) and 33% (world_leaders) of the total time (median: 16%). (GSACA also computes pss, but on the fly during Phase I and without the markings for the last children.)

In the tables reporting the times of the BBWT construction algorithms, all times are given in seconds per 100 · 2^20 characters (100 MiB). For cais, init refers to the construction of the bit vector with rank-select support that cais uses to mark the start indices of strings; note that deriving the BBWT is comparatively slow in cais because of these rank-select queries. For mk-bwts, the SA stage refers to the computation of the suffix array via libsais, SA^{−1} is the computation of the inverse suffix array, and fix is the "Fix sort order" stage, where the suffix array is adjusted so that the BBWT can be derived.
Our BBWT-algorithm GBBWT is up to 69% (random8) faster than OpenBWT (median: 64%; the smallest difference occurred on world_leaders). Compared to mk-bwts, GBBWT is faster in the cases where the "Fix sort order" stage becomes slow due to its quadratic time complexity and/or libsais is already slower than FGSACA. Interestingly, on Large, computing the inverse permutation of the suffix array takes much more time, approximately a quarter of the runtime of mk-bwts. This is the reason why GBBWT is faster than mk-bwts on that data set, despite the data being structurally similar to PC-Real, where libsais is usually faster than FGSACA. GBBWT is also between 57% (pitches) and 73% (fib41) faster than cais, with an average difference in speed of 68%.

The amount of working memory is shown in Tables 8 and 9. OpenBWT only requires slightly more than one word per input character, mk-bwts requires two words per character (SA and SA's inverse), and GBBWT uses up to one word per character more than FGSACA. Seemingly, cais always allocates the theoretical maximum amount of memory needed for the SA-IS variant it uses (per input character, one word for SA_• and another one for working memory), unlike OpenBWT, which only allocates as much working memory as needed. Additionally, cais uses a bit vector with rank-select support (2 bits per input character).
Table 7 shows the times measured for the eBWT construction algorithms. Our algorithm GEBWT is always significantly faster than both cais and PFP-eBWT, except on the covid4-dataset, where the strings in the set are extremely similar and thus the parse and dictionary are very small compared to the input texts (Table 7). In this case, the amount of working memory PFP-eBWT needs is also very small (Table 10). In the other cases, PFP-eBWT is significantly slower than cais, which results from the fact that here the dictionary is approximately as large as the input. In most other cases, PFP-eBWT's memory consumption sits between the working memory of GEBWT and cais. GEBWT and cais use the expected amount of memory, namely, a bit more than 12 and 8 bytes per input character, respectively.

In Table 7, all times are given in seconds per 100 · 2^20 characters (100 MiB); cais only supports 32-bit indices, so it could not be run on articles5. The lower part of Table 7 gives the number of words in the parse and the size of the dictionary (measured as the sum of the lengths of the dictionary entries) as reported by PFP-eBWT. In the working-memory tables, the SA-IS variants libsais and sais-lite as well as DivSufSort use only a constant amount of memory on our data set and are thus omitted; note that for the BBWT-algorithms, the output is a character array and thus SA_• is counted as working memory.

Deriving the BBWT or eBWT from SA_• or GSA′_• is a rather time-consuming step in our algorithms (especially on very large inputs), since the memory accesses to the input are virtually randomly distributed and hence cache-unfriendly. It may be worthwhile to investigate whether it is possible to effectively use the additional information present during Phases I and II. For instance, in Phase II, we fetch pss[s − 1] (lpss[s − 1]) for each entry s of SA_• (GSA_•), and hence this cache-inefficiency could be mitigated by interleaving the input string with pss (lpss).

Fig. 3. Constructing the eBWT of M = {b, bcbc, b, abcbc, bc}. The indices refer to the starting index in the concatenation T = bbcbcbabcbcbc of the strings in M. Strings in M are coloured. Reading the characters in the last column (from top to bottom) gives the eBWT. The corresponding index set is {0, 1, 2, 5, 6}. For the relative order of conjugates with equal roots, see Definition 3.23.

Fig. 4. Lyndon grouping G_1, ..., G_8 of decedacebceece$ with group contexts. The Lyndon prefixes and the suffix array are shown for improved clarity. Note that this grouping is the Lyndon grouping with the smallest number of groups.

Fig. 5. Refining a Lyndon grouping for S = decedacebceece$ (see Figure 4) into the suffix array, as done in Algorithm 1. Already processed elements are coloured light gray, while inserted but not yet processed elements are coloured green. Note that uncoloured entries are not actually present in the array but only serve to indicate the current Lyndon grouping. The Lyndon prefixes are shown at the top for clarity.

Fig. 6. Deriving SA_• of our running example decedacebceece$, given its Lyndon grouping. Already processed elements are coloured light gray, while inserted but not yet processed elements are coloured green. The indices of Lyndon factors have a coloured border (cf. Figure 1). Note that uncoloured entries are not actually present in the array A but only serve to indicate the current Lyndon grouping. The Lyndon prefixes are shown at the top for clarity.

Fig. 7. Refining a Lyndon grouping for S = decedacebceece$ (see Figure 4) into the suffix array using Algorithm 4. Already processed elements are coloured light gray. Not yet processed but marked entries are coloured blue, while inserted but unmarked and unprocessed elements are coloured green. Note that the uncoloured entries are not actually present in the array A but only serve to indicate the current Lyndon grouping. The Lyndon prefixes are shown at the top for clarity.
the pss-tree (otherwise nss[i] = i + 1 and the claim is trivial). For the last child c_k, we have nss[c_k] = nss[i] by Lemma 3.18. Let j ∈ [1 .. k) and assume nss[c_j]

Fig. 8. Refining the initial suffix grouping for S = decedacebceece$ (see Figure 4) into the Lyndon grouping. Elements in Lyndon groups are marked light gray or green, depending on whether they have been processed already. Note that the applied procedure does not entirely correspond to our algorithm for Phase I; it only serves to illustrate the general sorting principle. The Lyndon prefixes are shown at the top for clarity.

Fig. 9. Shown is the memory layout during the bucket sort that is applied during the processing of a Lyndon group. The data in grey areas is irrelevant. p_1 < ... < p_m are the elements in A \ A_1 and k_i = key(p_i).
Since A[g_s .. g_e] is sorted by increasing index, children of the same parent are in a contiguous part of A[g_s .. g_e]. Hence, we determine A and the keys within one scan over A[g_s .. g_e]. Since in practice most elements have no sibling in the same Lyndon group, we treat those explicitly. Specifically, we move F_1 to A[g_s .. g_s + |F_1|) and N_1 to B(|G| − |N_1| .. |G|]. Parents with keys larger than two are written with their keys interspersed to B[1 .. 2(|A| − |A_1|)]. Interspersing the keys is done to improve the cache-locality and thus performance.

|β|, the elements in A must be moved to a new Lyndon group following G \ A. For each element s in A, we decrement A[I[s]] (i.e., the size of the group currently containing s) and write s to A[I[s] + A[I[s]]]. Afterwards, the new group start must be set (iff G is now not empty) to I[s] + A[I[s]] (the start of the old group plus the remaining size of the old group). To determine whether G is now not empty, we mark inserted elements in A using the MSB. If A[I[s]] has the MSB set, then we do not need to change the group pointer I[s].
In Phase II, we need to refine the Lyndon grouping obtained in Phase I into the suffix array. Let G be a Lyndon group with context α and let i, j ∈ G. Since S_i = αS_{i+|α|} and S_j = αS_{j+|α|}, we have S_i <_lex S_j if and only if S_{i+|α|} <_lex S_{j+|α|}. Hence, to find the lexicographically smallest suffix in G, it suffices to find the lexicographically smallest suffix p in {i + |α| : i ∈ G}. Note that removing p − |α| from G and inserting it into a new group immediately preceding G yields a valid Lyndon grouping. We can repeat this process until each element in G is in its own singleton group. As G is Lyndon, we have S_{k+|α|} <_lex S_k for each k ∈ G by Lemma 2.4. Therefore, if all groups lower than G are singletons, then p can be determined by a simple scan over G (by determining which member of {i + |α| : i ∈ G} is in the lowest group). Consider, for instance, G_4 = ⟨3, 4, |ce|⟩ containing 6 and 12 from Figure 4.
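A minimal sketch of this refinement scan, with hypothetical helper names (groupOf is assumed to map a suffix index to the rank of its current group, and G is assumed to occupy A[gs .. ge]):

    #include <utility>
    #include <vector>

    // Find the element of G whose successor suffix i + alpha lies in the
    // lowest group and split it off into a new singleton group
    // immediately preceding G (G then shrinks to A[gs+1 .. ge]).
    long splitSmallest(std::vector<long>& A, long gs, long ge, long alpha,
                       const std::vector<long>& groupOf) {
        long best = gs;
        for (long t = gs + 1; t <= ge; ++t)      // simple scan over G
            if (groupOf[A[t] + alpha] < groupOf[A[best] + alpha])
                best = t;
        std::swap(A[gs], A[best]);               // A[gs] is the new singleton
        return gs;
    }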

Table 1. For Each File, the Number of Characters n, the Size of the Alphabet σ, the Compressibility Ratio r/n, the Number of Lyndon Factors, and the Number of Groups in the Lyndon Grouping Are Given. The compressibility is measured as the number of runs r in the BWT (or eBWT in the case of string collections) divided by n (i.e., a larger number indicates worse compressibility).

Table 2. See Table 1 for the Legend. Note that for the eBWT, the number of Lyndon factors refers to the canonicalised strings from the input and thus corresponds to the number of sequences, except for reads, which contains a few non-primitive strings. For our eBWT-algorithm, this number has the same relevance as the number of Lyndon factors has for our SACA (cf. Section 3.6).

Table 3. Time for Constructing the Suffix Array (and Phases, if Applicable)