Parallel Longest Increasing Subsequence and van Emde Boas Trees

This paper studies parallel algorithms for the longest increasing subsequence (LIS) problem. Let n be the input size and k be the LIS length of the input. Sequentially, LIS is a simple problem that can be solved using dynamic programming (DP) in O(n log n) work. However, parallelizing LIS is a long-standing challenge. We are unaware of any parallel LIS algorithm that has optimal O(n log n) work and non-trivial parallelism (i.e., Õ(k) or o(n) span). This paper proposes a parallel LIS algorithm that costs O(n log k) work, Õ(k) span, and O(n) space, and is much simpler than previous parallel LIS algorithms. We also generalize the algorithm to a weighted version of LIS, which maximizes the weighted sum of the objects in an increasing subsequence. To achieve a better work bound for the weighted LIS algorithm, we design parallel algorithms for the van Emde Boas (vEB) tree, which have the same structure as the sequential vEB tree, and support work-efficient parallel batch insertion, deletion, and range queries. We also implemented our parallel LIS algorithms. Our implementation is lightweight, efficient, and scalable. On input size 10^9, our LIS algorithm outperforms a highly-optimized sequential algorithm (with O(n log k) cost) on inputs with k ≤ 3 × 10^5. Our algorithm is also much faster than the best existing parallel implementation by Shen et al. (2022) on all input instances.


Introduction
This paper studies parallel algorithms for the classic and weighted longest increasing subsequence problems (LIS and WLIS, see definitions below). We propose a work-efficient parallel LIS algorithm with Õ(k) span, where k is the LIS length of the input. Our WLIS algorithm is based on a new data structure that parallelizes the famous van Emde Boas (vEB) tree [74]. Our new algorithms improve existing theoretical bounds for the parallel LIS and WLIS problems, and also enable simpler and more efficient implementations. Our parallel vEB tree supports work-efficient batch insertion, deletion, and range queries with polylogarithmic span.
Given a sequence A_{1..n} and a comparison function on the objects in A, the LIS of A is the longest subsequence (not necessarily contiguous) in A that is strictly increasing (based on the comparison function). In this paper, we use LIS to refer to both the longest increasing subsequence of a sequence and the problem of finding such an LIS. LIS is one of the most fundamental primitives and has extensive applications (e.g., [5,28,30,31,42,60,62,79]). In this paper, we use n to denote the input size and k to denote the LIS length of the input. LIS can be solved by dynamic programming (DP) using the following DP recurrence (more details in Sec. 2):

dp[i] = 1 + max{dp[j] : j < i, A_j < A_i},   (1)

where the maximum over an empty set is 0.
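A direct sequential transcription of this recurrence (a minimal reference sketch, not the parallel algorithm) takes O(n²) work:

```python
def lis_dp(A):
    """Compute dp[i] = 1 + max{dp[j] : j < i, A[j] < A[i]};
    the LIS length is the maximum dp value."""
    dp = [1] * len(A)
    for i in range(len(A)):
        for j in range(i):
            if A[j] < A[i] and dp[j] + 1 > dp[i]:
                dp[i] = dp[j] + 1
    return max(dp, default=0)
```

For example, `lis_dp([3, 1, 4, 1, 5, 9, 2, 6])` returns 4 (e.g., the subsequence 1, 4, 5, 9).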
Sequentially, LIS is a straightforward textbook problem [29,39]. We can iteratively compute dp[i] using a search structure to find max_{j<i, A_j<A_i} dp[j], which gives O(n log n) work. However, in parallel, LIS becomes challenging both in theory and in practice. In theory, we are unaware of parallel LIS algorithms with O(n log n) work and non-trivial parallelism (o(n) or Õ(k) span). In practice, we are unaware of parallel LIS implementations that outperform the sequential algorithm on general input distributions. We propose new LIS algorithms with improved work and span bounds in theory, which also lead to a more practical parallel LIS implementation.
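The efficient sequential approach can be sketched as follows: maintain, for each length r, the smallest possible tail of an increasing subsequence of length r. This array stays sorted and has at most k entries, so each object is placed with a binary search in O(log k), giving O(n log k) work. This is a standard sequential baseline, shown only for reference:

```python
from bisect import bisect_left

def lis_length(A):
    # tails[r] = smallest possible tail value of an increasing
    # subsequence of length r+1; tails stays sorted, so each object
    # is placed by binary search over at most k entries.
    tails = []
    for x in A:
        pos = bisect_left(tails, x)  # strict increase: replace first tail >= x
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)
```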
Our work follows recent research [13-15, 17-19, 34, 41, 44, 61, 64, 65] that directly parallelizes sequential iterative algorithms. Such algorithms are usually simple and practical, given their connections to sequential algorithms. To achieve parallelism in a "sequential" algorithm, the key is to identify the dependences [18,19,64,65] among the objects. In the DP recurrence of LIS, processing an object depends on the objects before it with smaller values, but does not need to wait for objects before it with larger or equal values.
An "ideal" parallel algorithm should process all objects in a proper order based on the dependences: it should 1) process as many objects as possible in parallel (as long as they do not depend on each other), and 2) process an object only when it is ready (all objects it depends on are finished) to avoid redundant work. More formally, we say an algorithm is round-efficient [64] if its span is Õ(D) for a computation with longest logical dependence length D. In LIS, the logical dependence length given by the DP recurrence is the LIS length k. We say an algorithm is work-efficient if its work is asymptotically the same as that of the best sequential algorithm. Work-efficiency is crucial in practice, since nowadays the number of processors on one machine (tens to hundreds) is much smaller than the problem size. A parallel algorithm is less practical if it significantly blows up the work of a sequential algorithm.
Our algorithm is based on the parallel LIS algorithm and the phase-parallel framework by Shen et al. [64]. We refer to it as the SWGS algorithm and review it in Sec. 2. The phase-parallel framework defines a rank for each input object as the length of the LIS ending at it (the dp value in Eq. (1)). Note that an object only depends on lower-rank objects. Hence, the phase-parallel LIS algorithm processes all objects in increasing order of rank. However, the SWGS algorithm takes O(n log³ n) work whp, Õ(k) span, and O(n log n) space, and is quite complicated. In experiments, the overhead in work and space limits the performance. On a 96-core machine with input size 10^8, SWGS becomes slower than a sequential algorithm when the LIS length k > 100.
Theorem 1.1 (LIS). Given a sequence A of size n and LIS length k, the longest increasing subsequence (LIS) of A can be computed in parallel with O(n log k) work, O(k log n) span, and O(n) space.
We also extend our algorithm to the weighted LIS (WLIS) problem, which has a DP recurrence similar to LIS but maximizes the weighted sum instead of the number of objects in an increasing subsequence:

dp[i] = w_i + max{dp[j] : j < i, A_j < A_i},

where w_i is the weight of the i-th input object and the maximum over an empty set is 0. We summarize our result in Thm. 1.2.
Theorem 1.2 (WLIS). Given a sequence A of size n and LIS length k, the weighted LIS of A can be computed using O(n log n log log n) work, O(k log² n) span, and O(n log n) space.
Our primary techniques for both LIS and WLIS rely on better data structures for 1D or 2D prefix min/max queries in the phase-parallel framework. For the LIS problem, our algorithm efficiently identifies all objects with a certain rank using a parallel tournament tree that supports 1D dynamic prefix-min queries, i.e., given an array of values, find the minimum value for each prefix of the array. For WLIS, we design efficient data structures for 2D dynamic "prefix-max" queries, which we refer to as dominant-max queries (see more details in Sec. 4). Given a set of 2D points associated with values, which we refer to as their scores, a dominant-max query returns the largest score to the bottom-left of a query point. Using dominant-max queries, given an object in WLIS, we can find the maximum dp value among all objects that it depends on. We propose two solutions, focusing on theoretical and practical efficiency, respectively. In practice, we use a parallel range tree similar to that in SWGS, which results in O(n log² n) work and Õ(k) span for WLIS. In theory, we parallelize the van Emde Boas (vEB) tree [74] and integrate it into range trees to achieve a better work bound for WLIS.
The van Emde Boas (vEB) tree [74] is a famous data structure for priority queues and ordered sets on integer keys, and is introduced in many textbooks (e.g., [27]). To the best of our knowledge, our algorithm is the first parallel version of vEB trees. We believe our algorithm is of independent interest beyond the application to WLIS. We note that it is highly non-trivial to redesign and parallelize vEB trees because the classic vEB tree interface and algorithms are inherently sequential. Our parallel vEB tree supports a general ordered set abstract data type on integer keys in [0, U) with the bounds stated below. We present more details in Sec. 5.

Figure 1: Outline and contributions of this paper.

Theorem 1.3 (Parallel vEB Tree). Let U be the universe of all integers in range [0, U). Given a set of integer keys from U, there exists a data structure that has the same organization as the sequential vEB tree, and supports:
• single-point insertion, deletion, lookup, reporting the minimum (maximum) key, and reporting the predecessor and successor of an element, all in O(log log U) work, using the same algorithms as sequential vEB trees;
• batch insertion, batch deletion, and range queries, work-efficiently and with polylogarithmic span.

Our LIS algorithm and the WLIS algorithm based on range trees are simple to program, and we expect them to be the algorithms of choice for implementations in the parallel setting. We tested our algorithms on a 96-core machine. Our implementation is lightweight, efficient, and scalable. Our LIS algorithm outperforms SWGS in all tests, and is faster than highly-optimized sequential algorithms [50] for reasonable LIS lengths (e.g., up to k = 3 × 10^5 for n = 10^9). To the best of our knowledge, this is the first parallel LIS implementation that can outperform the efficient sequential algorithm on a large input parameter space. On WLIS, our algorithm is up to 2.5× faster than SWGS and 7× faster than the sequential algorithm for small k values. We believe the performance is enabled by the simplicity and theoretical efficiency of our new algorithms.
We note that there exist parallel LIS algorithms [22,53] with better worst-case span bounds than our results in theory. We instead highlight the simplicity, practicality, and work-efficiency of our algorithms, as well as our contributions on parallel vEB trees and the extension to the WLIS problem. This paper makes contributions to both theory and practice, summarized as follows.
Theory: 1) Our LIS and WLIS algorithms improve the existing bounds. Our LIS algorithm is the first work- and space-efficient parallel algorithm with non-trivial parallelism (Õ(k) span). 2) We design the first parallel version of vEB trees, which supports work-efficient batch insertion, batch deletion, and range queries with polylogarithmic span.
Practice: Our LIS and WLIS algorithms are highly practical and simple to program. Our implementations outperform the state-of-the-art parallel implementation SWGS on all tests, due to better work and span bounds. We plan to release our code.
We use the work-span model in the classic multithreaded model with binary forking [6,14,21]. We assume a set of threads that share the memory. Each thread acts like a sequential RAM plus a fork instruction that forks two child threads running in parallel. When both child threads finish, the parent thread continues. A parallel-for is simulated by forking for a logarithmic number of steps. A computation can be viewed as a DAG (directed acyclic graph). The work W of a parallel algorithm is the total number of operations in this DAG, and the span (depth) D is the longest path in the DAG. An algorithm is work-efficient if its work is asymptotically the same as that of the best sequential algorithm. The randomized work-stealing scheduler can execute such a computation in W/P + O(D) time whp on P processor cores [6,21,40]. Our algorithms can also be analyzed on the PRAM and have the same work and span bounds.
Longest Increasing Subsequence (LIS). Given a sequence A_{1..n} and a comparison function on its objects, the LIS of A is the longest (not necessarily contiguous) subsequence of A that is strictly increasing. We use n for the input size and k for the LIS length of A.

Dependence Graph [18,19,64,65]. In a sequential iterative algorithm, we can analyze the logical dependences between iterations (objects) to achieve parallelism. Such dependences can be represented in a DAG, called a dependence graph (DG). In a DG, each vertex is an object in the algorithm. An edge from u to v means that v can be processed only when u has finished. We say v depends on u in this case. Fig. 2 illustrates the dependences in LIS.
We say an object is ready when all its predecessors have finished.

Figure 2: An input for LIS, with the dependences and ranks. An object depends on all objects before it that are smaller than it. The rank of an object is the length of the LIS ending at it, which is also its dp value.
When executing a DG with depth D, we say an algorithm is round-efficient if its span is Õ(D). In LIS, the dependence depth given by the DP recurrence is the LIS length k. We note that round-efficiency does not guarantee optimal span, since round-efficiency is defined with respect to a given DG. One can design a different algorithm with a shallower DG and get a better span.
Phase-Parallel Algorithms and the SWGS Algorithm [64]. The high-level idea of the phase-parallel algorithm is to assign each object i a rank, denoted rank(i), indicating the earliest phase in which the object can be processed. In LIS, the rank of each object is the length of the LIS ending with it (the dp value computed by Eq. (1)). We also define the rank of a sequence A as the LIS length of A. An object only depends on objects with lower ranks. The phase-parallel LIS algorithm [64] processes all objects with rank r (in parallel) in round r. We call the objects processed in round r the frontier of this round. An LIS example is given in Fig. 2.
The SWGS algorithm uses a wake-up scheme, where each object can be processed O(log n) times whp. It also uses a range tree to find the frontiers in both LIS and WLIS. In total, this gives O(n log³ n) work whp, O(k log² n) span, and O(n log n) space. Our algorithm is also based on the phase-parallel framework but avoids the wake-up scheme to achieve better bounds and performance.

Longest Increasing Subsequence
We start with the (unweighted) LIS problem. Our algorithm is also based on the phase-parallel framework [64] but uses a much simpler idea to make it work-efficient. The work overhead in the SWGS algorithm comes from two aspects: range queries on a range tree and the wake-up scheme. The O(log n) space overhead comes from the range tree. Therefore, we want to 1) use a more efficient (and simpler) data structure than the range tree to reduce both work and space, and 2) wake up and process an object only when it is ready, avoiding the wake-up scheme.
Our algorithm is based on a simple observation in Lemma 3.1 and the concept of prefix-min objects (Definition 3.1). Recall that the rank of an object A_i is exactly its dp value, which is the length of the LIS ending at A_i.

Definition 3.1 (Prefix-min Objects). Given a sequence A_{1..n}, we say A_i is a prefix-min object if for all j < i, we have A_i ≤ A_j, i.e., A_i is (one of) the smallest objects among A_{1..i}.

Lemma 3.1. In a sequence A, an object A_i has rank 1 iff A_i is a prefix-min object. An object A_i has rank r iff A_i is a prefix-min object after removing all objects with ranks smaller than r.
We use Fig. 3 to illustrate the intuition of Lemma 3.1, and prove it in Appendix B.1. Based on Lemma 3.1, we can design an efficient yet simple phase-parallel algorithm for LIS (Alg. 1). For simplicity, we first focus on computing the dp values (ranks) of all input objects. We show how to output a specific LIS for the input sequence in Appendix A. The main loop of Alg. 1 is in Lines 6-8. In round r, we identify the frontier F_r as all the prefix-min objects and set their dp values to r. We then remove the objects in F_r and repeat. Fig. 3 illustrates Alg. 1 by showing the "prefix-min" value pre_i for each object, which is the smallest value up to that object. Note that the sequence pre_i is not maintained in our algorithm; it is used only for illustration. In each round, we find and remove all objects A_i with A_i = pre_i. We then update the prefix-min values pre_i and repeat. In round r, all identified prefix-min objects have rank r.
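The round structure above can be illustrated with a short sequential simulation (no tournament tree yet: each round rescans the whole sequence, so this sketch is not work-efficient):

```python
import math

def phase_parallel_ranks(A):
    """In round r, every current prefix-min object gets rank r
    and is removed (its value is set to +inf)."""
    A = list(A)
    rank = [0] * len(A)
    r = 0
    while any(not math.isinf(x) for x in A):
        r += 1
        running_min = math.inf
        frontier = []
        for i, x in enumerate(A):
            # prefix-min: no earlier (surviving) object is smaller
            if not math.isinf(x) and x <= running_min:
                frontier.append(i)
            running_min = min(running_min, x)
        for i in frontier:
            rank[i] = r
            A[i] = math.inf
    return rank
```

For example, `phase_parallel_ranks([5, 3, 6, 2, 4])` returns `[1, 1, 2, 1, 2]`, matching the dp values of Eq. (1), and the number of rounds equals the LIS length.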
To achieve work-efficiency, we cannot re-compute the prefix-min values of the entire sequence after each round. Our approach is to design a parallel tournament tree to help identify the frontiers. Next, we briefly overview the tournament tree and then describe how to use it to find the prefix-min objects efficiently.
Tournament tree. A tournament tree T on n records is a complete binary tree with 2n − 1 nodes (see Fig. 4). The records are stored in the leaves, and each internal node stores the minimum value in its subtree. A tournament tree can be constructed by recursively constructing the left and right subtrees in parallel, and then updating the root value.

Using the Tournament Tree for LIS. We use a tournament tree T to efficiently identify the frontier and dynamically remove objects (see Alg. 1). T stores all input objects in the leaves. We always round up the number of leaves to a power of 2 to make it a full binary tree. Each internal node stores the minimum value in its subtree. When we traverse the tree at node T[i], if the smallest object to its left is smaller than T[i], we can skip the entire subtree. Using the internal nodes, we can maintain the minimum value before any subtree and skip irrelevant subtrees to save work.
In particular, the function ProcessFrontier finds all prefix-min objects in T by calling PrefixMin starting at the root. PrefixMin(i, LMin) traverses the subtree at node i, and finds all leaves x in this subtree s.t. 1) x is no more than any leaf before x in this subtree, and 2) x is no more than LMin. The argument LMin records the smallest value in T before the subtree at T[i]. If the smallest value in subtree T[i] is larger than LMin, we can skip the entire subtree (Line 13), because no object in this subtree can be a prefix-min object (they are all larger than LMin). Otherwise, there are two cases. The first case is when T[i] is a leaf (Lines 14-16). Since T[i] ≤ LMin, it must be a prefix-min object. Therefore, we set its dp value to the current round number r (Line 15) and remove it by setting its value to +∞ (Line 16). In the second case, when T[i] is an internal node (Lines 17-21), we recurse on both subtrees in parallel to find the desired objects (Line 18). For the left subtree, we directly use the current LMin value. For the right subtree, we need to further consider the minimum value in the left subtree. Therefore, we take the minimum of the current LMin and the smallest value in the left subtree (T[2i]), and pass it as the LMin value of the right recursive call. After the recursive calls return, we update T[i] (Line 21) because some values in the subtree may have been removed (set to +∞). We present an example in Fig. 4, which illustrates finding the first frontier for the input in Fig. 3.
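A sequential sketch of the tournament tree and the PrefixMin traversal follows (the two recursive calls would run in parallel in the actual algorithm; note that the right call must use the left subtree's minimum as it was before this round's removals):

```python
import math

class TournamentTree:
    def __init__(self, values):
        n = 1
        while n < len(values):
            n *= 2                         # round leaves up to a power of 2
        self.n = n
        self.t = [math.inf] * (2 * n)
        for i, v in enumerate(values):
            self.t[n + i] = v
        for i in range(n - 1, 0, -1):      # internal node = min of children
            self.t[i] = min(self.t[2 * i], self.t[2 * i + 1])

    def process_frontier(self, rnd, dp):
        self._prefix_min(1, math.inf, rnd, dp)

    def _prefix_min(self, u, lmin, rnd, dp):
        if math.isinf(self.t[u]) or self.t[u] > lmin:
            return                         # no prefix-min object in here
        if u >= self.n:                    # leaf: a prefix-min object
            dp[u - self.n] = rnd
            self.t[u] = math.inf           # remove it
            return
        left_min = self.t[2 * u]           # left min BEFORE this round's removals
        self._prefix_min(2 * u, lmin, rnd, dp)
        self._prefix_min(2 * u + 1, min(lmin, left_min), rnd, dp)
        self.t[u] = min(self.t[2 * u], self.t[2 * u + 1])

def lis_ranks(A):
    T = TournamentTree(A)
    dp = [0] * len(A)
    r = 0
    while not math.isinf(T.t[1]):          # until all objects are removed
        r += 1
        T.process_frontier(r, dp)
    return dp
```

Running `lis_ranks([5, 3, 6, 2, 4])` yields `[1, 1, 2, 1, 2]`, the same ranks as the round-by-round simulation.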
Figure 4: An example of finding a frontier with the tournament tree, illustrated on the input of Fig. 3. The algorithm recursively traverses the tree from the root and maintains an LMin value for each subtree as the smallest value before this subtree. If LMin is smaller than the value at the subtree root, we skip the subtree. For example, the smallest value before the green node 39 is LMin = 10. Therefore, no leaves in this subtree can be a prefix-min object, and the subtree is skipped.

We now prove the cost of Alg. 1. We call a tree node relevant if the algorithm recurses into it (Line 17), and skipped if it is pruned by Line 13. Executing Line 13 for a subtree x means that x's parent executed Line 17, so x's parent is relevant. This indicates that a node is visited either because it is relevant or because its parent is relevant. Since every node has at most two children, the number of visited nodes is asymptotically the same as the number of relevant nodes. In the round with frontier F_r, the number of relevant nodes is O(|F_r| log(n/|F_r|)). Hence, the total number of visited nodes is O(Σ_r |F_r| log(n/|F_r|)) = O(n log k). The last step uses the concavity of the function f(x) = x log₂(1 + n/x). This proves the work bound of the algorithm. □

Note that the work bound of Thm. 3.2 is parameterized by the LIS length k. For small k, the work can be o(n log n). For example, if the input sequence is strictly decreasing, Alg. 1 only needs O(n) work, because the algorithm finds all objects in the first round in O(n) work and finishes.

Weighted Longest Increasing Subsequence
A nice property of the unweighted LIS problem is that an object's dp value is the same as its rank. In round r, we simply set the dp values of all objects in the frontier to r. This is not true for the weighted LIS (WLIS) problem, so we need additional techniques to handle the weights. Inspired by SWGS, our WLIS algorithm is built on an efficient data structure R supporting 2D dominant-max queries: for a set of 2D points (x_i, y_i), each with a score s_i, the dominant-max query (x, y) asks for the maximum score among all points in its lower-left corner (−∞, x) × (−∞, y). We illustrate a dominant-max query in Fig. 9 in the appendix. We will use such a data structure to efficiently compute the dp values of all objects.
We present our WLIS algorithm in Alg. 2. We view each object as a 2D point (A_i, i) with score dp[i], and use a data structure R that supports dominant-max queries to maintain all such points. Initially dp[i] = 0. We call A_i and i the x- and y-coordinates of the point, respectively. Given the input sequence, we first call Alg. 1 to compute the rank of each object and sort the objects by rank to find each frontier F_r. Since the ranks are integers in [1, k], this can be done by parallel sorting with O(n) work and O(log² n) span. We then process all frontiers in order. When processing F_r, we compute the dp values for all i ∈ F_r in parallel, using a dominant-max query to find the maximum score (dp value) among all objects in the lower-left corner of object i and adding the weight w_i. Finally, we update the newly computed dp values in R (Line 18) as the points' scores.
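The frontier-by-frontier structure of Alg. 2 can be sketched sequentially, with a brute-force scan standing in for the dominant-max structure R (the O(n) scan per query is illustrative only; within one frontier, the queries are independent and could run in parallel):

```python
def wlis(A, w):
    n = len(A)
    # ranks via the unweighted DP (Eq. (1))
    rank = [1] * n
    for i in range(n):
        for j in range(i):
            if A[j] < A[i]:
                rank[i] = max(rank[i], rank[j] + 1)
    points = []                 # finalized (x, y, score) triples in R
    dp = [0] * n
    for r in range(1, max(rank, default=0) + 1):
        frontier = [i for i in range(n) if rank[i] == r]
        for i in frontier:      # independent: parallel in the real algorithm
            # dominant-max: best score strictly below-left of (A[i], i)
            best = max((s for (x, y, s) in points if x < A[i] and y < i),
                       default=0)
            dp[i] = best + w[i]
        for i in frontier:      # batch-update R after the whole frontier
            points.append((A[i], i, dp[i]))
    return max(dp, default=0)
```

For example, `wlis([3, 1, 4, 2], [1, 10, 1, 5])` returns 15 (the subsequence 1, 2 with weights 10 and 5).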
The efficiency of this algorithm then relies on the data structure supporting dominant-max. We propose two approaches, aiming at practical and theoretical efficiency, respectively. The first is similar to SWGS and uses range trees, which leads to O(n log² n) work and Õ(k) span for the WLIS problem. By plugging in an existing range-tree implementation [68], we obtain a simple parallel WLIS implementation that significantly outperforms the existing implementation from SWGS. The details of the algorithm are in Sec. 4.1, and the performance comparison is in Sec. 6. We also propose a new data structure, called the Range-vEB tree, to enable a better work bound (O(n log n log log n) work) for WLIS. Our idea is to redesign the inner trees of the range tree as parallel vEB trees. We elaborate on our approach in Sec. 4.2 and Sec. 5.

Parallel WLIS based on Range Tree
We can use a parallel range tree [8,67] to answer dominant-max queries. A range tree [8] is a nested binary search tree (BST) where the outer tree is an index on the x-coordinates of the points. Each outer tree node maintains an inner tree storing the same set of points as its subtree but keyed on the y-coordinates (see Fig. 5). We let each inner tree node store the maximum score in its subtree, which enables efficient dominant-max queries. In particular, for the outer tree, we search (−∞, x_q) on the x-coordinates. This gives O(log n) relevant subtrees in this range (called the in-range subtrees), and O(log n) relevant nodes connecting them (called the connecting nodes). In Fig. 5, when x_q = 6.5, the in-range inner trees are the inner trees of points (2, 6) and (5, 1), since their entire subtrees fall into the range (−∞, 6.5). The connecting nodes are (4, 5) and (6, 4), as their x-coordinates are in the range, but only part of their subtrees are in the range. For each in-range subtree, we further search (−∞, y_q) in the inner tree to get the maximum score in this range, and consider it as a candidate for the maximum score. For each connecting node, we check whether its y-coordinate is in the range (−∞, y_q), and if so, consider it a candidate. Finally, we return the maximum score among the selected candidates (both from the in-range subtrees and from the connecting nodes). Using the range tree in [14,67,68], we have the following result for WLIS.

Theorem 4.1. Using a parallel range tree for the dominant-max queries, Alg. 2 computes the weighted LIS of an input sequence A in O(n log² n) work and O(k log² n) span, where n is the length of the input sequence A, and k is the LIS length of A.
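As a point of comparison, the standard sequential counterpart avoids an explicit 2D structure: sweeping the objects in index order makes the y-constraint implicit, so a Fenwick tree over x-ranks holding prefix maxima answers each dominant-max query in O(log n). This is a well-known sequential technique, not the paper's parallel range tree:

```python
from bisect import bisect_left

def wlis_fenwick(A, w):
    xs = sorted(set(A))
    m = len(xs)
    tree = [0] * (m + 1)        # Fenwick tree over x-ranks, prefix maxima

    def update(i, v):           # raise the max at x-rank i (1-based)
        while i <= m:
            tree[i] = max(tree[i], v)
            i += i & (-i)

    def query(i):               # max over x-ranks 1..i
        best = 0
        while i > 0:
            best = max(best, tree[i])
            i -= i & (-i)
        return best

    ans = 0
    for a, weight in zip(A, w):     # index order: y-constraint is implicit
        r = bisect_left(xs, a) + 1  # 1-based rank of a among distinct values
        dp = query(r - 1) + weight  # strictly smaller x-coordinates only
        update(r, dp)
        ans = max(ans, dp)
    return ans
```

This prefix-max trick works because scores at a given x-rank only grow; the Fenwick tree never needs to decrease a stored value.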

WLIS Using the Range-vEB Tree
We can achieve better bounds for WLIS using parallel van Emde Boas (vEB) trees.Unlike the solution based on parallel range trees, the vEB-tree-based solution is highly non-trivial.Given the sophistication, we describe our solution in two parts.This section shows how to solve parallel WLIS assuming we have a parallel vEB tree.Later in Sec. 5, we will show how to parallelize vEB trees.
We first outline our data structure at a high level. We refer to our data structure for the dominant-max query as the Range-vEB tree, which is inspired by the classic range tree mentioned in Sec. 4.1. The main difference is that the inner trees are replaced by Mono-vEB trees (defined below). Recall that in Alg. 2, the RangeStruct implements two functions, DominantMax and Update. We present the pseudocode of Range-vEB for these two functions in Alg. 3, assuming we have parallel functions on vEB trees.
Similar to range trees, our Range-vEB tree is a two-level nested structure, where the outer tree is indexed by x-coordinates, and the inner trees are indexed by y-coordinates. For an outer tree node v, we use P_v to denote the set of points in v's subtree and T_v the inner tree of v. Like a range tree, the inner tree T_v also corresponds to the set of points P_v, but it stores only the staircase of P_v (defined below). Since the y-coordinates are the indexes of the input, which are integers within [0, n), we can maintain this staircase in a vEB tree. Recall that the inner tree stores the y-coordinates as keys and uses the dp values as scores. For two input objects A_i and A_j in WLIS, we say A_i covers A_j if A_i comes before A_j and has a larger or equal dp value. For a set of points P, the staircase of P is the maximal subset P′ ⊆ P such that no point in P′ is covered by any point in P. If A_i covers A_j, no object will ever use the dp value at A_j, since A_i is always at least as good. Therefore, we ignore such A_j in the inner trees, and we refer to a vEB tree maintaining the staircase of a dataset as a Mono-vEB tree. In a Mono-vEB tree, with increasing key (y-coordinate), the score (dp value) must also be increasing. We show an illustration of the staircase in Fig. 10 in the appendix.
Due to monotonicity, the maximum dp value in a Mono-vEB tree over all points with y_i < y_q is exactly the score (dp value) of y_q's predecessor. Combining this idea with the dominant-max query in range trees, we obtain the DominantMax function in Alg. 3. We first search the range (−∞, x_q) in the outer tree on the x-coordinates and find all in-range (Mono-vEB) inner trees and connecting nodes. For each connecting node, we check whether its y-coordinate is in the queried range, and if so, take its dp value into consideration. For each in-range inner tree, we find the maximum score up to coordinate y_q with a predecessor query: the returned value is the highest score in this inner tree among all points with an index smaller than y_q. Finally, we take the maximum of all such results (from both the in-range inner trees and the connecting nodes), which is the result of the dominant-max query (Line 8). Since the Pred function costs O(log log n), a single dominant-max query costs O(log n log log n) on a Range-vEB tree. Querying dominant-max using a staircase is a known (sequential) algorithmic trick. The challenge is how to update (in parallel) the newly computed dp values in each round (the Update function) in a Range-vEB tree. We first show how to implement Update in Alg. 3 while assuming a parallel vEB tree; we explain how to parallelize the vEB tree itself in Sec. 5.
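The staircase invariant and the predecessor-based query can be sketched with a sorted list standing in for one Mono-vEB inner tree (operations cost O(log n) or O(n) here instead of the vEB tree's O(log log n); the class and method names are illustrative):

```python
from bisect import bisect_left

class Staircase:
    """Keys are y-coordinates; scores (dp values) strictly increase
    along with the keys, as in a Mono-vEB tree."""
    def __init__(self):
        self.ys = []        # keys, sorted
        self.scores = []    # scores, increasing in lockstep with ys

    def dominant_max(self, y):
        """Max score among keys < y: the score of y's predecessor."""
        i = bisect_left(self.ys, y)
        return self.scores[i - 1] if i > 0 else 0

    def insert(self, y, score):
        i = bisect_left(self.ys, y)
        if i > 0 and self.scores[i - 1] >= score:
            return          # covered by an earlier point: not on the staircase
        # remove the later points this one covers (their scores are <= score)
        j = i
        while j < len(self.ys) and self.scores[j] <= score:
            j += 1
        self.ys[i:j] = [y]
        self.scores[i:j] = [score]
```

Inserting (5, 3), then (2, 1), then (7, 2) drops the last point (it is covered by key 5), and a later insert (3, 4) evicts key 5; `dominant_max` then reads off the predecessor's score.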
Step 1. Collecting insertions for inner trees. Each point p ∈ F may need to be added to O(log n) inner trees, so we first obtain a list L_v of points to be inserted into each inner tree T_v. This can be done by first marking all points of F in the outer tree R, and then (in parallel) merging them bottom-up, so that each relevant inner tree collects the relevant points in F. When merging the lists, we keep them sorted by the y-coordinates, the same order as in the inner trees.
Step 2. Refining the lists. Because of the staircase property, we first refine each list L_v to remove points that are not on the staircase: a point L_v[j] should be removed if it is covered by an earlier point in the list.

Step 3. Updating the inner trees. Finally, for all involved subtrees, we apply the lists L_v to the inner trees T_v in parallel. Note that some points in L_v may cover (and thus replace) some existing points in T_v. We first use a function CoveredBy to find the set D of all points in T_v that are covered by some point in L_v. An illustration of the CoveredBy function is presented in Fig. 11 in the appendix. We then use vEB batch-deletion to remove all points in D from T_v, and finally call vEB batch-insertion to insert all points in L_v into T_v.
We now analyze the total cost of Update. In one invocation of Update, we first find the insertion list for each inner tree T_v affected by F. Using the bottom-up merge-based algorithm mentioned in Sec. 4.2, each merge costs linear work. Similarly, refining a list L_v costs linear work. Since each key in F appears in O(log n) inner trees, the total work to find and refine all L_v is O(|F| log n) for each batch, and O(n log n) over the entire algorithm.
For each subtree, the cost of CoveredBy is asymptotically bounded by that of BatchDelete. For BatchDelete and BatchInsert, note that the bounds in Thm. 1.3 show that the amortized work to insert or delete a key is O(log log n). In each inner tree, a key can be inserted at most once and deleted at most once, which gives O(n log n log log n) total work over the entire algorithm.
Finally, the span of each round is O(log² n). In each round, we need to perform the three steps above. The first step finds the list of relevant subtrees for each element in the insertion batch F. For each element p ∈ F, this is done by first searching p in the outer tree and then merging bottom-up, so that each node in the outer tree collects all elements in F that belong to its subtree. There are O(log n) levels in the outer tree, and each merge requires O(log n) span, so this first step requires O(log² n) span.
Step 2 processes all relevant lists in parallel (at most n of them). For each list, it calls Pred on each element and runs a filter at the end. The total span is bounded by O(log² n).
Step 3 calls batch insertion and deletion to update all relevant inner trees, and all inner trees can be processed in parallel. Based on the analysis above, the span of each batch insertion and deletion is O(log n log log n), which is also bounded by O(log² n).
Thus, the entire algorithm has span O(k log² n). Finally, as mentioned, we present the details of achieving the stated space bound in Appendix E by relabeling the points in each inner tree. □

Table 1: Notation for vEB trees.
⟨h, l⟩ : The integer formed by concatenating high bits h and low bits l
V : A vEB (sub-)tree / the set of keys in this vEB tree
V.min (V.max) : The min (max) value in vEB tree V
Pred(V, x) : Find the predecessor of x in vEB tree V
Succ(V, x) : Find the successor of x in vEB tree V
V.summary : The set of high bits in vEB tree V
V.cluster[h] : The subtree of V with high bits h
(*) P_{V,B}(x) : The survival predecessor of x ∈ B in vEB tree V (used in Alg. 5). P(x) = max{y : y ∈ V \ B, y < x}
(*) S_{V,B}(x) : The survival successor of x ∈ B in vEB tree V (used in Alg. 5). S(x) = min{y : y ∈ V \ B, y > x}

Making the Range-vEB Tree Space-efficient. A straightforward implementation of the Range-vEB tree may require O(n²) space, as a plain vEB tree requires O(U) space. There are many ways to make vEB trees space-efficient (O(n) space when storing n keys); we discuss how they can be integrated into the Range-vEB tree to guarantee O(n log n) total space in Appendix E.

Parallel van Emde Boas Trees
The van Emde Boas (vEB) tree [74] is a famous data structure that implements the ADTs of priority queues and of ordered sets and maps for integer keys, and is introduced in textbooks (e.g., [27]). For integer keys in the range 0 to U, single-point updates and queries cost O(log log U), better than the O(log n) cost for BSTs or binary heaps. We review vEB trees in Sec. 5.1. However, unlike BSTs or binary heaps, which have many parallel versions [3,10,12,14,20,32,54,68,76,77,80], we are unaware of any parallel vEB trees. Even the sequential vEB tree is complicated (compared to most BSTs and heaps), and its invariants are maintained in a sophisticated way to guarantee doubly-logarithmic cost. Such complication adds to the difficulty of parallelizing updates (insertions and deletions) on vEB trees. Meanwhile, for queries, we note that vEB trees do not directly support range-related queries: when using vEB trees for ordered sets and maps, many applications heavily rely on repeatedly calling successor and/or predecessor, which is inherently sequential. Hence, we need to carefully redesign the vEB tree to achieve parallelism. In this section, we first review the sequential vEB tree and then present our parallel vEB tree supporting the functions needed in Alg. 3.

Review of the Sequential vEB Tree
A van Emde Boas (vEB) tree [74] is a search tree structure with keys from a universe U, which are integers from 0 to u − 1. We usually assume the keys are b-bit integers (i.e., u = 2^b). A classic vEB tree supports insertion, deletion, lookup, reporting the min/max key in the tree, and reporting the predecessor (Pred) and successor (Succ) of a key, all in O(log log u) work. Other queries can be implemented using these functions. For instance, reporting all keys in a range can be implemented by repeatedly calling Succ in O(m log log u) work, where m is the output size.
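For concreteness, a key's high- and low-bits can be computed with plain shifts and masks; the helper names below are ours, not from the paper:

```python
def split(x, bits):
    """Split a bits-bit key into its high ceil(bits/2) bits and
    low floor(bits/2) bits, as in the vEB decomposition."""
    low_bits = bits // 2
    return x >> low_bits, x & ((1 << low_bits) - 1)

def concat(high, low, bits):
    """Inverse of split: rebuild the key <high, low>."""
    low_bits = bits // 2
    return (high << low_bits) | low
```

For an 8-bit universe, the key 13 = 00001101 splits into high-bits 0 and low-bits 13, matching the running example in Fig. 6.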
A vEB tree stores a key using its binary bits as its index. We use V to denote a vEB tree, as well as the set of keys in this vEB tree. We present the notation for vEB trees in Tab. 1, and show an illustration of vEB trees in Fig. 6. We use 13 as an example to show how a key is decomposed and stored in the tree nodes. A vEB tree is a quadruple (summary, cluster[·], min, max). V.min and V.max store the minimum and maximum keys in the tree. When V is empty, we set V.min = +∞ and V.max = −∞. For the rest of the keys (other than the min/max), their high-bits (the first ⌈b/2⌉ bits) are maintained recursively in a vEB tree, denoted V.summary. In Fig. 6, the high-bits are the first 4 bits, and there are two different unique high-bits (0 and 1). They are maintained recursively in a vEB (sub-)tree V.summary (the blue box). For each unique high-bit, the relevant low-bits (the last ⌊b/2⌋ bits) are also organized recursively as a vEB (sub-)tree. In particular, the low-bits that belong to high-bit h are stored in a vEB tree V.cluster[h]. In Fig. 6, five keys in V have high-bit 0 (4, 8, 10, 13, and 15). They are maintained in a vEB (sub-)tree V.cluster[0] (the green box and everything below). Each subtree (summary and all cluster[·]) has universe size O(√u) (about b/2 bits). This guarantees traversal from the root to every leaf in O(log log u) hops. Note that the min/max values of a vEB tree are not stored again in the summary or the clusters. For example, in Fig. 6, at the root, V.min = 2, and thus 2 is not stored again in V.cluster[0]. This design is crucial to guarantee doubly-logarithmic work for Insert, Delete, Pred, and Succ.
Note that although we use "low/high-bits" in the descriptions, algorithms on vEB trees can use simple RAM operations to extract the corresponding bits, without any bit manipulation, as long as the universe size of each subtree is known. Due to the page limit, we refer the reader to the textbook [27] for more details about sequential vEB tree algorithms.

Our New Results
We summarize our results on the parallel vEB tree in Thm. 1.3. Both batch insertion/deletion and range reporting are work-efficient: the work is the same as performing the operations on a sequential vEB tree. In Alg. 2, the key range is u = n. Using the Range query, we can implement CoveredBy in Alg. 3 in O(k′ log log n) work and polylogarithmic span, where k′ is the number of objects returned.
Similar to the sequential vEB tree, batch insertion is relatively straightforward among the parallel operations. We present the algorithm and analysis in Sec. 5.2.1. Batch deletion is more challenging: once V.min or V.max is deleted, we need to replace it with a proper key x′ stored in the subtrees of V. However, when finding the replacement x′, we need to avoid the values in the deletion batch B, and take extra care to handle the case when x′ is the min/max of a cluster. We propose a novel technique, the survivor mapping (see Definition 5.1), to resolve this challenge. The batch-deletion algorithm is described in Sec. 5.2.2, and its analysis in Appendix B.3. For range queries, we need to avoid the iterative solution (repeatedly calling Succ) since it is inherently sequential. Our high-level idea is to divide and conquer in parallel, using delicate amortization techniques to bound the extra work. Due to the page limit, we summarize the high-level ideas of Range and CoveredBy in Sec. 5.2.3 and provide the details in Appendices C and D.

5.2.1 Batch Insertion. We show our batch-insertion algorithm in Alg. 4, which inserts a sorted batch B ⊆ U into V in parallel. Here we assume the keys in B are not in V; otherwise, we can simply look up the keys and filter out those already in V. To achieve parallelism, we need to handle the high-bits and low-bits appropriately, both in parallel, and take extra care to maintain the min/max values. We first set the min/max values at V (Lines 2-5). If B.min < V.min, we update V.min by swapping it with B.min (Line 3); similarly, we update V.max (Line 4). Since we need the batch B to stay sorted when adding the old V.min and/or V.max back to B (Line 5), we need to insert them at the correct positions, costing O(m) work. If B is not empty, we insert the keys in B into V.summary and V.cluster. We first find the new high-bits (those not yet in V.summary) among the keys in B, and denote them by H (Line 7). This step can be done by a parallel filter. For each new high-bit h ∈ H, we select the smallest key with high-bit h and put them in an array B′ (Line 8). The new subtrees V.cluster[h] are initialized by inserting the smallest low-bits {low(x) | x ∈ B′, high(x) = h} in parallel (Lines 9-11), after which all subtrees V.cluster[h] for h ∈ H are guaranteed to be non-empty. The remaining new low-bits L[h] are gathered by their corresponding high-bits h ∈ H (Line 12). Finally, the new high-bits H and each batch of new low-bits L[h], h ∈ H, are inserted into the tree recursively in parallel (Lines 13-16).
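The per-node grouping work of batch insertion (finding the distinct high-bits and gathering the low-bits per cluster) can be sketched sequentially as follows; `group_by_high_bits` is a hypothetical helper name, and the real algorithm performs these steps with parallel filters since the sorted batch makes each group a contiguous run:

```python
def group_by_high_bits(batch, bits):
    """Group a sorted batch of bits-bit keys by high-bit.
    Returns (sorted list of distinct high-bits, dict high-bit -> sorted low-bits).
    Because batch is sorted, each group is a contiguous run, which is what
    enables the parallel filter/pack in the actual algorithm."""
    low_bits = bits // 2
    lows = {}
    for x in batch:
        h, l = x >> low_bits, x & ((1 << low_bits) - 1)
        lows.setdefault(h, []).append(l)
    return sorted(lows), lows
```

The distinct high-bits go into the summary recursion; each per-high-bit list of low-bits goes into the corresponding cluster recursion.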
The correctness of the algorithm can be shown by checking that all min/max values of each node are set correctly. Next, we analyze the cost bounds of Alg. 4 in Thm. 5.1.

Theorem 5.1. Inserting a sorted batch of keys into a vEB tree can be done in O(m log log u) work and O(log u) span, where m is the batch size and u = |U| is the universe size.
Proof. Let W(m, u) and S(m, u) be the work and span of BatchInsert on a batch of size m and a vEB tree with universe size u. In each invocation of BatchInsert, we need to restore the min/max values, find the new high-bits H, initialize the clusters for the new high-bits, and gather the low-bits for each cluster. All these operations cost O(m) work and O(log m) = O(log u) span. The algorithm then makes at most √u + 1 recursive calls, each on a universe of size √u. Hence, we have the following recurrences for work and span:

W(m, u) ≤ Σ_i W(m_i, √u) + O(m),    S(m, u) ≤ max_i S(m_i, √u) + O(log m).

Note that each key in B falls into at most one of the cluster recursions, and the summary receives at most m new high-bits, and thus Σ_i m_i ≤ 2m. Since m_i is the number of distinct values to be inserted into a subtree with universe size √u, we also have m ≤ u in all recursive calls. Solving the recurrences above gives the claimed bounds; we solve them in Appendix B.2. Note that we assumed an even number of total bits for u. If not, the number of subproblems and their sizes become √(u/2) + 1 and √(2u), respectively. One can check that the bounds still hold, the same as in the sequential analysis. □

5.2.2 Batch Deletion. The function BatchDelete(V, B) deletes a batch of sorted keys B ⊆ U from a vEB tree V. Let m = |B| be the batch size. For simplicity, we assume B ⊆ V. If not, we can first look up all keys in B and filter out those that are not in V in O(m log log u) work and O(log m + log log u) span. We show our algorithm in Alg. 5. The main challenge in performing the m deletions in parallel is to properly set the min and max values of each subtree.
When the min/max value of a subtree is in B, we need to replace it with another key in its subtree that 1) does not appear in B, and 2) needs to be further deleted from the corresponding cluster (recall that the min/max values of a subtree are not stored again in its children). To resolve this challenge, we keep the survival predecessor and survival successor for all x ∈ B with respect to a vEB tree, defined as follows.
Definition 5.1 (Survivor Mapping). Given a vEB tree V and a batch B ⊆ V, the survival predecessor P(x) for x ∈ B is the maximum key in V \ B that is smaller than x. If no such key exists, P(x) := −∞. Similarly, the survival successor S(x) for x ∈ B is the minimum key in V \ B that is larger than x, and is +∞ if no such key exists. ⟨P, S⟩ are called the survival mappings.
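A direct sequential way to compute the survival mappings of Definition 5.1, assuming both the key set and the batch are given as sorted lists; the function and variable names are ours:

```python
import bisect

def survivor_mappings(tree_keys, batch):
    """For each x in batch (batch is a subset of tree_keys), compute the
    survival predecessor P[x] = max{y in tree_keys \\ batch : y < x} and the
    survival successor  S[x] = min{y in tree_keys \\ batch : y > x},
    with -inf / +inf when no such key exists."""
    survivors = sorted(set(tree_keys) - set(batch))
    P, S = {}, {}
    for x in batch:
        i = bisect.bisect_left(survivors, x)
        P[x] = survivors[i - 1] if i > 0 else float('-inf')
        j = bisect.bisect_right(survivors, x)
        S[x] = survivors[j] if j < len(survivors) else float('inf')
    return P, S
```

The parallel algorithm computes the same mappings with batched predecessor/successor queries rather than this per-element loop.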
P(·) and S(·) are used to efficiently identify the new keys that replace deleted keys. For instance, if V.max ∈ B (in which case it must be B.max), we can update the value of V.max to P(B.max) directly.
Alg. 5 first initializes the survival mappings (Line 2) as follows. For each x ∈ B, we set (in parallel) P(x) to its predecessor in V if this predecessor is not in B, and set P(x) = −∞ otherwise.

Algorithm 5: Batch Deletion Algorithm for vEB tree
Input: A vEB tree V and a batch of keys B ⊆ V in sorted order
Output: Update V by deleting all keys x ∈ B
1 Function BatchDelete(V, B)

Experiments
In addition to the new theoretical bounds, we also show the practicality of the proposed algorithms by implementing our LIS (Alg. 1) and WLIS (Alg. 2, using range trees) algorithms. Our code is lightweight. We use the experimental results to show how theoretical efficiency enables better practical performance over existing results. We plan to release our code.
Experimental Setup. We run all experiments on a 96-core (192-hyperthread) machine equipped with four-way Intel Xeon Gold 6252 CPUs and 1.5 TiB of main memory. Our implementation is in C++ with ParlayLib [11]. All reported numbers are averages of the last three runs among four repeated tests.
Input Generator. We run experiments with input sizes n = 10^8 and n = 10^9 and varying ranks (LIS length k). We use two generators, and refer to the results as the range pattern and the line pattern, respectively. The range pattern is a sequence of integers randomly chosen from a range [1, k′]. The value of k′ upper-bounds the LIS length. When k′ is large, the largest possible rank of a sequence of size n is expected to be 2√n [48]. To generate inputs with larger ranks, we use a line pattern generator that draws a_i as α · i + e_i for a sequence a_{1..n}, where e_i is an independent random variable chosen from a uniform distribution. We vary α and e_i to achieve different ranks. For the weighted LIS problem, we always use random weights from a uniform distribution.

Figure 7: Experimental results on LIS and WLIS. We vary the output size for each test. "Ours" = our LIS algorithm (Alg. 1) using 96 cores. "Ours (seq)" = our LIS algorithm (Alg. 1) using one core. "Ours-W" = our WLIS algorithm (Alg. 2) using 96 cores. "Seq-BS" = the sequential Seq-BS algorithm based on binary search. "Seq-AVL" = the sequential Seq-AVL algorithm based on the AVL tree. "SWGS" = the parallel algorithm SWGS from [64]. See more details in Sec. 6.

Figure 8: Experimental results of self-relative speedup. "Ours-Line" = our LIS algorithm (Alg. 1) using the line pattern generator. "Ours-Range" = our LIS algorithm (Alg. 1) using the range pattern generator. "Seq-BS-Line" = the Seq-BS algorithm using the line pattern generator. "Seq-BS-Range" = the Seq-BS algorithm using the range pattern generator. The data generators are described at the beginning of Sec. 6.
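The two generators can be sketched as follows; the parameter names `k_prime`, `alpha`, and `sigma` are our choices, since the paper only specifies the distributions informally:

```python
import random

def range_pattern(n, k_prime, seed=0):
    """Range pattern: n integers drawn uniformly from [1, k_prime];
    k_prime upper-bounds the LIS length."""
    rng = random.Random(seed)
    return [rng.randint(1, k_prime) for _ in range(n)]

def line_pattern(n, alpha, sigma, seed=0):
    """Line pattern: a_i = alpha * i + e_i with uniform noise e_i in
    [-sigma, sigma]; a steeper slope (or smaller noise) yields longer LIS."""
    rng = random.Random(seed)
    return [alpha * i + rng.uniform(-sigma, sigma) for i in range(1, n + 1)]
```

Varying `alpha` relative to `sigma` controls how close the sequence is to sorted, and hence the rank of the generated input.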
Baseline Algorithms. We compare with standard sequential LIS algorithms and the existing parallel LIS implementation SWGS [64]. We also show the running time of our algorithm on one core to indicate its total work. SWGS works on both the LIS and WLIS problems with O(n log³ n) work and Õ(k) span, and we compare both of our algorithms (Alg. 1 and Alg. 2) with it.
For the LIS problem, we also use a highly-optimized sequential algorithm from [50]. For WLIS, we implement a sequential algorithm and call it Seq-AVL. This algorithm maintains an augmented search tree, which stores all input objects ordered by their values and supports range-max queries. Iterating i from 1 to n, we simply query the maximum dp value in the tree among all objects with values less than a_i, and update dp[i]. We then insert a_i (with dp[i]) into the tree and continue to the next object. This algorithm takes O(n log n) work, and we implement it with an AVL tree.
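Seq-AVL's inner loop only needs a prefix-max structure over value ranks. The sketch below substitutes a Fenwick (binary indexed) tree for the AVL tree, which gives the same O(n log n) bound; this substitution and all names are ours:

```python
def wlis(a, w):
    """Weighted LIS DP: dp[i] = w[i] + max dp[j] over j < i with a[j] < a[i].
    A Fenwick tree over value ranks supports update-only prefix max."""
    ranks = {v: r for r, v in enumerate(sorted(set(a)), 1)}
    tree = [0] * (len(ranks) + 1)

    def update(i, val):               # raise the max at rank i
        while i < len(tree):
            tree[i] = max(tree[i], val)
            i += i & (-i)

    def query(i):                     # max over ranks 1..i
        best = 0
        while i > 0:
            best = max(best, tree[i])
            i -= i & (-i)
        return best

    dp = []
    for x, wx in zip(a, w):
        r = ranks[x]
        dp.append(wx + query(r - 1))  # best dp among strictly smaller values
        update(r, dp[-1])
    return dp
```

The maximum entry of the returned dp array is the weighted LIS value.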
Due to better work and span bounds, our algorithms are always faster than the existing parallel implementation SWGS. Our algorithms also outperform highly-optimized sequential algorithms up to reasonably large ranks (e.g., up to k = 3 × 10^5 for n = 10^9). In our tests on input sizes 10^8 and 10^9, our algorithm outperforms the sequential algorithm on ranks from 1 to larger than 2√n. We believe this is the first parallel LIS implementation that can outperform the efficient sequential algorithm in a large input parameter space.

Longest Increasing Subsequence (LIS). Fig. 7(a) shows the results on input size n = 10^8 with ranks from 1 to 10^7 using the line generator. For our algorithm and Seq-BS, the running time first increases as k grows because both algorithms have O(n log k) work. When k is sufficiently large, the running time drops slightly: larger ranks give better cache locality, as each object is likely to extend its LIS from a nearby object. Our parallel algorithm is faster than the sequential algorithm for k ≤ 3 × 10^4 and becomes slower afterward. The slowdown comes from the lack of parallelism (Õ(k) span). Our algorithm running on one core is only 1.4-5.5× slower than Seq-BS due to work-efficiency. With sufficient parallelism (e.g., on low-rank inputs), our performance is better than Seq-BS by up to 16.8×.
We only test SWGS on ranks up to 10^4 because it takes too long on larger ranks. In all tests, our algorithm is always faster than SWGS (up to 188×) because of better work and span bounds. We believe the simplicity of our code also contributes to the improvement.
We evaluate our algorithm on input size n = 10^9 with ranks varied from 1 to 10^8 using the line generator (see Fig. 7(b)) and from 1 to 6 × 10^4 using the range generator (see Fig. 7(c)). We exclude SWGS from this comparison due to its space-inefficiency: it ran out of memory constructing the range tree on 10^9 elements. For k ≤ 3 × 10^5, our algorithm is consistently faster than Seq-BS (up to 9.1×). When the rank is large, the work in each round is insufficient for good parallelism, and the algorithm behaves as if it runs sequentially. Because of work-efficiency, even with large ranks, our parallel algorithm introduces limited overhead, and its performance is comparable to Seq-BS (at most 3.4× slower). We also evaluate the self-relative speedup of our algorithm on input size n = 10^9 with ranks 10^2 and 10^4 using both the line and range generators. In all settings in Fig. 8, our algorithm scales well to 96 cores with hyperthreads, reaching self-speedups of up to 25.6× for k = 10^2 and up to 37.0× for k = 10^4. With the same rank, our algorithm has almost identical speedup for both patterns at all scales. Our algorithm outperforms Seq-BS (dashed lines in Fig. 8) when using 8 or 16 cores.
Overall, our LIS algorithm performs well on reasonable ranks, achieving up to 41× self-speedup with n = 10^8 and up to 70× self-speedup with n = 10^9. Due to work-efficiency, our algorithm is scalable and performs especially well on large data, because larger input sizes provide more work to better utilize parallelism.
Weighted LIS. We compare our WLIS algorithm (Alg. 2) with SWGS and Seq-AVL on input size n = 10^8. We vary the rank from 1 to 3000, and show the results in Fig. 7(d). Our algorithm is always faster than SWGS (up to 2.5×). Our improvement comes from a better work bound (by a factor of O(log n), although in many cases SWGS's work bound is not tight). Our algorithm also outperforms the sequential algorithm Seq-AVL for ranks up to 100. The running time of the sequential algorithm decreases with increasing rank k because of better locality. In contrast, our algorithm performs worse as k increases because of the larger span.
The results also imply the importance of work-efficiency in practice. To get better performance, we believe an interesting future direction is to design a work-efficient parallel algorithm for WLIS.
Many previous papers propose general frameworks that study dependencies in sequential iterative algorithms to achieve parallelism [13,14,18,64]. Their common idea is to (implicitly or explicitly) traverse the dependence graph (DG). There are two major approaches, and both have led to many efficient algorithms. The first is edge-centric [14, 15, 17-19, 34, 44, 49, 64], which identifies the ready objects by processing the successors of the newly-finished objects. The second approach is vertex-centric [13,61,64,65,72], which checks all unfinished objects in each round to process the ready ones. However, none of these frameworks directly enables work-efficiency for parallel LIS. The edge-centric algorithms evaluate all edges in the dependence graph, giving Θ(n²) worst-case work for LIS. The vertex-centric algorithms check the readiness of all remaining objects in each round and require k rounds, meaning Ω(nk) work for LIS. The SWGS algorithm [64] combines the ideas of edge-centric and vertex-centric algorithms. SWGS has O(n log³ n) work whp and is round-efficient (Õ(k) span) using O(n log n) space. It is sub-optimal in work and space. Our algorithm improves the work and space bounds of SWGS for both LIS and WLIS. Our algorithm is also simpler and performs much better than SWGS in practice.

Conclusion
In this paper, we present the first work-efficient parallel algorithm for the longest increasing subsequence (LIS) problem with non-trivial parallelism (Õ(k) span for an input sequence with LIS length k). Theoretical efficiency also enables a practical implementation with good performance. We also present algorithms for parallel vEB trees and show how to use them to improve the bounds for the weighted LIS problem. As the vEB tree is a widely-used data structure, we believe our parallel vEB tree is of independent interest, and we plan to explore other applications as future work. Other interesting future directions include achieving work-efficiency and good performance for WLIS in parallel, and designing a work-efficient parallel LIS algorithm with o(n) or even polylogarithmic span.

Figure 2: An input for LIS, with its dependences and ranks. An object depends on all objects that appear before it and are smaller than it. The rank of an object is the LIS length ending at it, which is also its dp value.
It can be represented implicitly as an array T[1..(2n − 1)]. The last n elements are the leaves, where T[j] stores the (j − n + 1)-th record in the dataset. The first n − 1 elements are internal nodes, each storing the minimum value of its two children. The left and right children of T[i] are T[2i] and T[2i + 1], respectively. We will use the following theorem about tournament trees.

Theorem 3.1 (Parallel Tournament Trees [14, 32]). A tournament tree can be constructed from n elements in O(n) work and O(log n) span. Given a set S of m leaves in a tournament tree of size n, the number of ancestors of all the nodes in S is O(m log(n/m)).
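A minimal sequential rendering of the implicit tournament tree (assuming n is a power of two; other sizes can be padded with +∞); the function name is ours:

```python
def build_tournament(a):
    """Implicit min-tournament tree T[1..2n-1]: the leaves T[n..2n-1] hold
    the input (T[n + j] = a[j] for 0-indexed a), and each internal node
    T[i] = min(T[2i], T[2i+1]). Index 0 is unused."""
    n = len(a)
    T = [None] + [0] * (2 * n - 1)
    T[n:2 * n] = a
    for i in range(n - 1, 0, -1):   # fill internal nodes bottom-up
        T[i] = min(T[2 * i], T[2 * i + 1])
    return T
```

The parallel construction of Thm. 3.1 fills each level of internal nodes in parallel, level by level, for O(n) work and O(log n) span.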

Figure 3: An illustration of Alg. 1. The figure also shows pre_i, the smallest object up to this object (inclusive). If a_i = pre_i, it is a prefix-min object. In round r, Alg. 1 finds all prefix-min objects, sets their DP values to r, removes them, and updates the pre_i values.

Figure 4: The parallel tournament tree for Alg. 1. The leaves store the input a_{1..n}. An internal node stores the minimum value in its subtree. The figure illustrates finding the first frontier for Fig. 3. The algorithm recursively traverses the tree from the root and maintains an LMin value for each subtree, the smallest value before this subtree. If LMin is smaller than the value at the subtree root, we skip the subtree. For example, the smallest value before the green node (39) is LMin = 10. Therefore, no leaf in this subtree can be a prefix-min object, so the subtree is skipped.

Figure 5: An illustration of a 2D range tree. The outer tree is indexed by x (blue) and the inner trees are indexed by y (red).

Figure 6: An illustration of a vEB tree (see Sec. 5.1).
Figure 8 panels: (a) LIS, k = 10^2; (b) LIS, k = 10^4. Axes: self-relative speedup vs. number of cores; curves: Ours-Line, Ours-Range, Seq-BS-Line, Seq-BS-Range.

Algorithm 1: The parallel (unweighted) LIS algorithm
Input: A sequence a_{1..n}
Output: All dp values (ranks) of a_{1..n}
1  int rank[1..n]  // rank[i]: the LIS length ending at a_i
4  Initialize the tournament tree T
5  Function LIS(sequence a_{1..n})
6    while T[1] ≠ +∞ do  // T is not empty
7      r ← r + 1
8      ProcessFrontier()  // process the r-th frontier
9    return rank[1..n]
10 Function ProcessFrontier()  // process all prefix-min objects
11   PrefixMin(1, +∞)
// Deal with the subtree rooted at T[i]: find objects x s.t. 1) x ≤ any object before it, and 2) x ≤ LMin; collect such objects in a binary tree
12 Function PrefixMin(int i, int LMin)
13   if T[i] > LMin then return NIL
14   if i ≥ n then  // found a leaf node in the frontier
15     rank[i − n + 1] ← r  // set its rank to r
16     T[i] ← +∞  // remove the object
17   else  // an internal node: process the two children in parallel
18     in parallel:
19       l ← PrefixMin(2i, LMin)

Input a_i:   52 31 45 26 61 10 39 44
Prefix-min:  52 31 31 26 26 10 10 10
(Round 1: the objects with a_i equal to the prefix-min form the frontier.)

We analyze Alg. 1 in Thm. 3.2.

Theorem 3.2. Alg. 1 computes the LIS of the input sequence A in O(n log k) work and O(k log n) span, where n is the length of the input sequence A, and k is the LIS length of A.

Proof. Constructing T takes O(n) work and O(log n) span. We then focus on the main loop (Lines 6-8) of the algorithm. The algorithm runs in k rounds. In each round, ProcessFrontier recurses for O(log n) steps. Hence, the algorithm has O(k log n) span. Next, we show that the work of ProcessFrontier in round r is O(k_r log(n/k_r)), where k_r = |F_r| is the number of prefix-min objects identified in this round. First, note that visiting a tournament tree node has constant cost, so the work is asymptotically the number of nodes visited by the algorithm. We say a node is relevant if at least one object in its subtree is in the frontier. Based on Thm. 3.1, there are O(k_r log(n/k_r)) relevant nodes. If Line 14 is executed (i.e., Line 13 does not return), the smallest object in this subtree is no more than LMin and must be a prefix-min object, and this node is relevant. Other nodes are also visited

This can be done by the dominant-max query on R (Line 16), which reports the highest

Algorithm 2: The parallel weighted LIS algorithm
Input: A sequence a_{1..n}; object a_i has weight w_i
Output: The DP values dp[1..n] for each object a_i
4  Point p[1..n]  // p[i] stores the point for a_i; its score is the DP value of a_i
5  int dp[1..n]  // dp[i]: the DP value of a_i
10 RangeStruct R  // any data structure that stores points ⟨x_i, y_i, dp_i⟩ with coordinate (x_i, y_i) and score dp_i, and supports:
   //  DominantMax(x, y): return the maximum score (the dp[·] value) among all points (x_j, y_j) with x_j < x and y_j < y
   //  Update(P), where P = {⟨x_i, y_i, dp_i⟩} is a batch of points: update the score of each point (x_i, y_i) to dp_i
11 Run Alg. 1; sort the rank array to obtain the k frontiers F_{1..k}, where F_r contains the indexes of all objects with rank r
− 1], or if any point in the Mono-vEB tree covers it. The latter case can be verified by finding the predecessor p of the point and checking whether p has a dp value larger than or equal to that of the point. If so, the point is covered by p in the Mono-vEB tree, and we ignore it. After this step, all remaining points need to appear on the staircase of the Mono-vEB tree.
In Sec. 5, we present the algorithms CoveredBy, BatchDelete, and BatchInsert needed by Alg. 2, and prove Thm. 1.3. Assuming Thm. 1.3, we give the proof of Thm. 1.2.

Proof of Thm. 1.2. We first analyze the work. We first show that the DominantMax algorithm in Alg. 3 takes O(log n log log n) work. In Alg. 3, Line 3 finds O(log n) connecting nodes and in-range inner trees, which takes O(log n) work. Then, for all O(log n) in-range inner trees, we perform a Pred query in parallel, each costing O(log log n). In total, this gives O(log n log log n) work for DominantMax. This means that the total work to compute the dp values in Line 16 of the entire Alg. 2 is O(n log n log log n).


2  Initialize survival mappings P and S with respect to B and V
23 BatchDeleteRecursive(V.summary, B′, P′, S′)
24 Function SurvivorRedirect(V, B, D, P, S) // redirect the survival mappings P and S concerning elements in batch B after the sequential deletion of D from vEB tree V

We call the binary-search-based sequential LIS algorithm from [50] Seq-BS. Seq-BS maintains an array D, where D[r] is the smallest value of a_i with rank r. Note that D is monotonically increasing. Iterating i from 1 to n, we binary search a_i in D, and if D[r] < a_i ≤ D[r + 1], we set rank[i] to r + 1. By the definition of D[·], we then update the value of D[r + 1] to a_i if a_i is smaller than the current value in D[r + 1]. The size of D is at most k, and thus this algorithm has O(n log k) work. This algorithm only works on the unweighted LIS problem.
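The Seq-BS loop can be sketched directly with binary search; `bisect_left` realizes the strict-increase comparison, and the function name is ours:

```python
import bisect

def seq_bs_lis(a):
    """Seq-BS: D[r] = smallest tail value over increasing subsequences of
    length r+1 (0-indexed). D stays sorted, so each a_i is placed by
    binary search, giving O(n log k) work for LIS length k."""
    D, rank = [], []
    for x in a:
        r = bisect.bisect_left(D, x)   # strict increase: x replaces D[r] >= x
        if r == len(D):
            D.append(x)                # x extends the longest subsequence
        else:
            D[r] = x                   # x is a smaller tail for length r+1
        rank.append(r + 1)             # LIS length ending at x
    return rank
```

The maximum of the returned ranks is the LIS length, and len(D) equals it as well.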