Parallel Integer Sort: Theory and Practice

Integer sorting is a fundamental problem in computer science. This paper studies parallel integer sort both in theory and in practice. In theory, we show tighter bounds for a class of existing practical integer sort algorithms, which provides a solid theoretical foundation for their widespread usage in practice and strong performance. In practice, we design a new integer sorting algorithm, DovetailSort, that is theoretically efficient and has good practical performance. In particular, DovetailSort overcomes a common challenge in existing parallel integer sorting algorithms, which is the difficulty of detecting and taking advantage of duplicate keys. The key insight in DovetailSort is to combine algorithmic ideas from both integer- and comparison-sorting algorithms. In our experiments, DovetailSort achieves competitive or better performance than existing state-of-the-art parallel integer and comparison sorting algorithms on various synthetic and real-world datasets.


Introduction
Sorting is one of the most widely-used primitives in algorithm design, and has been extensively studied. In many if not most cases, the keys to be sorted are fixed-length integers, and sorting integer keys is referred to as the integer sort problem. An integer sorting algorithm takes as input n records with integer keys in the range 0 to r − 1, and outputs the records with keys in non-decreasing order. Despite decades of effort in studying integer sorting algorithms, obtaining parallel integer sorting (IS) algorithms that are efficient both in theory and in practice has remained elusive.

Theoretical Challenges. As a special type of sorting, integer sorting algorithms can outperform comparison sorting algorithms by using the integer encoding of keys. This claim is verified in many existing studies [6,45] as well as in our experiments (see Fig. 1). As a result, in real-world applications (e.g., [21,53,56]), integer sort is usually preferred over comparison sort when the keys are integers. While it is not surprising that IS algorithms can outperform comparison sorts, we observe a significant gap in connecting their high performance with theory. Theoretical parallel IS algorithms with good bounds [3,4,8,33,43] are quite complicated, and we are unaware of any implementations of them. On the other hand, for the practical parallel IS solutions [6,10,45], we are unaware of "meaningful" analysis that explains their good performance for general r. The best-known bounds for them, as discussed in [45], are O(n log r) work (number of operations) and polylog(n) span (longest dependence chain). However, note that the main use case for integer sort is when r = Ω(n), since otherwise the simpler counting sort [47,55] can be used. In this case, the bounds for practical parallel IS algorithms are no better than those of comparison sorts (O(n log n) work and polylogarithmic span [10,14,15,26]). This leads to the following open question in theory: do the practical parallel IS algorithms indeed have lower
asymptotic costs than comparison sort (and if so, under what circumstances)?

Practical Challenges. Since integers are comparable, comparison sort can be considered as a baseline for sorting integers. Unfortunately, state-of-the-art (SOTA) parallel IS algorithms do not consistently outperform comparison-based sorting algorithms. One key reason is the inherent difficulty of dealing with duplicate keys in integer sorts. In principle, duplicate keys are beneficial for sorting algorithms. For example, samplesort can skip a recursive subproblem between two equal pivots; similarly, quicksort can separate keys equal to the pivot to avoid further processing them. Interestingly, such shortcuts do not apply to integer sort. Existing parallel IS implementations follow the most-significant-digit (MSD) framework that partitions all keys into buckets based on the integer encoding (i.e., the highest 8-12 bits), and recurses within each bucket. As such, equal keys cannot be detected until the last level of recursion. Although some techniques can be used to detect special distributions (e.g., all keys are the same), to the best of our knowledge, no existing parallel IS implementation can benefit from duplicate keys in a general (provable) and non-trivial manner. Fig. 1 and Tab. 3 show that on inputs with heavy duplicates, SOTA parallel IS algorithms can be slower than comparison sorts.

Our Contributions. In this paper, we study parallel integer sort to overcome the aforementioned challenges both in theory and in practice. In theory, we answer the open question by proving that a class of parallel IS algorithms, including DovetailSort proposed in this paper, have O(n·√(log r)) work for input size n and key range 0 to r − 1. This is asymptotically better
than comparison sorts for a large key range r = n^{o(log n)}. This explains why the existing parallel IS algorithms outperform comparison sorts in practice in many cases. We also show that we can achieve polylog(n) span whp for an unstable integer sort, and Õ(2^√(log r)) span for a class of practical stable parallel IS algorithms, such as our new DovetailSort and the integer sort in ParlayLib [10]. We also prove that DovetailSort has Θ(n) work for some special input distributions with heavy duplicates. In addition to the new bounds, our theoretical results (Sec. 4) also explain the choices of parameters in many (if not all) practical parallel IS implementations. We believe that our theoretical contribution is important both for analyzing our own algorithm and for understanding parallel integer sort as a general algorithmic problem.
In practice, we propose DovetailSort (DTSort), a parallel integer sorting algorithm that is efficient and stable. DTSort combines algorithmic ideas from both integer- and comparison-sorting algorithms. The goal is to detect and take advantage of heavily duplicated keys (heavy keys) as in samplesort (a commonly-used comparison sort), while preserving the low asymptotic work bound of integer sort. The key ideas are to 1) use sampling to identify heavy keys, separate them from the other keys (called light keys), and skip them in the recursive calls, and 2) exploit the bitwise encoding of the keys to split the light keys into subproblems and handle them recursively. Although the high-level idea of separating heavy and light keys is natural, an interesting challenge is that all keys need to be put in increasing order in the output, and thus the heavy and light keys have to be finally interleaved with each other. Therefore, the heavy-light separation idea requires dovetailing the keys at a low cost. To do this, DTSort first carefully puts the heavy keys close to their final destinations when they are identified. After the light keys have been sorted by the recursive calls, DTSort uses a dedicated algorithm to combine them into the final sorted order. We present more details in Sec. 3.

A[1..n] — Original input array with size n
[r] — The range of the integer keys 0, ..., r − 1
ℓ — Number of bits in a "digit" (sorted at each level)
λ = 2^ℓ — The "radix" size
θ — Base case size threshold
n′ — Current (recursive) problem size
t — Number of remaining "digits" to be sorted
Table 1. Notations and parameters in this paper.
Due to these algorithmic improvements, DTSort achieves consistently strong performance on various synthetic and real-world input instances. We evaluate DTSort against SOTA parallel sorting algorithms as baselines. We illustrate the results on 32-bit keys in Fig. 1, and present more in Sec. 6. Compared to the best baseline on 32-bit keys (PLIS), our algorithm is competitive (1.02× faster) with no or few duplicate keys, and up to 2.3× faster with more duplicates. The advantage of DTSort is larger on 64-bit keys (see Tab. 3). We also evaluate DTSort in several applications with real-world input data. Among the tens of input instances we tested, DTSort is the fastest in most cases. We note that sorting is one of the most well-studied problems, and it is almost impossible for one algorithm to be the best on all input instances; the goal is to design algorithms that are always close to the best in any use case. We believe DTSort achieves very good overall performance across a variety of input patterns. Our code is publicly available [22].

Preliminaries
2.1 Notations. We use O(f(n)) with high probability (whp) in n to mean O(c·f(n)) with probability at least 1 − n^{−c} for any c ≥ 1. Throughout the paper, we consider high-probability bounds in the input size n, and omit "in n" when the context is clear. We use [r] to refer to the integers in the range 0 to r − 1. Unless otherwise stated, logarithms are base 2.

2.2 Computational Models.
We use the work-span (or work-depth) model for fork-join parallelism with binary forking to analyze parallel algorithms [14,20]. We assume a set of threads that share a common memory. A process can fork two child software threads to work in parallel; when both children complete, the parent process continues. The work W of an algorithm is the total number of instructions, and the span (depth) S is the length of the longest sequence of dependent instructions in the computation. In practice, the computation can be executed by a randomized work-stealing scheduler in W/P + O(S) time whp on P processors [5,18,30].

2.3 Sorting and Integer Sorting.
Integer sort takes as input an array A of n records with a key function that maps each record to an integer key in the range [r]. The goal is to reorder the records to be in non-decreasing key order. As mentioned in Sec. 1, the main and most general use scenario of integer sort is when r = Ω(n), so we assume this in our theoretical analysis unless otherwise specified. Many existing integer sort algorithms view the integer keys as radix-λ (base-λ) numbers, and work on one "digit" per round. Usually each digit refers to ℓ consecutive bits of the key, and λ = 2^ℓ. These algorithms are also referred to as radix sorts. The classic sequential radix sort uses λ = Θ(n), starts from the least significant digit (LSD), and stably reorders all records on each digit until all bits are used up [20]. It takes O(n·⌈log r / log n⌉) work.
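The classic sequential LSD scheme can be sketched as follows (a minimal Python sketch for illustration, not the implementation evaluated in this paper; the function name and the default of 8 bits per digit are our choices):

```python
def lsd_radix_sort(keys, r, ell=8):
    # Stable least-significant-digit radix sort of integers in [0, r).
    # Each pass stably redistributes all records on one ell-bit "digit",
    # so after ceil(log2(r) / ell) passes the array is fully sorted.
    radix = 1 << ell
    bits = max(1, (r - 1).bit_length())
    passes = (bits + ell - 1) // ell
    for d in range(passes):
        shift = d * ell
        buckets = [[] for _ in range(radix)]
        for k in keys:  # stable distribution on the d-th digit
            buckets[(k >> shift) & (radix - 1)].append(k)
        keys = [k for b in buckets for k in b]
    return keys
```

Each pass is stable, which is what makes the low-to-high digit order correct.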
Unlike the sequential setting, existing parallel integer sort implementations start from the most significant digit (MSD). These algorithms (see Alg. 1) have two steps: 1) distributing all records into λ = 2^ℓ buckets by their MSD, and 2) recursing to sort each bucket in parallel. This top-down approach allows all subproblems to be solved independently in parallel, and is more I/O-friendly. The distribution step can be performed by a counting sort (see Sec. 2.4) using the MSD as the bucket id. Most practical parallel integer sorting algorithms [6,10,17,19,37,41,45,46,52,59,61] follow the framework in Alg. 1 and have good performance. However, whether a tight bound exists for these parallel MSD algorithms has remained open.
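The MSD framework of Alg. 1 can be sketched as follows (a sequential Python stand-in; in the real algorithms the distribution step and the per-bucket recursion run in parallel, and the names and default parameters here are ours):

```python
def msd_sort(keys, t, ell=2, theta=1):
    # MSD integer sort framework (sequential stand-in for Alg. 1).
    # t = number of ell-bit digits remaining; digit t-1 is the current
    # most significant one.  theta is the base-case size threshold.
    if t == 0 or len(keys) <= theta:
        return sorted(keys)  # base case: all bits used, or problem small
    radix = 1 << ell
    shift = (t - 1) * ell
    buckets = [[] for _ in range(radix)]
    for k in keys:  # step 1: distribute by the current MSD (counting sort)
        buckets[(k >> shift) & (radix - 1)].append(k)
    out = []
    for b in buckets:  # step 2: recurse on each bucket (in parallel)
        out.extend(msd_sort(b, t - 1, ell, theta))
    return out
```

Since the buckets are already ordered by the current digit, concatenating the recursively sorted buckets yields the sorted output.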
For simplicity, we assume the sorted result is stored in the input array ("in-place"). Here "in-place" refers only to the algorithm interface; our algorithm still uses O(n) extra space. A sorting algorithm is stable if the output preserves the relative order of equal keys from the input, and unstable otherwise. Stability is required in many sorting applications.

2.4 Counting Sort (aka. Distribution).
Counting sort is a commonly-used primitive in parallel sorting algorithms that distributes records into (a small number of) buckets. It takes as input n′ records with a function that maps each record to an integer bucket id in [λ′] (λ′ ≤ n′). The goal is to sort the records by their bucket ids. The classic stable counting sort [55], which is widely used in practice, has O(n′) work and O(λ′ + log n′) span. Due to page limits, we defer further discussion of this algorithm to Appendix B; the reader can treat counting sort as a black box.
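A sequential sketch of the stable counting sort primitive (the counting and prefix-sum phases are the parts that are parallelized in practice; the names here are ours):

```python
def counting_sort(records, bucket_id, num_buckets):
    # Stable counting sort: distributes records into num_buckets buckets
    # by bucket_id(record), preserving input order within each bucket.
    counts = [0] * num_buckets
    for rec in records:  # phase 1: count bucket sizes
        counts[bucket_id(rec)] += 1
    offsets = [0] * num_buckets  # phase 2: exclusive prefix sum
    for i in range(1, num_buckets):
        offsets[i] = offsets[i - 1] + counts[i - 1]
    out = [None] * len(records)
    for rec in records:  # phase 3: stable placement
        b = bucket_id(rec)
        out[offsets[b]] = rec
        offsets[b] += 1
    return out
```

Stability follows because records with the same bucket id are placed left to right in input order.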
There also exists a randomized algorithm for unstable counting sort on n′ = Ω(log² n) records with λ′ = n′·log^{O(1)} n′ buckets, using O(n′ + λ′) work and O(log n′) span whp [47]. However, this approach is complicated and less frequently used in practice (discussion also given in Appendix B).

2.5 Comparison Sort, Semisort, and Sampling.
Efficient parallel sorting is widely studied [9,34,36,48-50,54]. Comparison sort and semisort are two other types of sorting problems. Comparison sort takes as input n records and a comparison function "<" that orders two given records, and outputs the records in non-decreasing order. Semisort takes as input n records and an equality test "=" on two records, and reorders the records such that records with equal keys are adjacent in the output [23]. SOTA parallel algorithms for comparison sort [6,10] and semisort [23,32] all use sampling, which can detect heavy duplicate keys.

Algorithm 1: ParallelMSDSort(A[0..n − 1], t)
Input: A: input array with keys in [r]. t: number of digits remaining to be sorted; each digit refers to ℓ bits of the key. At the root level t = (log r)/ℓ. θ: base case threshold.
Notes: We count digits from low to high, so the t-th digit is the current most significant one.
1  if t = 0 then return  // all bits are sorted
2  if |A| ≤ θ then comparison-sort A and return  // base case
3  counting-sort A into λ = 2^ℓ buckets by the t-th digit  // distribution
4  parallel-for each bucket B do
5      ParallelMSDSort(B, t − 1)
The sampling scheme used in these papers was first proposed in [47]. Given a parameter λ and input size n, the scheme first selects Θ(λ log n) samples uniformly at random. It then sorts the samples and subsamples every (log n)-th element into a list S′. Based on Chernoff bounds, any key that appears more than once in S′ must have Ω(n/λ) occurrences in the input whp. In existing sorting/semisort algorithms [6,10,23,32], each such key is placed in its own bucket. Since all records in this bucket share the same key, no further sorting is needed for them. As such, these algorithms can be significantly faster when the input contains many duplicate keys. Our DTSort is inspired by this technique for detecting heavy duplicate keys. In Sec. 3, we discuss how to apply this sampling scheme to an integer sorting algorithm.
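This sampling scheme can be sketched as follows (a minimal Python sketch; the sample count λ·⌈log₂ n⌉, the fixed seed, and the helper names are our choices for illustration):

```python
import random
from collections import Counter

def detect_heavy_keys(keys, lam, seed=0):
    # Sampling scheme of [47] (sketch): draw Theta(lam * log n) uniform
    # samples, sort them, keep every (log n)-th sample as a subsample,
    # and declare heavy any key appearing at least twice among the
    # subsamples.  Whp such keys have Omega(n / lam) input occurrences.
    rng = random.Random(seed)
    n = len(keys)
    log_n = max(1, n.bit_length())  # ~ log2(n)
    samples = sorted(rng.choice(keys) for _ in range(lam * log_n))
    subsamples = samples[::log_n]
    return {k for k, c in Counter(subsamples).items() if c >= 2}
```

A key must occupy a long run in the sorted sample array to hit two subsample positions, which is what ties "detected heavy" to having many occurrences.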

The DovetailSort Algorithm
We now present DovetailSort (DTSort), which follows the MSD sort framework in Alg. 1. By careful design, DTSort handles heavy duplicate keys efficiently, and is stable.
To detect and handle duplicate keys, we use the sampling idea from existing comparison-based sorts. However, we need to carefully combine this approach with MSD sort to avoid blowing up the work bound to O(n log n). As mentioned, we want our algorithm to separate heavy and light keys, with the goal of avoiding sorting heavy keys at all recursive levels. The idea of identifying heavy/light buckets has been used in parallel semisort [23,32]. However, integrating it into a parallel MSD sort is highly non-trivial. Unlike in semisort, where heavy and light keys are completely independent and can be placed in any order in the output, all buckets in an integer sort must finally be placed in increasing order. We refer to this step as the "dovetail merging" step, and discuss an efficient solution both in theory and in practice in Sec. 3.4. We present DTSort in Alg. 2 and an illustration in Fig. 2. It consists of four steps, explained in the subsections below. It is an MSD sort using ℓ = √(log r) bits in each digit. We explain how this parameter enables a good work bound in Sec. 4. We use n as the original input size, and n′ as the current recursive subproblem size. Separating them is needed to rigorously analyze the algorithm and give the correct high-probability bounds. We use heavy/light bucket to mean a bucket for heavy/light keys, and heavy/light record to mean a record with a heavy/light key. We use "MSD" to refer to the MSD in the current subproblem (see the t-th digit in Alg. 2).

Figure 2. Illustration of DTSort. Step 1: take samples (boxed), detect heavy keys, and assign bucket ids (four light buckets for MSDs 00, 01, 10, 11, and three heavy buckets for keys 4, 6, 9). Step 2: distribute records to the corresponding buckets. Step 3: recursively integer-sort each light bucket on the next two bits. Step 4: merge heavy and light buckets within the same MSD zone.

3.1 Step 1: Sampling
As mentioned earlier, the first step uses sampling to determine the heavy keys, and assigns a bucket id to each heavy and light bucket. The theory behind this idea was developed in previous work on samplesort and semisort [32,47]. In this step, Θ(2^ℓ log n) samples are picked into an array S. Recall that ℓ is the number of bits sorted at each level. We then subsample every (log n)-th key in S, and output the keys with more than one subsample to S′. As discussed in Sec. 2.5, we define all keys in S′ as heavy keys; all other keys are light keys. According to the MSD (the current highest ℓ bits) of the keys, the key range is divided into λ = 2^ℓ subranges. We call each key range sharing the same MSD an MSD zone. In a regular integer sort, each MSD zone forms one bucket. In our case, all light keys within the same MSD zone share a bucket, but each heavy key gets its own bucket.
Unlike samplesort and semisort, which do not need to further handle heavy keys once they are recognized and distributed, in integer sort we must finally combine them with the light keys in the same MSD zone. Hence, we wish to keep the heavy keys close to their final destinations in the output from this very first step, in order to perform an efficient merge at the end of the algorithm. To do so, we set the bucket id of a heavy bucket to immediately follow the light bucket into which it will finally be integrated. Namely, for each MSD zone, there is exactly one light bucket followed by some (possibly zero) heavy buckets, which hold all the heavy keys whose MSD falls in this zone. An illustration is given in Fig. 2. A unique id is assigned to each bucket in serial order. Multiple heavy buckets in the same MSD zone are totally ordered by their full keys. In this way, the light bucket and heavy buckets in the same MSD zone are consecutive, which makes it easier to combine them in Step 4.
We explain this step in detail in lines 3-14 of Alg. 2. To assign the bucket ids, we first obtain an array H of all heavy keys in sorted order (line 6); each element is a pair (k, z), where k is the key itself and z is the MSD zone of k. We then use an array L (line 8) to maintain a pair (z, k) for each light bucket (similar to H), where z ∈ [2^ℓ] is the MSD of the light bucket and k = −1 is a dummy key marking it as a light bucket. By merging H and L, all buckets are re-organized in the desired order: all buckets are first ordered by their MSD; within the same MSD, the first bucket is always the light one, followed by all heavy buckets sorted by their keys. We can then assign ids to them serially.
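The bucket-id assignment of Step 1 can be sketched as follows (a simplified sequential Python sketch using dictionaries in place of the merged heavy/light arrays; the helper names are ours):

```python
def assign_bucket_ids(heavy_keys, ell, shift):
    # Bucket-id assignment of Step 1 (sketch): within each MSD zone the
    # light bucket comes first, followed by that zone's heavy buckets in
    # increasing key order, so a zone's buckets get consecutive ids.
    # `shift` extracts the current ell-bit MSD: msd(k) = (k >> shift) & mask.
    mask = (1 << ell) - 1
    by_zone = {}
    for k in sorted(heavy_keys):
        by_zone.setdefault((k >> shift) & mask, []).append(k)
    light_id, heavy_id, next_id = {}, {}, 0
    for z in range(1 << ell):
        light_id[z] = next_id  # exactly one light bucket per zone, first
        next_id += 1
        for k in by_zone.get(z, []):  # then this zone's heavy buckets
            heavy_id[k] = next_id
            next_id += 1
    return light_id, heavy_id

def get_bucket_id(k, ell, shift, light_id, heavy_id):
    # GetBucketID (sketch): heavy keys map to their own bucket; light
    # keys map to the light bucket of their MSD zone.
    if k in heavy_id:
        return heavy_id[k]
    return light_id[(k >> shift) & ((1 << ell) - 1)]
```

With ℓ = 2 on the topmost two bits and heavy keys {4, 6, 9} (the example of Fig. 2), zone 1 gets light bucket 1 followed by heavy buckets for keys 4 and 6, and zone 2 gets light bucket 4 followed by the heavy bucket for key 9.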
Figure 3. Illustration of the dovetail merging step. The example merges the buckets in MSD zone 01 in Fig. 2. We use a letter subscript to distinguish different records with the same key.
After assigning the bucket ids, we build a hash map T to maintain the bucket ids of the heavy keys, and a lookup array D to map each MSD to its associated light bucket id. As mentioned, the total number of buckets is small, so the cost to maintain them is small in both time and space. With T and D, we can determine the bucket id of any key k in constant time (the GetBucketID function in Alg. 2): we first look up whether k is in T; if so, k is a heavy key, and its bucket id is stored in T. Otherwise, k is a light key, and we look up its bucket id in D using the MSD of k.

3.2 Step 2: Distributing
After the first step, we know the bucket id of every key. In this step, we run the stable counting sort from Sec. 2.4 with the bucket ids as keys, similar to many existing parallel sorting algorithms [10,23]. After this step, all records are stably reordered into their corresponding buckets.

3.3 Step 3: Recursing
After Step 2, keys are sorted across MSD zones, but keys within the same MSD zone are not necessarily sorted. In this step, we sort each light bucket recursively.
As with a regular integer sort, each recursive call handles its subproblem using the same integer sort routine on the next "digit". In DTSort, since all records in a heavy bucket share the same key, no further sorting is needed for them. Hence, we only recursively sort the light buckets (line 16).
Note that by using a recursive approach, we can detect far more than 2^ℓ heavy keys in total across all levels of recursion. Namely, the heaviest keys are identified at the topmost level and bypass all recursive calls, while medium-frequency heavy keys are identified at middle levels and still bypass the remaining recursive calls. This balances finding more heavy keys in total (to save work) against using a limited number of heavy buckets in each subproblem (to avoid a large amount of extra space in the counting sort in Step 2; see the discussion in [23]).

3.4 Step 4: Dovetail Merging
After Step 2, all keys are sorted across MSD zones, and all heavy keys are ordered by their keys. Step 3 further sorts all light keys within each MSD zone. To compute the final output, we must merge the two sorted sequences (light records and heavy records) in each MSD zone. We refer to this step as the dovetail merging step, implemented by the DTMerge function on line 19 of Alg. 2.
Each DTMerge call deals with an MSD zone with k + 1 buckets B_0..B_k stored consecutively in the array A. The first bucket B_0 is the light bucket of this zone; the remaining k buckets are heavy buckets. Given these special properties, the problem is equivalent to viewing the k heavy buckets as partitioning the light bucket into k + 1 blocks and interleaving them; hence we call this process dovetail merging. A naïve solution is to call a standard parallel merge to combine B_0 with B_1..B_k, with linear work and polylogarithmic span. Despite its efficiency in theory, in practice it requires two rounds of out-of-place global data movement: all records have to be merged into a temporary array and then copied back to the input array. Such data movement causes more cache misses and may make the algorithm inefficient. There exists an in-place parallel merge algorithm in theory [31], but it incurs a large constant hidden in the asymptotic cost (moving each record up to four times), which may also be expensive in practice. To enable good performance, our goal is to reduce data movement, with the additional consideration of moving less data out of the input array.
Our idea is based on two observations. First, being strictly in-place is unnecessary, since O(n) temporary space is already required by the counting sort in Step 2; we can therefore reuse this space while minimizing the number of out-of-place memory accesses. Second, the number of heavy buckets is guaranteed to be small (at most 2^ℓ of them). As a result, to achieve a practical solution, we process the buckets in the same MSD zone one by one sequentially (note that all MSD zones are still processed in parallel), and aim for high parallelism within each bucket.
With these insights, we present our DTMerge algorithm with an illustration in Fig. 3 and pseudocode in Alg. 3. The high-level idea is to copy only the smaller of the light bucket (B_0) and the heavy buckets (B_1..B_k) out to a temporary array, so that we copy at most half of the records out-of-place; the larger half is handled within the input array. WLOG, assume the current subproblem has more heavy records than light ones (line 2); the other case is handled symmetrically (line 13). In this case, we first move the light records to a temporary array (line 3), and then move all heavy buckets to their final positions. To obtain these final positions, we first binary search all heavy keys (in parallel) in the light bucket B_0 (line 1). Combining this information with the sizes of all buckets, we can infer the starting position of each heavy bucket. We then (stably) move each bucket B_i to its starting position, one by one from B_1.
Ideally, we wish to copy all records in B_i to their final positions in parallel, but we must ensure that these positions are "safe" to overwrite. Since all heavy buckets are initially at the end of the MSD zone, their final destinations must be ahead of their initial positions. There are three cases for bucket B_i's destination. The first case is that it contains light records, which is safe since all light records have been backed up in the temporary array. The second is that it overlaps with an earlier heavy bucket B_i′ where i′ < i. This is also safe: since i′ < i, bucket B_i′ has already been moved to its final destination. In these first two cases, we can move the records directly to their final destination in parallel (line 10). The last and most interesting case is when the destination of the bucket overlaps its own initial position (line 5). Our solution is inspired by the in-place circular shift algorithm [27,60] (see Phase 3 in Fig. 3): we first flip (reverse the order of) the bucket itself (line 7), and then flip the entire affected region of the array (line 8). Flipping an array can be performed in parallel and in place by swapping corresponding records. As such, the entire chunk of B_i is moved to an earlier position in a stable way. After dealing with all heavy buckets, we copy all records in the temporary array back to A, all in parallel, directly to their final destinations (line 12). In this way, our algorithm incurs out-of-place copies for at most half of the records, and moves the remaining records at most twice, entirely within A. Within each bucket, all operations (direct copies or swaps) are highly parallel.
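The branch of DTMerge that backs up the light records, including the two-flip move, can be sketched sequentially as follows (a simplified Python stand-in for Alg. 3, not the parallel implementation: one MSD zone held in a list, with each heavy bucket given by its key and length; the names are ours):

```python
from bisect import bisect_left

def flip(A, i, j):
    # Reverse A[i:j] in place (parallelizable by swapping pairs).
    A[i:j] = A[i:j][::-1]

def dt_merge(A, light_len, heavy_lens, heavy_keys):
    # Dovetail-merge one MSD zone: A = sorted light records followed by
    # the heavy buckets (each a run of one repeated key, keys increasing).
    # Back up the light records, slide each heavy bucket left into place
    # (two reversals when a bucket overlaps its old position), then
    # scatter the light records into the remaining slots.
    tmp = A[:light_len]                       # back up light records
    starts, acc = [], 0                       # final start of each bucket =
    for k, ln in zip(heavy_keys, heavy_lens): # light keys below it
        starts.append(bisect_left(tmp, k) + acc)  # + earlier heavy sizes
        acc += ln
    src = light_len                           # current start of bucket i
    for dst, ln in zip(starts, heavy_lens):
        if dst + ln <= src:                   # destination clear: copy
            A[dst:dst + ln] = A[src:src + ln]
        else:                                 # overlaps itself: two flips
            flip(A, src, src + ln)
            flip(A, dst, src + ln)
        src += ln
    gaps = [j for j in range(len(A))          # slots not covered by any
            if not any(s <= j < s + ln       # heavy bucket get the light
                       for s, ln in zip(starts, heavy_lens))]
    for j, rec in zip(gaps, tmp):
        A[j] = rec
    return A
```

The two flips first reverse the bucket and then reverse the whole region from its destination to its old end, which lands the bucket at its destination with its internal order restored.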
Note that although buckets within the same MSD zone are processed sequentially, this only adds Õ(2^ℓ) to the span of each recursive call, because there are only O(2^ℓ) heavy buckets. Interestingly, although a parallel merge gives a better span bound than our DTMerge, even in theory it is unnecessary, because the counting sort in Step 2 also requires O(2^ℓ) span. We present more theoretical analysis in Thm. 4.5.
In Sec. 6.3, we experimentally study DTMerge and show that it accelerates Step 4 compared to using a parallel merge. We believe the "dovetail" step is a unique challenge arising from detecting heavy keys in integer sorts, and our new technique solves it in an effective yet simple way.

3.5 Base Cases
For the recursive structure, we use two base case conditions. The first is as in a regular integer sort: all digits have been sorted (Alg. 2 line 1). The second is when the problem size is smaller than θ = 2^{2ℓ}, in which case we apply a comparison sort. Using comparison sort as a base case is standard in parallel integer sort implementations. Interestingly, instead of viewing it as a mere empirical optimization, our theoretical analysis indicates that this base case, as well as the threshold for using it, is crucial to guarantee theoretical efficiency. We present more details in Sec. 4.

The Theory of Parallel MSD Sort
In this section, we provide an analysis for a class of MSD integer sorting algorithms based on the framework in Alg. 1, including DTSort. Recall that such an algorithm sorts a "digit" of ℓ bits in each round by 1) distributing all records to the corresponding buckets by the current digit, and 2) sorting each bucket recursively. Finally, a comparison-sort base case is applied to subproblems of size smaller than a parameter θ. As discussed in Sec. 1 and 2, although this parallel MSD sort is widely used in practice, we are unaware of any tight analysis of it. To the best of our knowledge, the best-known work bound is O(n log r) [45], which is no better than a comparison sort with O(n log n) work in the realistic setting where r = Ω(n). This is inconsistent with the excellent performance of existing parallel IS algorithms in practice, which are much faster than SOTA comparison-based sorts on many inputs. In this paper, we bridge this gap between theory and practice with Thm. 4.1, 4.4, and 4.5. Recall that the main use case for integer sort is when r = Ω(n) (otherwise a counting sort has better cost bounds); we hence make this assumption on r in our analysis.

The Analysis of the General MSD Sort
As a warmup, and to illustrate some of the key theoretical ideas, we start by showing that we can achieve an unstable MSD sort with low work and span. Under the standard RAM model that handles integers of O(log n) bits (i.e., r = n^{O(1)}), the theorem implies O(n·√(log n)) work and O(log^{1.5} n) span whp. The key to our analysis is to properly set up the parameters in Alg. 1; once we do so, the analysis becomes relatively clean. The parameters are digit width ℓ = √(log r) and base case threshold θ = 2^{αℓ} for some constant α ≥ 1. This gives (log r)/ℓ = O(√(log r)) recursive levels in the entire algorithm. We use the unstable counting sort [47] (see Sec. 2.4) for the distribution step, and a comparison sort with O(n′ log n′) work and O(log n′) span [26] to handle base cases of size n′. The two main parts of Alg. 1 are the base cases (line 2) and the distribution step (line 3); we first analyze their work with the parameters shown above. We now prove the span. The span of a base case is O(log 2^{αℓ}) = O(ℓ) [26]. The total span of one recursive call is the sum of O(log n) whp (for the distribution step) [14] and O(ℓ) (for line 5). Since there are O(√(log r)) recursive levels, the total span bound is as stated.
□ We note that Thm. 4.1 holds for any ℓ = Θ(√(log r)); for simplicity we omit the Θ(·) notation. Practical and Stable MSD Sorting Algorithms. Thm. 4.1 shows the theoretical efficiency of parallel MSD algorithms. However, the counting sort [47] used in the analysis is not stable, and is unlikely to be practical due to too many rounds of data movement [32]. Many practical algorithms instead use the stable counting sort [55], which we analyze as follows. □ This gives the same work bound as the unstable version. Although the span is slightly higher than in Thm. 4.1, this version is stable and more practical. In fact, many existing practical parallel IS algorithms, such as PLIS [10], use an approach similar to Thm. 4.4. Theory-Guided Practice. Thm. 4.4 shows O(n·√(log r)) work for practical parallel MSD sort. This explains why integer sort algorithms can outperform comparison sorts with O(n log n) work for the realistic range r = n^{O(1)}. Our parameter ℓ = √(log r) used in the analysis is also the choice of most practical MSD algorithms. When r = 2^64, we have ℓ = √(log r) = 8 and λ = 2^ℓ = 2^8, which roughly matches existing algorithms that empirically sort 8-12 bits at each recursive level and use base case sizes from 2^8 to 2^16 [6,10,45].
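The theory-guided parameter choices can be checked numerically (a small sketch; the rounding choice and the θ = 2^{2ℓ} threshold instantiation are ours):

```python
import math

def msd_parameters(log_r):
    # Theory-guided parameters: digit width ell = sqrt(log r) bits,
    # radix lambda = 2^ell, and base-case threshold theta = 2^(2*ell).
    ell = round(math.sqrt(log_r))
    return ell, 2 ** ell, 2 ** (2 * ell)

# 64-bit keys (r = 2^64): ell = 8 bits per digit, radix 256, and
# base-case threshold 2^16 -- roughly matching the 8-12 bits and
# 2^8..2^16 thresholds used by practical implementations.
```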

The Analysis of DTSort
In this section, we show that DTSort, with the additional steps of sampling and dovetail merging, remains theoretically efficient. We first prove the following theorem. In the dovetail merging step, we need O(2^ℓ) binary searches over n′ records, so the total cost is O(2^ℓ log n′), which is O(n′) since n′ ≥ θ = 2^{2ℓ}. When moving a bucket of size s, the total work is O(s) (either a direct copy or two rounds of flipping over at most 2s records). We also copy out and back at most n′/2 records. Combining both parts, the work of the merging step is also O(n′). Therefore, compared to the algorithm analyzed in Thm. 4.4, DTSort uses asymptotically the same work, O(n·√(log r)). For the span, note that the algorithm in Thm. 4.4 has O(2^ℓ) span in each recursive call. Again consider the two additional steps. Taking samples in parallel and sorting them requires polylog(n) span. We process the arrays of heavy and light buckets sequentially; since there are at most 2^ℓ heavy buckets and exactly 2^ℓ light buckets, this gives O(2^ℓ) span in each recursive call.
For the dovetail merging step, we move at most 2^ℓ heavy buckets. Moving each of them is either a direct copy or two rounds of simple flipping; both can be done by a parallel-for over all affected records, giving O(log n) span assuming binary forking. This gives O(2^ℓ log n) = Õ(2^ℓ) span in one recursive call. Therefore, the two additional steps both have Õ(2^ℓ) span in each recursive call, and considering all O(√(log r)) levels gives the stated span bound. □ We also note that if the dovetail merging step instead uses a simple parallel merge (i.e., our baseline solution), we obtain the same span bound of O(2^√(log r) · √(log r)) as in Thm. 4.4; the same holds if we assume arbitrary forking (e.g., PRAM). To avoid this minor subtlety, we hide the logarithmic terms in the theorem with the Õ(·) notation. Our main claim is that DTSort achieves O(n·√(log r)) work and is highly parallel. In addition to the worst-case bound, we can also show that DTSort achieves an asymptotically better work bound on some real-world distributions with heavy duplicates. We show two special cases in which DTSort achieves linear work, in Thm. 4.6 and 4.7. Here an exponential distribution means the key frequencies follow the probability density function f(x; β) = βe^{−βx} for x > 0 and a parameter β > 0.
Theorem 4.6. DTSort has $O(n)$ work whp if the input key frequencies exhibit an exponential distribution with $\lambda e^{-\lambda} \geq c/2^\gamma$ for some $c > 1$. Here $\lambda > 0$ is the parameter of the exponential distribution, which gives probability density function $f(x;\lambda) = \lambda e^{-\lambda x}$ for $x > 0$.

Proof. We first consider the exponential distribution, where the input follows the probability density function $f(x;\lambda) = \lambda e^{-\lambda x}$ for $x > 0$ and a parameter $\lambda > 0$. Since we are considering integer sort, each randomly drawn real number is rounded to an integer, and randomly mapped to an integer in $[r]$. DTSort can detect a large fraction of the records as heavy keys when the input is drawn from an exponential distribution. In particular, the sampling scheme in Alg. 2 detects keys with more than $cn/2^\gamma$ occurrences as heavy keys whp for some constant $c > 1$ [32], or equivalently, keys with frequencies higher than $c/2^\gamma$. This indicates that an integer $x$ with $\lambda e^{-\lambda x} \geq c/2^\gamma$ will be identified as heavy. The cumulative distribution function of the exponential distribution is $F(x;\lambda) = 1 - e^{-\lambda x}$ for $x \geq 0$. Thus, the total number of light records is no more than $n/\sqrt{\log r}$ when plugging in $x$ from the above threshold for heavy keys. However, since we are considering integer sort (keys are rounded to integers), we need to make sure that $x \geq 1$, which requires the assumption that $\lambda e^{-\lambda} \geq c/2^\gamma$. Even if we assume the worst case that these light records are processed for all of the remaining $\sqrt{\log r} - 1$ levels, the cost to process them is $O(n/\sqrt{\log r}) \cdot O(\sqrt{\log r}) = O(n)$. Hence, the total work for DTSort on the exponential distribution under this assumption is $O(n)$. □

Theorem 4.7. DTSort has $O(n)$ work whp if there are no more than $c'2^\gamma$ distinct keys, for some constant $c' < 1$.
Proof. For simplicity, we use the same notation $c$ as in the previous proof. Similarly, keys with frequencies higher than $c/2^\gamma$ will be detected as heavy. The maximum number of light records is therefore no more than $c'2^\gamma \cdot cn'/2^\gamma < cc'n'$ for a subproblem of size $n'$. By setting $c' < 1/c$, the number of light records shrinks by a constant fraction in each level. The total cost in a recursive level is linearly proportional to the total size of all subproblems. Since the number of light records decreases geometrically across levels, the total work is asymptotically bounded by the root level, which is $O(n)$. □
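The sampling scheme underlying both proofs can be sketched as follows (our simplified sequential illustration; the sample size and threshold are hypothetical stand-ins for the paper's $\Theta(2^\gamma \log n)$ samples and $c/2^\gamma$ frequency threshold): draw a random sample and declare a key heavy if it appears at least `threshold` times among the samples.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Sampling-based heavy-key detection sketch. Keys whose frequency is
// well above threshold/num_samples are caught with high probability;
// keys can be falsely missed but a miss only costs extra recursion,
// never correctness. Precondition: keys is non-empty.
std::unordered_set<uint32_t> detect_heavy(const std::vector<uint32_t>& keys,
                                          size_t num_samples, size_t threshold,
                                          uint64_t seed = 42) {
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<size_t> idx(0, keys.size() - 1);
    std::unordered_map<uint32_t, size_t> count;
    for (size_t i = 0; i < num_samples; i++) count[keys[idx(rng)]]++;
    std::unordered_set<uint32_t> heavy;
    for (const auto& [k, c] : count)
        if (c >= threshold) heavy.insert(k);
    return heavy;
}
```

In the real algorithm the samples are sorted in parallel and heavy keys are read off from runs of equal samples, but the counting view above captures the same detection guarantee.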

Implementation Optimizations
We now discuss some optimizations we find useful when implementing DTSort.

Overflow Bucket. Note that integer sort takes input keys from a range from 0 up to $r$, but the actual keys may lie in a much smaller range. For example, although the keys can be 32-bit integers, the actual range can be much smaller than $2^{32}$. To take advantage of this, one possibility is to compute the maximum key by a parallel reduce, as in many existing implementations [10, 37]. Here we discuss a simple yet effective optimization that avoids this computation. Similar to the heavy/light separation of the keys, we again use sampling to detect the key range. In each subproblem, DTSort sets the upper key range $r'$ to be the largest value in the sample set $S$. As such, DTSort skips the leading zeros of $r'$, since creating buckets for these bits is wasteful. Since the upper bound is computed from a sample set, DTSort creates an overflow bucket to hold records whose keys have at least one "1" among the skipped bits. The size of the overflow bucket is usually very small. We put the overflow bucket at the end, and comparison sort the records in it. As such, we do not need to compute the maximum key in the input.

Minimizing Data Movement. In Alg. 2, for simplicity, our description assumes the stable distribution mentioned in Sec. 2.4 and writes each bucket back to $A$ (see line 15). The second copy-back is not necessary and should be avoided, since the performance of sorting algorithms can suffer from excessive data movement. In our implementation, we instead maintain two arrays $A$ and $B$. After the distribution, we use array $B$ for future execution, until we see another distribution or a dovetail merging, and we then distribute or merge the results back to $A$ (or the other array in the recursion). This approach reduces the total data movement. Finally, if the data remains in $B$ at the end of the algorithm, we need to move it back to $A$ before finishing.
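The overflow-bucket idea can be sketched as follows (a simplified sequential sketch of ours; `std::sort` stands in both for the radix phase on the reduced bit width and for the comparison sort of the overflow bucket):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Estimate the key range from a sample instead of a full parallel
// reduce, then partition: keys <= r' go to the radix-sorted region
// (which only needs the bit width of r'), and the rare keys that
// exceed the sampled maximum go to an overflow bucket at the end,
// which is comparison sorted.
void sort_with_overflow(std::vector<uint32_t>& a,
                        const std::vector<uint32_t>& sample) {
    uint32_t r_prime = *std::max_element(sample.begin(), sample.end());
    auto mid = std::partition(a.begin(), a.end(),
                              [&](uint32_t k) { return k <= r_prime; });
    std::sort(a.begin(), mid);  // placeholder for radix sort on fewer bits
    std::sort(mid, a.end());    // overflow bucket: comparison sort
}
```

Since every overflow key is larger than $r'$, which in turn is at least every non-overflow key, sorting the two regions independently yields a fully sorted array.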

Experiments
We implemented DTSort in C++ and tested its performance. DTSort uses ParlayLib [10] to support fork-join parallelism.

Experimental Setup. We run our experiments on a 96-core machine (192 hyper-threads) with 4 × 2.1 GHz Intel Xeon Gold 6252 CPUs (each with a 36MB L3 cache) and 1.5TB of main memory. We always use numactl -i all to interleave memory across all CPUs, except for sequential tests. We run each test six times and report the median of the last five runs. All running times are in seconds.

Tested Algorithms. We compare our DTSort with six baseline parallel algorithms, including four MSD integer sorting algorithms and two comparison-based algorithms (both are samplesorts). We present their information in Tab. 2.

Synthetic Distributions. We use three standard distributions to test all algorithms: Unif-$u$ (uniform distribution with $u$ distinct keys), Exp-$\lambda$ (exponential distribution with parameter $\lambda \cdot 10^{-5}$), and Zipf-$s$ (Zipfian distribution with parameter $s$). We also design a special distribution Bit-Exponential (BExp-$b$), which aims to simulate "worst-case" inputs for DTSort. For BExp-$b$, every bit of a key is 0 with probability $1/b$, and 1 otherwise. BExp controls the distribution of the bitwise encoding (rather than the frequency) of keys. With a large $b$, most subproblems will have a mix of light and heavy records, which is an adversarial case for DTSort due to the necessary dovetail merging in almost every subproblem. In fact, BExp is generally hard for all MSD algorithms because it makes subproblem sizes uneven. For simplicity, we say a distribution is heavy (or heavier than another) if it contains more duplicates, and light (lighter) otherwise.
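The BExp generator as described above can be sketched as follows (our reading of the text; the function name and seed handling are ours):

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Generate n 32-bit keys where each bit is 0 with probability 1/b and
// 1 otherwise (b > 1). Large b drives keys toward all-ones, skewing
// the bitwise encoding and thus the MSD subproblem sizes.
std::vector<uint32_t> gen_bexp(size_t n, double b, uint64_t seed = 1) {
    std::mt19937_64 rng(seed);
    std::bernoulli_distribution zero(1.0 / b);  // P(bit = 0) = 1/b
    std::vector<uint32_t> keys(n, 0);
    for (auto& k : keys)
        for (int bit = 0; bit < 32; bit++)
            if (!zero(rng)) k |= (1u << bit);
    return keys;
}
```

For example, with $b = 300$ the expected number of zero bits per key is $32/300 \approx 0.1$, so the vast majority of keys are exactly $2^{32}-1$ and most MSD buckets are nearly empty while one is nearly full.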

Overall Performance on Synthetic Data
We present the running times of all tested algorithms on the synthetic distributions in Tab. 3, with different key lengths.

Standard Distributions. With 32-bit keys, DTSort has the best overall performance, and is 1.27-1.47× faster than the baselines. A heatmap of the 32-bit results is also presented in Fig. 1. As expected, the other integer sorting algorithms are usually competitive with DTSort on light distributions, but can be much slower than DTSort on heavy distributions. Take the best baseline on 32-bit keys, PLIS, as an example. On light distributions, it is only 2-8% slower than DTSort, and is faster than the comparison sorts. However, on the heaviest several distributions in each category, PLIS is almost always slower than the best comparison sort IPS⁴o, making its overall advantage over IPS⁴o small (only 4%). On the other hand, DTSort can take advantage of the heavy keys and outperforms IPS⁴o in all but two tests (which we study in more detail below). Across all tests, DTSort can be up to 2.3× faster than PLIS.
With 64-bit keys, all algorithms are slower than with 32-bit keys due to the increase in data size, but the integer sorting algorithms are affected more significantly. This matches the theory that the work bounds of these algorithms increase with the value of $r$ (see Sec. 4). In this case, all baseline integer sorting algorithms are slower than the fastest comparison sort IPS⁴o on average, and many of them cannot outperform IPS⁴o even on some light distributions. On the contrary, DTSort is still faster than both tested comparison sorts in almost all tests, and is 1.2-1.5× faster on average. Among all tests on the standard distributions, DTSort performs the best in all but three tests, where the samplesort IPS⁴o is better. Two of them are Unif-$10^3$ for both 32- and 64-bit keys. The reason should be that, based on our parameter settings, the average key frequency in Unif-$10^3$ (about $10^6$) is right at the threshold that distinguishes heavy and light keys at the root level. We tested this on purpose to show a bad case for DTSort: all "light keys" are reasonably heavy (right around the threshold), but they still have to go through another level of recursion and be processed in the dovetail merging step. This also illustrates the inherent difficulty of dealing with duplicates for parallel IS algorithms. The last case where DTSort is slower than IPS⁴o is Zipf-1, where DTSort is slower by less than 5%.
In general, DTSort effectively takes advantage of different levels of heavy duplicates: it exhibits a similar trend to samplesort algorithms, where the running time decreases with more duplicates. Meanwhile, as an integer sort, DTSort also has the benefit of being asymptotically more efficient than comparison sorts for realistic input key ranges $r = n^{O(1)}$.

[Table 3 caption: The instances of each distribution are ordered by increasing number of heavy records from top to bottom. The fastest running time on each input instance is underlined. "Avg." = geometric mean. The keys of RD need to be padded to multiples of 64 bits; we remove it from the 32-bit experiments because it is too slow after padding to 64 bits.]

The Adversarial Distribution (BExp). As mentioned above, we designed the BExp distribution to study an adversarial case for DTSort and other MSD algorithms. The distribution incurs very different MSD bucket sizes in all subproblems, which causes load imbalance for all MSD algorithms. It also generates a mix of light and heavy buckets in almost all subproblems, which greatly increases the cost of the dovetail merging and offsets the benefit of identifying heavy keys. Almost all existing IS algorithms perform poorly on this distribution. Interestingly, even the comparison-based sorting algorithms perform much worse when the distribution is very skewed. Among the MSD algorithms, IPS²Ra and RD are significantly affected and can be up to 10× slower than the others. DTSort, PLIS, and RS are only lightly affected on 32-bit keys, but in general can still be within 2× slower compared to their performance on the standard distributions. A possible explanation is that all three algorithms use similar work-stealing schedulers (Cilk for RS, the ParlayLib scheduler for DTSort and PLIS), which support dynamic load balancing. This matches our scalability tests (reported in Appendix C), where IPS²Ra and RD do not scale beyond 24 threads on this distribution. RS seems particularly resistant to the BExp distribution. One reason is that RS is unstable, and records with the same key do not need to be placed in a specific order. Sacrificing stability thus has the side effect of benefiting from highly skewed distributions.
Overall, DTSort still achieves the best performance on 32-bit keys, and is still faster than the comparison sorts in most cases. We believe this advantage comes from the removal of heavy keys from recursive subproblems, which alleviates the load-imbalance issue. In addition, the performance of DTSort is also significantly improved by our proposed DTMerge algorithm. As we will show in Sec. 6.3, and as expected, the merging step is very expensive on BExp distributions, taking up to 50% of the entire running time if we simply use parallel merge. Our new algorithm accelerates this step by 1.7-2.8×, making the overall running time much lower.

Applications
Integer sort is widely used in practice. We choose two representative applications, on graph and geometric processing, to evaluate the performance of the tested algorithms.

Graph Transpose. Given a directed graph $G = (V, E)$, the graph transpose problem is to generate $G^\top = (V, E^\top)$ where $E^\top = \{(v, u) : (u, v) \in E\}$. Here we assume the edges are stored in the standard compressed sparse row (CSR) format. To compute the transposed graph, we only need to stably integer sort all edges using the second vertex $v$ as the key; here the vertices with larger degrees are the "heavy keys".

[Table 4 caption: Running time (in seconds) on multiple applications. We use 32-bit keys and 32-bit values. See Sec. 6.2 for the detailed explanation of the keys and values. The fastest running time on each instance is underlined. "$n$" = input sizes. "s.g." = segmentation fault. "Avg." = the geometric mean over instances of the same application. "-" = not applicable.]
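The CSR transpose via a stable distribution on the target vertex can be sketched as follows (a minimal sequential sketch of ours; the parallel version replaces the counting loop and prefix sum with a parallel counting sort and scan):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CSR graph: offsets has size n+1, targets has size m.
struct CSR {
    std::vector<uint32_t> offsets;
    std::vector<uint32_t> targets;
};

// Transpose by stable counting sort keyed on the target vertex:
// count in-degrees, prefix-sum them into offsets, then scan edges
// (u, v) in order and append u to bucket v. Stability preserves the
// source ordering within each bucket.
CSR transpose(const CSR& g) {
    uint32_t n = static_cast<uint32_t>(g.offsets.size()) - 1;
    CSR t;
    t.offsets.assign(n + 1, 0);
    for (uint32_t v : g.targets) t.offsets[v + 1]++;        // in-degrees
    for (uint32_t i = 0; i < n; i++) t.offsets[i + 1] += t.offsets[i];
    t.targets.resize(g.targets.size());
    std::vector<uint32_t> cur(t.offsets.begin(), t.offsets.end() - 1);
    for (uint32_t u = 0; u < n; u++)
        for (uint32_t e = g.offsets[u]; e < g.offsets[u + 1]; e++)
            t.targets[cur[g.targets[e]]++] = u;             // stable placement
    return t;
}
```

A vertex with a high in-degree corresponds to a heavy key in the integer-sort formulation, which is exactly the case DTSort is designed to exploit.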
Compared to the best baseline (PLIS), DTSort is always faster, by up to 1.3×. The samplesort IPS⁴o performs the best on CW, but is much slower on small graphs. Overall, DTSort has the best performance across all graphs.

Morton Sort. For a $d$-dimensional point, a z-value is calculated by interleaving the binary representations of its coordinates. Sorting the z-values of a set of points is called Morton sort, which orders multidimensional data in one dimension while preserving the locality of the data points. It is widely used in geometry, graphics, and machine learning applications [11, 12, 28, 29, 42, 57].
We test seven datasets for Morton sort and list the results in Tab. 4. Three are real-world datasets: GeoLife (GL) [58, 62], Cosmo50 (CM) [40, 58], and OpenStreetMap (OSM) [1]. The other four are synthetic datasets generated by the widely-used generator Varden [24], which creates points with varying densities. On the synthetic distributions with varying densities, DTSort outperforms all baselines by 1.2-3.4×. On the three real-world point datasets, existing integer sorts (RS or PLIS) achieve the best performance on GL and CM, while DTSort is competitive (3-17% slower). On OSM, none of the integer sorts outperform the samplesort IPS⁴o. Interestingly, DTSort still achieves the best average performance on the three datasets (PLIS is also close). This may be because DTSort combines the benefits of both comparison and integer sorts, and hence avoids very bad worst-case performance.
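For reference, a 2D z-value can be computed with the standard magic-mask bit-interleaving trick (a generic sketch, not the paper's code; coordinates are assumed to fit in 16 bits so the z-value fits in 32 bits):

```cpp
#include <cassert>
#include <cstdint>

// Spread the low 16 bits of x to the even bit positions of the result
// (the classic "magic masks" expansion).
uint64_t spread_bits(uint32_t x) {
    uint64_t v = x & 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// 2D z-value: x occupies the even bits, y the odd bits. Morton sort
// is then just an integer sort on these codes.
uint64_t z_value(uint32_t x, uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

For $d > 2$ the same idea applies with wider strides between the spread bits; sorting the resulting codes with any integer sort yields the Morton order.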

Performance Study
Heavy Keys Detection Evaluation. To understand the performance gain from detecting heavy keys and processing them separately, we compare our implementation with (labeled DTSort) and without (labeled Plain) heavy-key detection in Fig. 4 (a) and (b). We select eight representative distributions: the lightest and heaviest cases of each distribution.
As discussed in Sec. 3, the overhead of DTSort compared to Plain includes sampling to detect heavy keys, and merging heavy keys with light keys. On 32-bit inputs, when there are no or few heavy keys, DTSort is comparable with Plain, with at most a 12% performance loss on BExp-10. This indicates that the overhead of sampling is negligible compared to the other steps. With heavy duplicate keys, since heavy keys are gathered into separate buckets and do not need to be sorted in later recursions, DTSort outperforms Plain by up to 2.12× (on BExp-300). Overall, DTSort outperforms Plain by 25% on average. On 64-bit inputs, the performance improvement from heavy-key detection becomes larger: DTSort consistently outperforms Plain, with a 1.50× speedup on average.

Dovetail Merge Evaluation. In this section, we present an in-depth study of the dovetail merging step that implements the heavy-light separation idea. As mentioned in Sec. 3.4, our baseline solution (parallel merge) is correct and even asymptotically better, but it requires moving all records between the input array and a temporary array. To achieve the best performance, we utilize the special property of this problem (i.e., very few heavy buckets) and propose our DTMerge algorithm in Sec. 3.4. We select seven representative distributions to evaluate the merge performance. For the uniform distribution, we only use Unif-$10^3$, since no merging is needed in the lightest and heaviest cases. For the other distributions, we select the lightest and heaviest cases.
To illustrate the effectiveness of our DTMerge approach, we compare it against a standard parallel merge in ParlayLib [10] (referred to as PLMerge) as the baseline. When reporting running times, we also run the algorithms with the merging step skipped in all recursive calls. This does not give a correct sorting result, but indicates the running time of the other steps in DTSort. We present the results for both 32-bit and 64-bit keys in Fig. 4 (c) and (d). Since they generally show similar trends, we discuss the 32-bit experiments.
On lighter distributions (e.g., Exp-1 and Zipf-0.6), the overall improvement is marginal, given that the dovetail merging step is fast anyway. However, on heavier (more skewed) distributions, this step can take a significant portion of the overall time with PLMerge: 17% on Exp-10 and 42% on Zipf-1.5. With our DTMerge, this step only takes 6% on Exp-10 and 28% on Zipf-1.5, improving the overall performance of both by about 20%.
For BExp, since almost all subproblems have a mix of heavy and light keys, the merging time is significant. Using PLMerge, the merging step takes 27-43% of the running time. Our DTMerge improves the performance of the merging step by up to 2.8×, leading to 1.1-1.38× faster sorting.

Scalability Tests. DTSort scales well with both the number of processors and the input size. Due to the page limit, we provide results on one distribution (Zipf-0.8) as an example in Fig. 4 (e) and (f), and give the full discussion in Appendix C.

Conclusion
We study the theory and practice of parallel integer sort. In theory, we show that a class of widely-used practical MSD integer sorting algorithms has $O(n\sqrt{\log r})$ work and low span. These bounds are asymptotically lower than comparison sort for a large key range $r = n^{o(\log n)}$, and provide a solid theoretical foundation explaining the high performance of parallel integer sort. In practice, we propose DTSort to overcome a common challenge in existing MSD algorithms: the difficulty of detecting and taking advantage of duplicate keys. DTSort combines algorithmic insights from both integer and comparison sorting algorithms, and uses a special dovetail merging step to make it more efficient. DTSort achieves the same $O(n\sqrt{\log r})$ work bound as other MSD integer sorting algorithms, and has the same advantage of processing heavy duplicates efficiently as samplesort algorithms. In our experiments, DTSort achieves competitive or better performance than state-of-the-art parallel integer and comparison sorting algorithms on various synthetic and real-world datasets. As future work, we consider further improving the performance of parallel sorting algorithms by accelerating the distribution step, inspired by existing work [6, 31, 45].
Our algorithm achieves top-tier (almost-linear) speedup, with up to 72× on 192 hyper-threads.
On the BExp distribution, the self-relative speedups are slightly worse for all algorithms. In particular, IPS²Ra, IPS⁴o, and RD do not scale well (or even degrade) with an increasing number of threads. As discussed in the main body, this is mainly because BExp creates subproblems with very different sizes, which is challenging for integer sorts due to load imbalance. In most cases, DTSort, RS, PLSS, and PLIS still scale to at least 96 cores (RS's performance drops slightly with hyper-threading). This may be due to the use of work-stealing schedulers in these implementations, which enable dynamic load balancing. Our algorithm still achieves almost-linear scalability, and has the best speedups in most cases.
Input Size Scalability. We test all algorithms with various input sizes for both 32-bit and 64-bit keys. For each distribution, we choose two representative tests (a heavy one and a light one), and present the results in Figs. 21 to 36. We remove RD from these figures since it has a serious scalability issue on some of the distributions (it needs more than two seconds with $n = 10^7$). We measure input size scalability starting from $n = 10^7$, because on even smaller sizes, most algorithms take less than ten milliseconds. Overall, our algorithm DTSort, PLIS, and PLSS scale well over a large range ($n = 10^7$ to $2 \times 10^9$), and our algorithm achieves the best performance in most cases. RS, IPS²Ra, and IPS⁴o do not perform well when the input sizes are relatively small ($n \leq 10^8$), but they also achieve good performance when the input sizes are sufficiently large. [Figures 21-36: scalability with increasing input size, for 32-bit and 64-bit keys; lower is better.]

Figure 1.
Figure 1. Heatmap comparing sorting algorithms on $10^9$ records with 32-bit keys and 32-bit values. All numbers are running times relative to the best for each input. Raw data are in Tab. 3. The baseline algorithms are described in Tab. 2.

Theorem 4.5.
The DTSort algorithm (Alg. 2) is a stable integer sort with $O(n\sqrt{\log r})$ work and $\tilde{O}(2^{\sqrt{\log r}})$ span.

Proof. DTSort can be viewed as an extension of the generic stable integer sort for which we proved work/span bounds in Thm. 4.4. Compared to the algorithm analyzed in Thm. 4.4, DTSort has two additional steps: the sampling step and the dovetail merging step. We first show that the asymptotic work bound remains the same. Note that without sampling and merging, a subproblem of size $n'$ has $O(n')$ work. In DTSort, the total number of samples is $|S| = O(2^\gamma \log n)$. We use the base case threshold $\theta = 2^{2\gamma}$, and assume $r = \Omega(n)$, so the cost to sort the samples is $O(|S| \log |S|) = O(2^\gamma \log n \cdot \gamma) = O(2^{2\gamma}) = O(n')$. The other parts of the sampling step have work proportional to $|S|$, so the overall work in Step 1 is $O(n')$.

Figure 4.
(a) and (b): Analysis of the performance of heavy-key detection. Numbers are running times (lower is better) with or without heavy-key detection; (a) is for 32-bit keys and (b) for 64-bit keys. (c) and (d): Analysis of the performance of dovetail merging. Numbers are running times (lower is better) using our dovetail merging algorithm or a baseline merging algorithm; (c) is for 32-bit keys and (d) for 64-bit keys. (e) and (f): Scalability (higher is better) with varying numbers of threads, and running time (lower is better) with varying input sizes, on 32-bit key and 32-bit value pairs on one instance: Zipf-0.8. The full analysis is given in Appendix C. Discussions are in Sec. 6.3.

Algorithm 3: DTMerge($A$, $B_0$, $B_1$, ..., $B_k$)
Input: $k+1$ buckets consecutive in array $A$. The first one, $B_0$, is a sorted light bucket. The other $k$ buckets are heavy buckets, sorted by their keys.
1  Binary search all heavy keys in $B_0$, accordingly obtaining $t_1..t_k$, where $t_i$ is the starting index for heavy bucket $B_i$ when the array is fully sorted
   // Move each $B_i$ to its final positions starting from $A[t_i]$:
   if the final positions for $B_i$ overlap with the current positions of $B_i$ then
       // move $B_i$ in place with two rounds of flipping
   else
       ParallelForEach $j \leftarrow 0$ to $|B_i| - 1$ do
           Copy the $j$-th record in $B_i$ to $A[t_i + j]$

Lemma 4.2. With the chosen parameters, the total work for base cases (line 2) in Alg. 1 is $O(n\sqrt{\log r})$.

Proof. Let $n_1, \dots, n_m$ be the base case sizes, where $\sum_i n_i = n$. Due to the base case condition, $n_i \leq 2^{2\gamma}$, so the work to sort each base case is $O(n_i \log n_i) = O(n_i \gamma)$. The total work for all base cases is therefore $O(n\gamma) = O(n\sqrt{\log r})$. □

Lemma 4.3. With the chosen parameters, the distribution step (line 3) of Alg. 1 over all recursive calls has work $O(n\sqrt{\log r})$ whp.

Proof. We start with the total work in each recursive level (i.e., calls with the same parameters). Assume all subproblems in this level have sizes $n_1, \dots, n_m$, where $\sum_i n_i \leq n$. Using the unstable counting sort [47], the distribution work is $O(n_i + 2^\gamma)$ whp for each subproblem. Due to the base case condition, $n_i > 2^{2\gamma} \geq 2^\gamma$ since $\gamma \geq 1$. Hence, the total work in each recursive level is $O(\sum_i n_i) = O(n)$. Considering all $O(\sqrt{\log r})$ recursive levels, the total work is $O(n\sqrt{\log r})$ whp. □

With the two lemmas, we can now prove Thm. 4.1.

Proof of Thm. 4.1. Combining Lem. 4.2 and 4.3 gives the work bound in Thm. 4.1.
First, note that the work analysis remains the same as in Thm. 4.1, since the stable counting sort has the same work as the unstable one. The only difference is that the stable version has span related to the number of buckets, which is $O(\log n + 2^\gamma)$. Assuming $r = \Omega(n)$, this is $O(2^\gamma)$. The total span of the entire MSD sorting is therefore $O(2^{\sqrt{\log r}} \sqrt{\log r})$.
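For reference, instantiating the parameters (radix width $\gamma = \sqrt{\log r}$, hence $\log r / \gamma = \sqrt{\log r}$ recursive levels) combines the per-level costs above into the stated bounds:

```latex
W(n) \;=\; \underbrace{O(n)}_{\text{work per level}} \cdot \sqrt{\log r}
      \;=\; O\!\left(n\sqrt{\log r}\right),
\qquad
D(n) \;=\; \underbrace{O(2^{\gamma})}_{\text{span per level}} \cdot \sqrt{\log r}
      \;=\; O\!\left(2^{\sqrt{\log r}}\sqrt{\log r}\right).
```

The extra $\sqrt{\log r}$ factor in the span is polylogarithmic, so it is absorbed by the $\tilde{O}(2^{\sqrt{\log r}})$ notation used in Thm. 4.5.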

Table 2.
Algorithms tested in our experiments. There are two versions of PLSS; here we use the unstable but faster one. In-place means fully in-place ($o(n)$ extra memory).

Parameter Selection. DTSort mostly follows the theoretical analysis in Sec. 4, with the minor observation that using a variable radix width $\gamma$ (chosen based on the subproblem size, while remaining $\Theta(\sqrt{\log r})$ and thus preserving the asymptotic work bound) slightly improves the overall performance, similar to some existing IS algorithms such as PLIS. The base case threshold is $\theta = 2^{14}$.

Table 3.
Running time (in seconds) on synthetic data with $n = 10^9$.

[Figures 5-20: self-speedup with varying thread counts of all tested implementations on Unif-$10^7$, Unif-$10^3$, Exp-2, Exp-7, Zipf-0.8, Zipf-1.2, BExp-30, and BExp-100, for 32-bit and 64-bit keys.]