High-Performance and Flexible Parallel Algorithms for Semisort and Related Problems

Semisort is a fundamental algorithmic primitive widely used in the design and analysis of efficient parallel algorithms. It takes as input an array of records and a function extracting a \emph{key} per record, and reorders the records so that those with equal keys are contiguous. Since many applications only require collecting equal keys together, but not fully sorting the input, semisort is broadly applicable, e.g., in string algorithms, graph analytics, and geometry processing, among many other domains. However, despite dozens of recent papers that use semisort in their theoretical analysis and the existence of an asymptotically optimal parallel semisort algorithm, most implementations of these parallel algorithms choose to implement semisort by using comparison or integer sorting in practice, due to potential performance issues in existing semisort implementations. In this paper, we revisit the semisort problem, with the goal of achieving a high-performance parallel semisort implementation with a flexible interface. Our approach easily extends to two related problems, \emph{histogram} and \emph{collect-reduce}. Our algorithms achieve strong speedups in practice, and importantly, outperform state-of-the-art parallel sorting and semisorting methods in almost all settings we tested, with varying input sizes, distributions, and key types. We also test two important applications with real-world data, and show that our algorithms improve the performance over existing approaches. We believe that many other parallel algorithm implementations can be accelerated using our results.


Introduction
The semisort problem takes as input an array of records with associated keys, and returns a reordered array such that all records with identical keys are contiguous. Importantly, the problem does not require all keys to appear in sorted order in the output, nor all records with distinct keys to be sorted. Several other important and widely-applicable problems are closely related to semisort, such as the histogram problem that counts the number of occurrences of each key, and the more general collect-reduce problem that computes the aggregate "sum" for each key, based on all the records. The "sum" function can be defined based on any associative (sometimes also commutative) combine function (e.g., addition or maximum). Semisort, histogram, and collect-reduce are all widely used in different areas, but are often referred to using different names, e.g., groupBy/aggregation in databases [35, 56], frequency in data analytics applications, reduceByKey/groupByKey on RDDs in Spark [76], the shuffle step in the MapReduce paradigm [26], and others [46]. As an example of the applicability of these problems, consider a database of sales receipts keeping the information of each sold lineitem. Useful operations to analyze trends in this data include quickly gathering lineitems from the same branch together (semisort), counting the number of sold items in each month (histogram), and obtaining the total sale of lineitems of each brand (collect-reduce).
Semisorting was first studied as a theoretical problem by Valiant to efficiently simulate parallel machine models (e.g., the PRAM) with other models [69]. Sequentially, it is easy to semisort in $O(n)$ time using a hash table, and theoretically-efficient parallel algorithms are also known [45]. Today, in contrast to its initial development as a theoretical tool for machine simulations, semisort is widely used in the design and analysis of efficient and practical parallel algorithms, for example for graph analytics [1, 2, 6, 14, 20, 22, 27-30, 32, 33, 36, 37, 52, 53, 58, 61-63, 68], geometry problems [19, 42, 60, 70, 72], sequence algorithms, and many others [13, 17, 39, 43, 48, 49, 66, 67, 75]. However, there is a disconnect between theory and practice in these parallel applications. In all of the above-mentioned papers, semisort is only used in the theoretical analysis to obtain better bounds via the theoretically-efficient parallel semisort algorithm [45], but is not used in practical implementations of these algorithms. In particular, for the papers that implement and evaluate their parallel algorithms, a comparison sort (usually the samplesort in [9, 13, 18]) is used. Although semisorting is asymptotically simpler than sorting, semisorting is avoided in practice in favor of sorting the data.
The only known parallel semisort algorithm and implementation is by Gu et al. [45] in 2015 (the GSSB algorithm), with $O(n)$ expected work (number of operations) and space, and $O(\log n)$ span (longest chain of dependencies) whp [17]. Despite the asymptotic guarantees, the algorithm has not been widely used in practice for a few reasons. First, the algorithm uses many random accesses and is I/O-unfriendly, since it heavily uses hash tables. Second, the algorithm interface also incurs performance overhead. Specifically, the algorithm assumes the records are associated with hashed keys with no duplicates, rather than arbitrary keys (more details in Sec. 2.3). This assumption requires additional steps to hash the original keys and subsequently resolve collisions, which may take time comparable to semisort itself. Although none of these issues increase the asymptotic bounds, they both contribute to performance slowdowns that are hard to avoid in a faithful implementation. Hence, the semisort implementation in [45] is not faster than many recent sorting algorithms [10, 13, 57] in practice. Meanwhile, the GSSB algorithm is neither stable nor (internally) deterministic (i.e., the result may depend on runtime scheduling), due to the use of parallel hash tables.
In this paper, we revisit the semisort problem, with the goal of achieving a high-performance parallel semisort implementation with a flexible interface. We propose new parallel semisort algorithms that are efficient in terms of work, I/O, and space usage. Our flexible interface for semisort can also be extended to support efficient and parallel histogram and collect-reduce. Our algorithm takes any key type $K$, and a user-defined hash function $h: K \mapsto [1, \ldots, n^k]$ to map keys to integers. In principle, the only extra information we need is an equality test $=_K: K \times K \mapsto \text{Boolean}$. We observe that in most use cases, the key type also supports a less-than test $<_K: K \times K \mapsto \text{Boolean}$ that determines a total ordering, which can be used to improve the performance. We refer to the general semisort algorithm (only $=_K$ supported) and the version with $<_K$ as semisort$_=$ and semisort$_<$, respectively.
Our algorithm builds on the strengths of GSSB [45], but substantially redesigns several components to overcome the existing performance issues of the GSSB algorithm. GSSB works in three steps (we review more details in Sec. 2.3). It first uses samples to determine the heavy (frequent) and light (infrequent) keys, and constructs buckets for them based on estimated sizes from the samples. Each heavy key will be in a separate bucket, while multiple light keys can be grouped into the same bucket. The algorithm then scatters all records to their buckets by placing each record at a random slot in its bucket (with linear probing when the slot is occupied). Lastly, the algorithm refines each light bucket by radix sorting (on the hashed keys) within each light bucket. The main issue in GSSB is that the scatter phase is implemented using a parallel hash table, which causes excessive random memory accesses, some space overhead, instability, and non-determinism.
To avoid the use of a parallel hash table, we propose an idea inspired by the I/O-efficient parallel samplesort [18]: when constructing buckets and scattering records, we split the input into consecutive subarrays, use auxiliary arrays to count the appearances of each bucket in each subarray, and distribute the records in each subarray based on the counts. This approach enables a cache-friendly access pattern to the input, allows us to obtain the exact size of each bucket, and is stable and race-free (and thus deterministic). However, since samplesort and semisort are quite different, using the idea in [18] does not directly enable high performance. The challenge lies in choosing the best parameters for the number of heavy and light buckets. At a high level, the samplesort in [18] creates a bucket for every sampled key. However, using too many buckets increases the size of the auxiliary counting array, which can greatly increase memory accesses. On the other hand, having more buckets is useful to improve parallelism, since each bucket can be processed independently in parallel. Specifically for semisort, we also wish to create more heavy buckets, because heavy keys do not need to be refined and are easier to process.
To create the heavy and light buckets in the best way, we propose novel algorithmic ideas for semisort. First, we control the parameters to keep the number of buckets small, so that the auxiliary arrays fit in cache. This avoids excessive memory accesses to the auxiliary arrays (see more details about the auxiliary arrays in Sec. 3.2 and Fig. 2). Second, we deal with each light bucket recursively in parallel. To enable efficient recursive calls, we carefully design optimizations to avoid extra space in recursive calls. Our new approach saves main memory accesses to the auxiliary arrays, and, more interestingly, identifies more heavy keys than GSSB, with different degrees of "heaviness", using recursion. The "relatively heavy" keys in each light bucket can be identified and handled more efficiently, improving the overall performance.

[Figure 1: heatmap summary of our results, comparing Ours$_=$ and Ours$_<$ (any input type) with PLSS and IPS (integer input type).]
In addition to the algorithmic improvements for performance, we also redesigned the algorithm interface. Our algorithm directly takes the input with any key type, a user-defined hash function $h$, and an equality test (or a less-than test for semisort$_<$). This avoids the additional pre- and post-processing in GSSB. Due to the more flexible interface and algorithm design, our semisort algorithm can be easily extended to histogram and collect-reduce (see Sec. 3.5).
We tested our algorithms on a variety of benchmarks, with different core counts, input sizes, key lengths, and distribution patterns (uniform, exponential, and Zipfian). We summarize our results as a heatmap in Fig. 1. We also test two applications: graph transposing (reordering graph edge lists), and $k$-gram on English texts. Both our semisort$_=$ and semisort$_<$ algorithms achieve high performance on almost all tests. For example, on $10^9$ input records with 64-bit keys and 64-bit values over 15 distributions, our algorithm is 3.4$\times$ faster than the GSSB algorithm, and at least 1.35$\times$ faster than the best of the previous algorithms, both on average (geometric mean). Our algorithms also consistently perform well on the two applications with real-world data. In all but one application-input combination we tested in this paper, our algorithm is the fastest. Our code is publicly available [38], and we plan to integrate it into ParlayLib [13].

Problem Definitions
Given a sequence of records from a universe $U$, we define a key function $\mathit{key}: U \mapsto K$ to define the key for each record, where $K$ is the key type. We define $=_K$ as the equality test on $K$. When applicable, we use $<_K$ as the less-than test on $K$. Given a sequence of records $A$, its key function $\mathit{key}_A$, and the equality test $=_K$ on $K$, the semisort problem is to reorder the records in $A$ to $A'$ such that all records with the same key are contiguous in $A'$. We also require the user to provide a family of hash functions $h: K \mapsto [1, \ldots, n^k]$, for some constant $k \ge 1$. We call $h(\cdot)$ the user hash function.
Given $A$, $\mathit{key}_A$, $=_K$, and $h$, the histogram problem is to emit an array of key-value pairs $E$ consisting of the unique keys of $A$, with the value for each key equal to the number of times it appears in $A$. The collect-reduce function takes the same arguments as semisort and two additional functions: a map function $f_m: U \to V$, and a reduce monoid $(\oplus_V, I_V)$. The map function maps a record to a value of some type $V$. The reduce operation $\oplus_V: V \times V \to V$ combines values of type $V$ with identity $I_V$. The collect-reduce function returns the array of key-value pairs $E \in K \times V$ consisting of the unique keys of $A$, with the value associated with each key $k_i$ equal to $\oplus_{r \in A_i} f_m(r)$, where $A_i = \{r \in A \mid \mathit{key}_A(r) = k_i\}$. Note that histogram can be expressed as collect-reduce where the map function is the constant function 1, and the monoid is $(+, 0)$. With clear context, we drop the subscripts for these operations and functions.
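To make the definitions concrete, the semantics of the three problems can be captured by a small sequential reference implementation (a Python sketch under our own names; it specifies only the input/output behavior, not the paper's parallel algorithm):

```python
from collections import OrderedDict

def semisort(A, key):
    """Reorder A so that records with equal keys are contiguous.
    This stable sequential version groups keys in first-seen order."""
    groups = OrderedDict()                      # key -> list of records
    for r in A:
        groups.setdefault(key(r), []).append(r)
    return [r for g in groups.values() for r in g]

def collect_reduce(A, key, f_m, combine, identity):
    """For each unique key k_i, return (k_i, combine of f_m(r) over all
    records r with key k_i), scanning records in input order."""
    out = OrderedDict()
    for r in A:
        k = key(r)
        out[k] = combine(out.get(k, identity), f_m(r))
    return list(out.items())

def histogram(A, key):
    # Histogram = collect-reduce with the constant map 1 and monoid (+, 0).
    return collect_reduce(A, key, lambda r: 1, lambda a, b: a + b, 0)
```

For example, `semisort([3, 1, 3, 2, 1], lambda r: r)` groups the two 3s, the two 1s, and the 2 into contiguous runs.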

Computational Models and Other Notations
We use the work-span (or work-depth) model for fork-join parallelism with binary forking to analyze parallel algorithms [17, 25], which has recently been used in many papers on parallel algorithms [4, 5, 12, 16, 18, 19, 21, 23, 31, 33, 34, 41, 44, 73, 74]. We assume a set of threads that share a common memory. A process can fork two child software threads to work in parallel. When both children complete, the parent process continues. The work of an algorithm is the total number of instructions, and the span (depth) is the length of the longest sequence of dependent instructions in the computation. We can execute the computation using a randomized work-stealing scheduler [7, 24] in practice.
To measure the memory access cost of an algorithm, we use the classic I/O model [3, 40]. We assume a two-level memory hierarchy. The processor is connected to the cache of size $M$, and the cache is connected to an infinite-size main memory. Both the cache and the main memory are divided into blocks (cachelines) of size $B$, so there are $M/B$ cachelines in the cache. The CPU can only access memory on blocks resident in the cache, and such accesses are free of charge. We assume an optimal offline cache replacement policy to transfer data between the cache and the main memory, and a unit cost for each cacheline load and eviction. The I/O cost of an algorithm is the total cost to execute the algorithm on this model.
We say that a sorting/semisorting algorithm is stable if the output preserves the relative input order among records with equal keys, and otherwise we say that the algorithm is unstable.
We say an algorithm is race-free when no two concurrent operations in the algorithm can access the same memory location with at least one of them being a write [25]. A race-free algorithm is (internally) deterministic [15], and has many advantages, including ease of reasoning about the code, verifying correctness, debugging, and analyzing the performance. In our algorithms, all operations are deterministic once we fix the random seed.

The GSSB Semisort Algorithm
We first review the existing GSSB semisort algorithm [45]. As mentioned, the practical performance of GSSB is limited due to its excessive random memory accesses and restrictive interface. Our algorithm builds on the strengths of GSSB, while overcoming the aforementioned limitations. The GSSB algorithm assumes the input is a sequence of hashed keys in the range $[0, \ldots, n^k]$ for some constant $k \ge 1$, and semisorts the hashed keys.

Sampling and Bucketing. This is a key technique in GSSB to handle heavily duplicated keys. GSSB first selects a sequence $S$ of samples from the input sequence $A$ with sample rate $p = \Theta(1/\log n)$. The samples are used to give an initial partition of the records into buckets, such that the same key always goes to the same bucket. Based on the samples, the keys are divided into heavy keys and light keys. We call the records with heavy (light) keys the heavy (light) records. The theory behind this idea is that if sufficiently many ($\Omega(\log n)$) samples for a key $k$ can be obtained, one can estimate the frequency of $k$ (relatively) accurately; such keys are the heavy keys. Let $n_h$ be the number of heavy keys identified by the samples. The algorithm constructs $n_h$ heavy buckets, one for each individual heavy key. Meanwhile, a key $k$ with few ($o(\log n)$) samples is unlikely to appear many times in the input; such keys are the light keys. The light records are grouped into $n_l = \Theta(n/\log^2 n)$ light buckets by using the hashed value to randomly map each key to one of the $n_l$ buckets. Our new algorithm will also use a similar technique to detect heavy (duplicate) keys, but with different parameters for better performance.

Size Estimation and Scattering. For a bucket with $s$ samples, GSSB uses a size estimation function $F(s)$ to upper bound the bucket size whp. The algorithm allocates an array of size $(1 + \epsilon)F(s)$ for this bucket, for some constant $\epsilon > 0$.
Then each record is scattered to a random position in the corresponding bucket. This is performed using compare_and_swap, which atomically puts the record into the position, re-picking another position using linear probing upon collisions or conflicts. Our new algorithms do not use this approach.
Local Sort and Pack. After scattering, all the heavy keys are collected in individual heavy buckets. Each light bucket can contain more than one light key. The records in a bucket may not be contiguous due to the random scattering. GSSB then uses a radix sort (on the hashed keys) to refine the light buckets (a comparison sort is used in practice) and make them contiguous. A packing step is needed for heavy buckets to put records in contiguous slots. Our new algorithm uses a different approach in this step.
The main performance issue in GSSB is the random access in the scatter phase: each record is assigned to a random location, and has to retry if necessary. GSSB hence needs $O(n)$ random writes for the scattering phase, which is I/O-inefficient. This also requires slightly more space (and thus a larger memory footprint), since we need to ensure a load factor $\alpha < 1$. We will show how to overcome this issue, as well as how to make our new algorithms stable and race-free, in Sec. 3.
Another major issue of GSSB is its interface. GSSB assumes a collision-free hash function $h: K \mapsto [1, \ldots, n^k]$ that maps arbitrary key types to random integers (hashed keys), and the algorithm (and implementation) directly semisorts the hashed keys, which are random integers. When using more realistic and practical hash functions with possible collisions, one has to perform preprocessing and postprocessing to deal with collisions. While such pre/postprocessing does not asymptotically increase the cost of the algorithm in theory, it can in practice incur significant time overheads comparable to semisort itself ($O(n)$), and therefore makes using semisort in applications prohibitively costly relative to sorting.

Our New Algorithms
In this section, we present our algorithms for semisort and related problems. We present the useful notations in Tab. 1. For simplicity, we first focus on semisort, and then explain how to modify it to adapt to histogram and collect-reduce in Sec. 3.5.
Our semisort algorithm follows the same framework as GSSB, but employs novel techniques to improve the performance of all the steps. Our new algorithm is I/O-friendly, stable, and race-free. In contrast to GSSB, we do not require pre-hashing the keys. Our algorithm directly handles input records of any type, and extracts the hashed keys by applying the user hash function in the algorithm when needed. This generality in the interface also improves efficiency both in time and space: it avoids the pre- and post-processing, as well as the hash table to pre-hash keys and resolve collisions, which can incur another $O(n)$ random reads and $O(n)$ extra space. Our algorithm is stable: all records with the same key are kept in their input order in the output. This feature is useful for collect-reduce and histogram and increases their generality, as discussed in Sec. 3.5. Our algorithm is also race-free, which means no concurrent writes are needed to any shared memory location. This also makes our algorithms simple, practical, and internally deterministic (i.e., the output does not depend on the runtime scheduler).

Algorithm 1: The Semisort Algorithm
Input: The input array $A$, a user hash function $h$, and a comparison function comp ($=_K$ or $<_K$). The original (top-level) input size is $n$, and the current subproblem size is $n'$.
Output: The semisort result in $A$ (in-place).
Parameters: $n_l = 2^\lambda$: the number of light buckets.
We start by overviewing the high-level idea, and then present more details in Sec. 3.1 to 3.3. We discuss how to support histogram and collect-reduce in Sec. 3.5. In Sec. 3.6, we present the theoretical analysis of our algorithm, and discuss the choices of parameters in theory and in practice. Finally, we discuss the improvements of our algorithms over GSSB in Sec. 4.1, and the comparison to existing sorting algorithms in Sec. 4.2.

Figure 2: Our algorithm with a running example. We consider an input with $n = 16$ records with keys given. $|S| = 8$ samples are taken, and keys with more than 2 samples are heavy keys. $n_h = n_l = 2$ in this example. We have $n/b = 4$ subarrays, each with $b = 4$ records. We compute the counting matrix $C$ and the prefix array $X$ as shown, and records can be distributed accordingly. Finally, the local refining step recursively solves the 2 light buckets.
(1) Sampling and Bucketing. The algorithm first takes a set of samples from the input, uses them to identify the heavy and light keys, and assigns a bucket id to each key (a separate bucket per heavy key, while light keys share the light buckets). (2) Blocked Distributing. Next, it counts the exact number of records in each bucket. Given the bucket sizes, the algorithm distributes the input records to their associated buckets in an I/O-friendly manner. By performing exact counting, the temporary arrays used are only of size $n$, and no parallel hash tables are necessary (as in GSSB). This distribution step makes the algorithm stable and race-free. (3) Local Refining. After Step 2, the heavy keys are at their final positions in the heavy buckets. For light buckets, unlike GSSB, our algorithm recursively semisorts them until the recursive input size is small enough (i.e., fitting in cache), at which point the keys are semisorted sequentially. This approach allows the algorithm to detect "medium-level" heavy keys in subsequent recursive rounds and also reduces the total number of I/Os. The pseudocode of our algorithm is given in Alg. 1, and a running example is given in Fig. 2. Next, we introduce the details of each step and explain why our decisions improve the performance of our algorithm. Since our algorithm uses recursive calls, we use $n$ as the input size of the original (top-level) problem, and use $n'$ as the current subproblem size in the recursive call.

Step 1: Sampling and Bucketing
The goal of the sampling and bucketing step is similar to GSSB: we want to identify the heavy and light keys, which decides the bucket id for each record. Instead of having $\Theta(n/\log^2 n)$ light buckets as in GSSB, we use the numbers of heavy and light buckets ($n_h$ and $n_l$) as parameters. We use $n_l$ as a tunable parameter, and set the upper bound of $n_h$ accordingly as $O(n_l)$. We will later discuss in Sec. 3.6 how to pick $n_l$ to achieve the best practical performance.
To determine heavy keys, we take a sequence of samples $S$ of size $\Theta(n_l \log n)$ by selecting each record uniformly at random (Lines 2-10 in Alg. 1). The light keys are the keys appearing fewer than $\log n$ times in the samples, which indicates that their actual number of occurrences is small. We group multiple light keys into one bucket based on the hashed keys. We create $n_l$ light buckets by evenly partitioning the range of hashed keys (given by the user hash function) into $n_l$ buckets. For simplicity, we use $n_l = 2^\lambda$ as a power of 2, and the light bucket id of a key $k$ is obtained by taking the last (least significant) $\lambda$ bits of the hash value of $k$, i.e., the bucket id is $(h(k) \bmod n_l)$.
The heavy keys are those appearing at least $\log n$ times in the samples; as in the analysis of GSSB, their actual numbers of occurrences are large whp. Given the sample size $|S| = \Theta(n_l \log n)$, the number of heavy keys is $n_h = O(n_l)$. We will create $n_h$ buckets with ids in $[n_l, n_l + n_h)$ (the first $n_l$ buckets are for the light keys). We use a sequential hash table $H$ to store all heavy keys associated with their bucket ids, referred to as the heavy table, so that the later steps can look up whether a key is heavy in constant work.
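The sampling and bucket-id logic of this step can be sketched as follows (a sequential Python sketch under our own names; we simplify the paper's $\Theta(n_l \log n)$ sample count and $\log n$ threshold to base-2 logarithms, and a fixed seed stands in for the algorithm's fixed random seed):

```python
import random
from collections import Counter

def sample_and_bucket(A, key, n_l, seed=0):
    """Step 1 sketch: draw ~ n_l * log(n) uniform samples; a key with at
    least log(n) samples is declared heavy and gets its own bucket id in
    [n_l, n_l + n_h). Returns the "heavy table" (key -> bucket id)."""
    n = len(A)
    rng = random.Random(seed)             # fixed seed keeps the run deterministic
    log_n = max(1, n.bit_length())        # ~ log2(n)
    counts = Counter(key(rng.choice(A)) for _ in range(n_l * log_n))
    heavy = {}
    for k, c in counts.items():
        if c >= log_n:                    # at least log n samples -> heavy
            heavy[k] = n_l + len(heavy)   # one bucket per heavy key
    return heavy

def get_bucket_id(k, heavy, h, n_l):
    # Heavy keys go to their own bucket; light keys share the n_l light
    # buckets via the low bits of their hash value (h(k) mod n_l).
    return heavy[k] if k in heavy else h(k) % n_l
```

On an input where one key makes up 90% of the records, that key is detected as heavy whp, while infrequent keys fall through to the light buckets.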

Step 2: Blocked Distributing
Unlike GSSB, which uses a scattering step to place records at random positions in the buckets, our algorithm uses a more I/O-friendly and space-efficient approach. The goal of this step is to count the exact number of records in each bucket, and distribute all records to the associated buckets into contiguous slots. Since we have the exact counts, we only need an array $T$ of size $n$ for all the buckets, making our algorithm space-efficient. This distribution step makes our algorithm stable and race-free.
Our idea is inspired by recent sorting algorithms [10, 18, 54, 57]. We first partition the sequence evenly into $n'/b$ subarrays, each with $b$ records (recall that $n'$ is the current subproblem size). We then process all the subarrays in parallel (Line 12), but sequentially within each individual subarray (Line 13). We count the number of records in each bucket in an $(n'/b) \times (n_h + n_l)$ matrix $C$, which is referred to as the counting matrix. In particular, $C_{ij}$ is the number of records in subarray $i$ falling into bucket $j$. To do this, within each subarray $i$, our algorithm determines which bucket each key $k$ belongs to using the GetBucketId function (Line 27). This function first looks up the heavy table $H$ to check if $k$ is a heavy key, and if so, it obtains the bucket id $j$ from $H$. Otherwise, the bucket id of a light key $k$ is simply given by $j = h(k) \bmod n_l$. We then increment the corresponding cell $C_{ij}$ by one (Line 14).
We then distribute all records in the input to their corresponding buckets, using the information in $C$. To do so, we compute the offset per subarray per bucket as a prefix array $X$ that has the same size as $C$. Array $X$ can be computed using the prefix sum of $C$, but in column-major order (Line 15; see an illustration in Fig. 2). After the prefix array $X$ is computed, we once again process each subarray and move each record to its corresponding bucket (Lines 18-22) in the temporary array $T$. This step takes $O(1)$ work per record: we use $O(1)$ work to decide which bucket a record is in, and after that, we move the record and increment the offset counter in $X$.
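The counting-matrix and column-major prefix-sum logic of this step can be sketched sequentially (a Python sketch under our own names; the two outer loops over subarrays are parallel in the actual algorithm, and the prefix sum is a parallel scan):

```python
def blocked_distribute(A, bucket_id, num_buckets, b):
    """Step 2 sketch: count per-subarray, per-bucket occurrences in a
    counting matrix C, take a column-major prefix sum to get offsets X,
    then move each record into T. Stable and race-free: subarray i
    writes bucket j only within [X[i][j], X[i][j] + C[i][j])."""
    m = (len(A) + b - 1) // b                     # number of subarrays
    C = [[0] * num_buckets for _ in range(m)]
    for i in range(m):                            # parallel over subarrays in the paper
        for r in A[i * b:(i + 1) * b]:
            C[i][bucket_id(r)] += 1
    X = [[0] * num_buckets for _ in range(m)]
    total = 0
    for j in range(num_buckets):                  # column-major prefix sum
        for i in range(m):
            X[i][j] = total
            total += C[i][j]
    T = [None] * len(A)
    off = [row[:] for row in X]                   # running write offsets
    for i in range(m):                            # again parallel over subarrays
        for r in A[i * b:(i + 1) * b]:
            j = bucket_id(r)
            T[off[i][j]] = r
            off[i][j] += 1
    bucket_starts = [X[0][j] for j in range(num_buckets)]
    return T, bucket_starts
```

Because subarray $i$'s write range for bucket $j$ is disjoint from every other subarray's range, the parallel version needs no atomics, and processing subarrays in index order preserves the input order within each bucket.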
We observed that, with appropriately chosen parameters, our approach is much faster in practice than the corresponding step in GSSB, mainly due to the smaller memory footprint and fewer memory accesses. We will later show the analysis in Sec. 3.6.

Step 3: Local Refining
After the previous step, all heavy keys are stored contiguously in their corresponding heavy buckets, which are also their final positions in $T$. Light keys are still unsorted. We work on each light bucket in parallel by recursively semisorting each of them. We stop recursing and switch to the base case when the bucket size is small enough to fit in cache, which is decided by the parameter $\theta$ (Line 1). For our experiments with input sizes ranging from $10^8$ to $10^9$, we typically need one more level of recursion before reaching the base case, if most of the keys are light keys. Since the base-case size fits in cache, the time to semisort the base cases is small. We provide two solutions for the base cases: semisort$_=$ and semisort$_<$.

semisort$_=$. In the base case, semisort$_=$ uses a sequential hash table with chaining. We first build a hash table of size $(1+\epsilon)n'$ for some constant $\epsilon > 0$. Then, we iterate over all keys and insert each key into the hash table with separate chaining. Finally, all records are packed to the output by looping over the hash table in order. Chaining allows the algorithm to maintain the order of the original input for records with the same key, and thus our algorithm is stable. Since each base case is small, the hash table can be maintained locally by each thread.

semisort$_<$. In the base case, semisort$_<$ uses a standard comparison sort. By using a stable comparison sort, we can also guarantee the stability of our semisort.
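A minimal sketch of the semisort$_=$ base case (in Python, with our own names; the paper's implementation is a sequential table of size $(1+\epsilon)n'$, here with $\epsilon = 0.5$, and our sketch probes across distinct keys while chaining records of the same key):

```python
def base_case_semisort_eq(A, key, h):
    """Base-case sketch of semisort_=: insert every record into a small
    hash table, one chain per distinct key, then pack the chains by
    scanning the table in slot order. Each chain preserves the input
    order of its records, so the result is stable."""
    size = len(A) + len(A) // 2 + 1          # (1 + eps) * n' slots, eps = 0.5
    table = [None] * size
    for r in A:
        slot = h(key(r)) % size
        while True:
            if table[slot] is None:
                table[slot] = (key(r), [r])  # start a new chain for this key
                break
            if table[slot][0] == key(r):     # same key: append to its chain
                table[slot][1].append(r)
                break
            slot = (slot + 1) % size         # distinct key in slot: probe on
    return [r for cell in table if cell is not None for r in cell[1]]
```

Since each base case fits in cache, this table lives entirely in the cache of the thread processing the bucket.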

In-place Optimization
Before the recursive call in Alg. 1, we copy the temporary array $T$ back to $A$ (Line 23). This copy accesses the whole arrays $A$ and $T$, which is expensive in practice. We note that we can save this copying by swapping the two arrays $A$ and $T$ in the recursive call. Namely, we skip Line 23, and in Line 25 we sort the light buckets in $T$, using the corresponding part of $A$ as the other array to hold the output. For the base cases and the heavy buckets, if they happen to reside in $T$, we copy them back to $A$. By doing this, we avoid the copying in Line 23. This reuses the auxiliary array $T$ and also avoids allocating new memory at every recursive level. Since in most cases the recursion reaches the base case within two levels, the entire algorithm copies the data twice per record: first from $A$ to $T$, and then from $T$ back to $A$.
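The buffer-swap idea can be illustrated with a toy recursion (a Python sketch with our own names: it groups by one hash bit per level instead of $n_l$ buckets, `out` stands in for the final copy back to $A$, and Python slices copy data, so only the alternating roles of the two buffers are illustrated, not true in-place behavior):

```python
def semisort_swap(src, dst, out, lo, hi, shift, h, theta):
    """Group src[lo:hi] by one hash bit, writing into dst; recurse with
    the buffer roles swapped instead of copying dst back to src. A
    finished range is emitted to `out` only at the base case, so two
    buffers suffice for the entire recursion."""
    n = hi - lo
    if n <= theta or shift < 0:               # base case: solve locally, emit
        out[lo:hi] = sorted(src[lo:hi], key=h)
        return
    # one-level distribute on hash bit `shift` (stands in for n_l buckets)
    zeros = [r for r in src[lo:hi] if (h(r) >> shift) & 1 == 0]
    ones  = [r for r in src[lo:hi] if (h(r) >> shift) & 1 == 1]
    dst[lo:lo + len(zeros)] = zeros
    dst[lo + len(zeros):hi] = ones
    mid = lo + len(zeros)
    # recurse with src and dst swapped: the old input range is dead and
    # can be overwritten by the next level's distribution
    semisort_swap(dst, src, out, lo, mid, shift - 1, h, theta)
    semisort_swap(dst, src, out, mid, hi, shift - 1, h, theta)
```

Each level reads from one buffer and writes the other, so no level needs a copy-back before recursing.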
Here we use "in-place" to indicate that the input and output of semisort are in the same array. Our algorithm still uses $O(n)$ extra space. We will discuss how to reduce the space usage in Sec. 6.

Supporting Histogram and Collect-Reduce
Using our semisort algorithm, the histogram and collect-reduce primitives can be supported with minor modifications. Here we elaborate on collect-reduce, since histogram can be considered as collect-reduce with values always equal to 1 for all records.
We still use the Sampling and Bucketing step to determine the heavy keys. Then, in the Blocked Distributing step, it is unnecessary to distribute the heavy keys to their corresponding buckets. Instead, we first directly compute the reduced values (or counts for histogram) for the heavy records in each subarray (all the subarrays can be processed in parallel), and then reduce the results across all subarrays. In the base case of the Local Refining step, we use the version based on hash tables. When any duplicate is identified, we directly combine the values instead of chaining. Since the algorithm is stable, it works with any associative reduce function (in particular, the function need not be commutative).
Generally speaking, histogram and collect-reduce can be significantly faster than semisort when there are many heavy duplicate keys, as we do not need to distribute the heavy records and only need to distribute the "locally reduced value" for each heavy key in each subarray. When no or few duplicate keys are in the input, histogram and collect-reduce can perform slightly slower than semisort. This is because they perform almost identical computations as semisort to reorder records, but need an extra step to pack the keys and reduced values into the output.
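The heavy-key specialization for collect-reduce can be sketched as follows (a sequential Python sketch under our own names; the loop over subarrays is parallel in the actual algorithm, and combining the per-subarray partials in index order is what makes a merely associative, non-commutative reduce function sufficient):

```python
def collect_reduce_heavy(A, key, f_m, combine, identity, heavy_keys, b):
    """Sketch of collect-reduce on heavy keys: instead of distributing
    heavy records, each subarray reduces its own heavy records locally,
    and the partial values are then combined subarray by subarray, left
    to right, preserving the input order of the records."""
    m = (len(A) + b - 1) // b
    partial = [dict() for _ in range(m)]          # per-subarray reduced values
    for i in range(m):                            # parallel in the real algorithm
        for r in A[i * b:(i + 1) * b]:
            k = key(r)
            if k in heavy_keys:
                partial[i][k] = combine(partial[i].get(k, identity), f_m(r))
    result = {}
    for i in range(m):                            # combine partials in index order
        for k, v in partial[i].items():
            result[k] = combine(result.get(k, identity), v)
    return result
```

With a non-commutative monoid such as string concatenation, the result still equals the in-order reduction over the original input, because both the within-subarray and across-subarray combines follow input order.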

Analysis and Parameter Choosing
Our new semisort algorithm has three parameters: $b$ (the subarray size), $n_l$ (the number of light buckets), and $\theta$ (the base-case size). Other parameters (e.g., the number of heavy buckets $n_h$) are set accordingly. The values of $b$ and $n_l$ are fixed for all levels of the recursion. To ensure the space usage is $O(n)$, we will assume $n_l \le b$, since the matrices $C$ and $X$ have size $\Theta(n_l \cdot n/b)$. We also assume the sample set size $|S| = n_l \log n = o(n)$. In the following, we will use $n$ as the original problem size, and use $n'$ as the current size in the recursion.
We will analyze the cost bounds and show that our semisort algorithm is efficient under reasonable assumptions about modern multicore architectures. Then we will show how to select the parameters in practice for the best practical performance.

Theoretical Analysis. We start by analyzing the number of recursion levels in our algorithm.

Lemma 3.1. The number of recursion levels is $O(\log_{n_l}(n/\theta))$ whp for both semisort$_=$ and semisort$_<$.
Proof. By the same analysis as for GSSB [45], the number of records in each light bucket is $O(n/n_l)$ whp. Therefore, the light bucket size shrinks by a factor of $\Theta(n_l)$ whp in each level of the recursion, and the number of recursion levels is $O(\log_{n_l}(n/\theta))$ whp. □

For simplicity in stating the bounds, we use $R = O(\log_{n_l}(n/\theta))$ to denote the number of recursion levels. We start with the work of the algorithms and present the result in Thm. 3.2.

Theorem 3.2. The work of semisort$_=$ is $O(nR)$ whp. The work of semisort$_<$ is $O(nR + n\log\theta)$ whp.
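As a concrete check of Lemma 3.1, plugging in the theoretical parameter choices used later in Thm. 3.4 ($n_l = \Theta(n^{1/4})$ and $\theta = \Theta(n^{1/2})$) gives a constant recursion depth:

```latex
% Worked example: recursion depth under the parameters of Thm. 3.4.
% Each level shrinks a light bucket by a factor of \Theta(n_l),
% so the depth is logarithmic in n/\theta with base n_l:
R \;=\; O\!\left(\log_{n_l}\frac{n}{\theta}\right)
  \;=\; O\!\left(\frac{\log\!\left(n/n^{1/2}\right)}{\log n^{1/4}}\right)
  \;=\; O\!\left(\frac{(1/2)\log n}{(1/4)\log n}\right)
  \;=\; O(1).
```

This matches the observation above that, in practice, one extra level of recursion typically suffices before reaching the base case.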
Proof. We first show the work analysis for semisort$_=$. We start by considering the top level of the recursion. As assumed above, the number of samples is $O(n_l \log n) = o(n)$, and thus the Sampling and Bucketing step takes $O(n)$ work. In the Blocked Distributing step, it takes $O(1)$ work per record to find the bucket it belongs to. As mentioned above, we assume $n_l \le b$, so that the counting matrix $C$ and the prefix array $X$ have sizes $O(n)$, and computing the prefix sum also takes $O(n)$ work. The step to distribute the records to array $T$ (Lines 18-22) is also $O(n)$, since each record is processed once. For each recursion level, this argument is still true, and the work of all the subproblems in one level adds up to $O(n)$. With $R$ recursion levels, the work before entering the base cases is $O(nR)$ for both semisort$_=$ and semisort$_<$. For semisort$_=$, the work of each base case is $O(n')$, which gives $O(n)$ total work for all base cases. For semisort$_<$, the work of each base case is $O(n' \log n')$, where $n'$ can be at most $\theta$. Therefore, the total base-case work is $O(n \log \theta)$ for semisort$_<$. Combining the results gives the bounds in Thm. 3.2. □

Although semisort$_<$ has a higher work bound, the overhead is caused by the comparison sort in the base cases. However, the base cases fit in cache and are highly optimized. In the experiments, semisort$_<$ shows performance as good as semisort$_=$ in most cases.
We then analyze the span of semisort= and semisort<, and show that they are highly parallel.
Theorem 3.3. The span of semisort= is $O((\beta + n_h\log n)\cdot r + \theta)$ whp. The span of semisort< is $O((\beta + n_h\log n)\cdot r + \log n)$ whp.
Proof. The Sampling and Bucketing step is executed sequentially with $O(n_h\log n)$ span. We note that this step can easily be parallelized [45], but our implementation still performs it sequentially, since it is cheap anyway. For the distributing step, we have two sequential for-loops (Lines 13 and 19), leading to $O(\beta)$ span. Computing the prefix sum ($X$ from $C$) has $O(\log n)$ span. In total, the span of one recursion level is $O(\beta + n_h\log n)$. Hence, over $r$ recursion levels, both algorithms have $O((\beta + n_h\log n)\cdot r)$ span before the base cases. semisort= uses sequential hash tables in the base cases, which leads to $O(\theta)$ span. semisort< uses a comparison sort in the base cases, which can achieve $O(\log n)$ span whp in theory [17] (our implementation coarsens the base case by using a sequential sort, since the base-case size is small). Combining the results above gives the bounds in Thm. 3.3. □
Considering both work and span, the parallelism (defined as the ratio between work and span) of both algorithms is roughly $\Theta(n/\beta)$ (in practice we choose $\beta$ to be much larger than $n_h\log n$ and $\theta$). Given the number of processors $P$ in a machine, our semisort algorithm achieves sufficient parallelism if we can set $n/\beta = \Omega(P)$.
We now analyze the I/O cost of the algorithms with parameter choices that make the bound optimal ($O(n/B)$). We make the assumption that $M/B = \Omega(n^{1/2})$ (recall that $M$ and $B$ are the cache size and cacheline size, respectively). For reasonable values of $n \le 10^{12}$, this assumption holds both for commodity machines (e.g., laptops) and for more powerful servers. We present our results in Thm. 3.4.
Theorem 3.4. Assume $M/B = \Omega(n^{1/2})$. Using parameters $n_b = \Theta(n^{1/4})$, $\theta = \Theta(n^{1/2})$, and $\beta = \Theta(n^{3/4})$, both semisort= and semisort< have I/O cost $O(n/B)$ whp.
Proof. Given the parameters in the theorem, the number of recursion levels is $r = O(1)$ whp. Therefore, we only analyze the top-level recursion. Since the sizes of $C$ and $X$ are $O(n_b\cdot(n/\beta)) = O(\sqrt{n}) = O(M)$, all memory accesses to arrays $C$ and $X$ fully fit into cache except for the first access. When $M/B = \Omega(n^{1/2})$ and $\theta = \Theta(n^{1/2})$, we can choose $\theta$ to fit the base cases in cache, such that the base cases can be solved without additional main-memory accesses after loading the data into the cache. The only cache misses are those incurred when accessing the input array $A$ and the bucket array $T$. The accesses to $A$ are all serial. For $T$, we write serially from $(n_b + n_h)\cdot(n/\beta)$ pointers, as stored in the $X$ matrix. Even when all the pointers are non-consecutive, only $(n_b + n_h)\cdot(n/\beta) = O(n^{1/2})$ cachelines are active at any time, and they all fit in cache. For every pointer, there is one cache miss every $B$ accesses to the array $T$. Therefore, the total I/O cost to generate $T$ is $O(n/B)$. Note that this analysis holds both at the root level (when the input size is $n$) and at the recursive levels (the total sizes of $C$ and $X$ over all subproblems in the same level are still $(n_b + n_h)\cdot(n/\beta) = O(n^{1/2})$). In summary, both semisort= and semisort< have I/O cost $O(n/B)$ whp assuming $M/B = \Omega(n^{1/2})$, which improves the $O(n)$ I/O bound of GSSB by a factor of $O(B)$. The bound is optimal, since just loading the input requires $\Omega(n/B)$ I/Os. □
Since I/O-efficiency is one of our main design goals, we use the parameters in
Thm. 3.4 to present the work and span bounds below.
Theorem 3.5. Assume $M/B = \Omega(n^{1/2})$, and use parameters $n_b = \Theta(n^{1/4})$, $\theta = \Theta(n^{1/2})$, and $\beta = \Theta(n^{3/4})$. semisort= has $O(n)$ work, $O(n^{3/4})$ span, and $O(n/B)$ I/O cost. semisort< has $O(n\log n)$ work, $O(n^{3/4})$ span, and $O(n/B)$ I/O cost. All bounds hold whp in $n$.
Parameters in our Implementations. The performance of our semisort algorithm is reasonably consistent over a large parameter range. The best parameters for each input instance can differ, depending on the input size, the ratio of heavy records, etc. In our implementation and all experiments, we pick $n_b = 2^{10}$, $\beta = n/5000$ (at most 5000 subarrays across all subproblems in one recursion level), and $\theta = 2^{14}$. These values satisfy the conditions of the theoretical analysis in Thm. 3.5 when $n = 10^8$ to $10^9$. We set the number of samples $|S| = 500\log n$, so there can be at most $n_h = 500$ heavy keys. These parameters ensure that the matrices $C$ and $X$ and the base cases are small enough to fit in the last-level cache of modern multicore machines.
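The concrete choices above can be written out as a small helper. This is a sketch only: the struct and field names are illustrative, and the logarithm base for the sample count is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative parameter selection mirroring the implementation choices
// described in the text (names are not from the actual code).
struct SemisortParams {
  uint64_t n_b;      // number of light buckets
  uint64_t beta;     // block (subarray) size, so that n/beta <= 5000 blocks
  uint64_t theta;    // base-case threshold (base cases fit in cache)
  uint64_t samples;  // |S|, number of sampled keys
};

SemisortParams pick_params(uint64_t n) {
  SemisortParams p;
  p.n_b = 1 << 10;                           // 2^10 light buckets
  p.beta = std::max<uint64_t>(1, n / 5000);  // at most 5000 subarrays per level
  p.theta = 1 << 14;                         // 2^14 base-case threshold
  // |S| = 500 log n samples (log base chosen illustratively here).
  p.samples = (uint64_t)(500.0 * std::log2((double)n));
  return p;
}
```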

Improvements over GSSB
In this section, we compare our algorithm(s) with the existing semisort algorithm GSSB and discuss our improvements.
Flexible Interface. Recall that GSSB requires hashed (integer) keys as input, which needs pre- and post-processing to resolve collisions. Our algorithm supports arbitrary key types $K$ with $=_K$ or $<_K$, together with a user-supplied hash function. For integer keys, we provide the option to use the identity function, resulting in semisort-i= and semisort-i<, which can be much faster in many cases, although we note that these versions do not admit equally good theoretical bounds. Our interface also supports histogram and collect-reduce with minor changes.
Low Space Usage. In the Blocked Distributing step, we compute the exact counts of the buckets, so when distributing the keys, the total size of the buckets is exactly $n$, instead of $\Theta(n)$ (with a constant larger than 1) as in GSSB, whose buckets need a load factor smaller than 1 because of the random scatter. Besides the space overhead, GSSB also needs a costly packing step.
I/O-Efficiency. Our algorithm also uses several techniques to enable a better memory-access pattern. We pick a small number of buckets ($n_b = 2^{10}$), as opposed to $O(n/\log^2 n)$ of them in GSSB, so that the counting matrix $C$ and its prefix sum $X$ fit in cache (recall that we access them in column-major order). As such, the Blocked Distributing step incurs no random accesses to main memory.
Stability and Determinism. By avoiding parallel hash tables, our semisort algorithms (both semisort= and semisort<) are stable and race-free. GSSB is not race-free (due to its use of parallel hash tables) and is unstable (heavy keys appear in random order), and thus cannot support non-commutative operations in collect-reduce.

Relationship to Sample Sort and Integer Sort
Many ideas in our semisort algorithm are closely related to ideas in sorting algorithms, as we discuss in this section.
Samplesort. Samplesort generalizes quicksort by using multiple pivots; clearly this algorithm can be used to solve the semisort< problem. The algorithm selects $k$ pivots, uses them to partition the input into $k+1$ buckets, and sorts all of the buckets in parallel. We refer the audience to [10] for a detailed literature review of samplesort. We compare against the state-of-the-art samplesorts from ParlayLib [13] and IPS⁴o [10] in our experiments.
Similar to samplesort, our algorithm also partitions the input into buckets and processes them in parallel. However, samplesort is a comparison sort that requires the $<_K$ operation and has an $\Omega(n\log n)$ work lower bound, whereas our semisort= algorithm only requires $O(n)$ work. The ParlayLib samplesort [13] uses one level of partitioning. IPS⁴o [10] (preliminary version in [59]) also uses a small number of samples and sorts recursively. It uses an implicit search tree (a breadth-first layout of the sorted pivots) to find the bucket for each record, which is not required in our approach. It also uses a clever approach for the distribution step, which we discuss in Sec. 6.
Integer sort. Integer sorting can be used to semisort integer keys for semisort=, or to semisort the hash values of keys of any type with an extra step to resolve collisions. Unlike sequential radix sort, which starts from the least-significant bits, all parallel integer sort algorithms are top-down and first look at the most-significant bits. We refer to [57] for a detailed literature review of parallel integer sort. We compare against the state-of-the-art integer sorts from ParlayLib [13], RegionsSort [57], and IPS²Ra [10] in our experiments.
The major advantage of our semisort algorithm over integer sorting is that our algorithm can identify heavy keys. Consider a heavy key $k$ and a light key $k+1$. Our algorithm puts $k$ in a separate heavy bucket, and only deals with $k+1$ in a light bucket in the last step. For existing integer sorts [10, 13, 57], both keys are likely kept in the same bucket at every level and separated only in the last round, which can result in significant wasted work and load imbalance.
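The sampling-based heavy-key detection can be sketched as follows. This is a simplified, deterministic stand-in: the function name, the stride-based sampling, and the logarithm base are illustrative; the actual algorithm samples randomly, and its guarantees hold whp.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch: draw roughly n_h * log n sample keys; any key appearing at least
// log n times among the samples is classified as heavy and receives its own
// bucket. Stride sampling is used here only to keep the example
// deterministic.
std::vector<uint64_t> detect_heavy_keys(const std::vector<uint64_t>& keys,
                                        size_t num_samples) {
  double log_n = std::log2((double)keys.size());
  size_t stride = keys.size() / num_samples;
  if (stride == 0) stride = 1;
  std::unordered_map<uint64_t, size_t> sample_count;
  for (size_t i = 0; i < keys.size(); i += stride) sample_count[keys[i]]++;
  std::vector<uint64_t> heavy;
  for (auto& [k, c] : sample_count)
    if ((double)c >= log_n) heavy.push_back(k);  // heavy key: own bucket
  return heavy;
}
```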

Experiments
Experimental Setup. We run our experiments on a 96-core machine (with two-way hyper-threading) with four 2.1 GHz Intel Xeon Gold 6252 processors (each with a 36MB L3 cache) and 1.5TB of main memory. We implement our algorithms in C++ using ParlayLib [13] for fork-join parallelism and some parallel primitives. We compile our code with clang version 14.0.6 using the -O3 flag. We always use numactl -i all to interleave memory across all CPUs, except for sequential tests. We run each test four times and report the median of the last three runs. All running times are given in seconds.
Baseline Algorithms. We compare our algorithms with state-of-the-art comparison sorting, integer sorting, and collect-reduce algorithms. The list of baseline algorithms we compare against is given in Tab. 2. For fairness and consistency, we require the output to be written to the input array (i.e., in-place). We note that this is beneficial for PLSS, IPS⁴o, IPS²Ra, and RS, as they are originally designed for the in-place setting. Some of the baselines only work for integer types (integer-only), including PLIS, GSSB, RS, and IPS²Ra. IPS⁴o and PLSS work on any input type (any-type). Among the any-type algorithms, semisort<, PLSS, and IPS⁴o require the less-than test $<_K$, while our semisort= only needs the equality test $=_K$. GSSB assumes the input keys are already hashed and does not resolve collisions, so we also categorize it as integer-only. Among all implementations, all of our algorithms and PLIS are stable. This also means that they can be applied to collect-reduce with arbitrary (associative) reduce operations, while the others additionally require the reduce operation to be commutative. We note that ParlayLib contains two versions of samplesort: a slower stable one and a faster unstable one. Our experiments use the unstable but faster version. When comparing average performance, we always use the geometric mean.
Our Algorithms. We use the two versions of our algorithm, semisort< and semisort=, that work on any key type. In tables and figures, we use "Ours=" and "Ours<" to refer to them, and "Ours⊕" to refer to our collect-reduce implementation. When comparing with the integer-only implementations, we use simplified versions without hashing (see Sec. 4.1), called semisort-i= and semisort-i< (or "Ours-i=" and "Ours-i<"), where the hash function is the identity function. The parameter choices of our algorithms are given in Sec. 3.6.
Input Distributions. We use three distributions to evaluate our algorithms: uniform($\mu$), exponential($\lambda$), and Zipfian($s$). Unless specified otherwise, the default setting is $n = 10^9$ with 64-bit keys and 64-bit values. We also include tests of our algorithms on varying input sizes and key lengths (Figs. 3b and 4). For the uniform distribution, we test $\mu = 10^1, 10^3, 10^5, 10^7, 10^9$. For the exponential distribution, we test $\lambda = 1\times10^{-5}, 2\times10^{-5}, 5\times10^{-5}, 7\times10^{-5}, 1\times10^{-4}$. For the Zipfian distribution, we test $s = 0.6, 0.8, 1, 1.2, 1.5$. We use distribution-param to denote the input distribution with parameter param (e.g., uniform-$10^9$). We show relevant statistics of the inputs along with our results in Tab. 3. We present the number of distinct keys, the maximum frequency, and the ratio of keys with more than $500\log n$ occurrences (noted as "Heavy Freq." in Tab. 3), measured for each distribution to indicate the skewness of the data. For the synthetic data, we always set the value type to be the same as the key type. For most of the tests, we provide figures for one representative distribution, and provide more results in the appendix.
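As an illustration of the skewed inputs, a self-contained Zipfian key generator might look as follows. This is a sketch via inversion on a precomputed CDF; the exact generator used in the benchmarks may differ, and the function name is illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Zipfian(s) keys over [1, universe]: key k is drawn with probability
// proportional to 1/k^s, by binary-searching a uniform draw in the CDF.
std::vector<uint64_t> zipfian_keys(size_t n, size_t universe, double s,
                                   uint64_t seed) {
  std::vector<double> cdf(universe);
  double sum = 0.0;
  for (size_t k = 1; k <= universe; k++) {
    sum += 1.0 / std::pow((double)k, s);  // unnormalized probability of key k
    cdf[k - 1] = sum;
  }
  std::mt19937_64 gen(seed);
  std::uniform_real_distribution<double> dist(0.0, sum);
  std::vector<uint64_t> keys(n);
  for (size_t i = 0; i < n; i++) {
    double x = dist(gen);
    // First index whose CDF value reaches x gives the sampled key (1-based).
    keys[i] = 1 + (uint64_t)(std::lower_bound(cdf.begin(), cdf.end(), x) -
                             cdf.begin());
  }
  return keys;
}
```

Smaller $s$ yields a flatter distribution; larger $s$ concentrates mass on the smallest keys, producing the heavy keys discussed above.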

Overall Performance
We present the running times of all tested implementations on $n = 10^9$ 64-bit keys with different distributions in Tab. 3, and a heatmap (normalizing all running times to the fastest on each test) in Fig. 1. On all but one test, our algorithms are the two best implementations. Among any-type algorithms, our semisort= and semisort< are 1.03-2.47× and 1.04-2.53× faster (respectively) than the best of the other algorithms. Among integer-only algorithms, our semisort-i= is 1.09-2.78× faster than the other algorithms. Our semisort-i< is about 20% slower than PLIS on one test, and is up to 2.82× faster than all baselines on all other tests. Overall, our algorithms are always faster than the baseline algorithms in geometric mean. Note that some of the baselines are competitive on individual tests, such as IPS⁴o on uniform-$10^3$ and uniform-$10^9$, PLIS on uniform-$10^9$ and Zipfian-0.6, and IPS²Ra on Zipfian-0.6. However, their performance can be unstable across distributions. IPS⁴o is relatively fast on uniform distributions but performs worse on skewed distributions. We also compute the geometric means in Tab. 3 and Fig. 1 to compare the performance on each distribution. Based on these numbers, semisort= and semisort< have very close performance (within 5%). All the other algorithms are at least 30% slower than both of our implementations on average. We also show the relative performance for 32-bit and 128-bit keys in Figs. 5 and 6. On average, our algorithms are consistently the fastest. We note that not all comparisons are apples-to-apples. PLSS and IPS⁴o perform general sorting, which is asymptotically harder than semisort. PLIS, RS, and IPS²Ra perform integer sorting, which is also slightly different from semisorting. Also, PLIS and all of our implementations are stable, while the others are not (see Tab. 2).
Interestingly, the integer sort algorithms can be slower than comparison sorts on 64-bit keys. We tested 32-bit and 128-bit keys and show the running times in Figs. 19 to 24 in the appendix. Unsurprisingly, integer sort algorithms are usually faster than comparison sort algorithms on 32-bit keys, and get worse on 128-bit keys (PLIS is the only integer sort in Tab. 2 that supports 128-bit keys). On average, our algorithms are still the fastest on 32- and 128-bit keys; the gap is smaller for 32-bit keys and larger for 128-bit keys.
One advantage of our algorithms is that they can identify heavy keys and spend little further work on them (no Local Refining needed). Thus, the running time of our algorithms decreases with more heavy keys (see Tab. 3). Many baseline algorithms also apply optimizations for heavy keys (e.g., PLSS), and they show a similar trend.
Parallel Scalability. We present scalability curves for different numbers of threads in Fig. 3a for one representative distribution (Zipfian-1.2, $n = 10^9$), and for the other distributions in Figs. 7 to 12 in the appendix. All of our semisort algorithms, as well as PLSS, generally achieve top-tier (almost linear) speedup, while the other algorithms also scale well with increasing core counts. The self-speedups of semisort= and semisort< are 50-80×. The speedup numbers are slightly worse for semisort-i= and semisort-i< (30-50× speedups), as they save the work of the hashing step but can suffer from unbalanced subproblem partitioning (light buckets).
Input Size Scalability. We test all algorithms on input sizes from $10^7$ to $10^9$ on different distributions. A representative one (Zipfian-1.2) is given in Fig. 3b, and the others are given in the appendix (Figs. 13 to 18). For very small test cases ($n \le 2\times10^7$), PLSS is the fastest on certain tests. However, in those cases, the running time is below 0.05s. For $n \ge 5\times10^7$, our algorithms are consistently faster than all baselines. These results indicate that our algorithms perform well on reasonably small inputs and scale favorably to large inputs.
Varying Key Lengths. In addition to 64-bit keys, we also test 32-bit and 128-bit keys with $n = 10^9$. We always set the value type to be the same as the key type. Full results are given in the appendix (Figs. 19 to 24), and the running times for one representative distribution (Zipfian-1.2) are shown in Fig. 4. First, integer sort algorithms are sensitive to key lengths. RS and IPS²Ra do not support 128-bit keys, and PLIS on 128-bit keys is usually the slowest in our tests. On 32-bit keys, integer sort algorithms achieve much better (relative) performance than on 64-bit keys. Also, integer sort algorithms generally perform poorly on highly-skewed data (see the discussion in Sec. 4.2). The other algorithms, including semisort (ours and GSSB) and comparison sort (PLSS and IPS⁴o), are less sensitive to key lengths. Hence, the trends on 32- and 128-bit keys are similar to those on 64-bit keys. Our new algorithms generally perform well, since semisort is simpler than sorting and we can apply special optimizations (e.g., for heavy keys). In certain cases where few special optimizations apply (e.g., uniform-$10^9$), PLSS and IPS⁴o perform similarly to our algorithms.

Collect-Reduce
We test our collect-reduce algorithm (histogram is a special case of collect-reduce) and show the results for the Zipfian distribution in Fig. 3c. Full results for the other distributions are given in Figs. 25 to 27. Recall that our collect-reduce algorithm is similar to semisort=, but directly combines the values for each key. The values of heavy keys are combined in the Blocked Distributing step (no need to distribute them), and the values of light keys are combined in the Local Refining step. The only existing parallel implementation of collect-reduce that we know of is in ParlayLib [13] (PLCR), and we compare with it. We also show the performance of semisort= as a baseline in Fig. 3c.
The reduce operator that we test (on the values) is addition. We use Zipfian distributions with varying parameters, as this smoothly covers different amounts of skew in the input. First, our collect-reduce is consistently faster than ParlayLib's implementation, and the gap grows as the distribution becomes more skewed. Furthermore, when heavy keys are more frequent, collect-reduce is significantly faster than semisort=. This is because we reduce the values for each bucket in the Blocked Distributing step, and then combine them without moving them. However, when few heavy keys exist, collect-reduce incurs more work than semisort: additional work is needed in the Local Refining step to pack the output, since some keys are combined, whereas in semisort the input size equals the output size and no packing is needed. In conclusion, when the input is more skewed (more heavy keys), collect-reduce is faster than semisort=, and vice versa on more evenly-distributed data (more light keys).
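For reference, the input/output behavior of collect-reduce with an addition combine function can be specified by a short sequential sketch. This only pins down the semantics (illustrative names); the parallel implementation combines values early instead of building a global hash map.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Collect-reduce with "+": group records by key and sum the values of each
// group. The output maps each distinct key to its aggregated value.
std::unordered_map<uint64_t, uint64_t>
collect_reduce_sum(const std::vector<std::pair<uint64_t, uint64_t>>& records) {
  std::unordered_map<uint64_t, uint64_t> result;
  for (auto& [key, value] : records) result[key] += value;
  return result;
}
```

With a heavy key, most records collapse into one accumulator, which is why more skew makes collect-reduce cheaper than a full semisort.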

Applications
We integrate our algorithms into two real-world applications, graph transposing (where the input is edges) and n-grams (where the input is strings), to test our algorithms in more realistic settings. Unlike our previous experiments with synthetic distributions, here we benchmark these applications on real-world datasets to derive a more realistic understanding of semisort performance in practice.
Graph Transposing. Our first application is computing the transpose $G^⊺$ of a directed graph $G$. Many graph problems require running reachability searches both "forwards" and "backwards"; the backward reachability searches can be performed by running forward reachability queries on $G^⊺$. In many existing graph libraries, the edges are stored in the Compressed Sparse Row (CSR) format, where for each vertex $v$, the other endpoints of the edges from $v$ are stored contiguously. Thus, transposing the graph is exactly semisorting the CSR input by the other endpoint. In some existing parallel graph libraries, such as Ligra [64] and GBBS [30], stable comparison sorts are used for graph transposing to preserve the ordering of the first endpoint.
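The correspondence between transposing a CSR graph and semisorting by the other endpoint can be sketched as follows. A sequential stable counting pass stands in for the parallel semisort here, and the function name is illustrative.

```cpp
#include <cstdint>
#include <vector>

// Transpose a CSR graph: conceptually, flatten the edges into
// (target, source) pairs and stably group them by target, which yields the
// CSR of the transposed graph. Stability keeps each in-neighbor list sorted
// by source, matching what stable sorts are used for in Ligra/GBBS.
void transpose_csr(const std::vector<uint32_t>& offsets,  // size |V|+1
                   const std::vector<uint32_t>& targets,  // size |E|
                   std::vector<uint32_t>& t_offsets,
                   std::vector<uint32_t>& t_sources) {
  uint32_t n = (uint32_t)offsets.size() - 1;
  t_offsets.assign(n + 1, 0);
  for (uint32_t v : targets) t_offsets[v + 1]++;  // in-degree counts
  for (uint32_t v = 0; v < n; v++) t_offsets[v + 1] += t_offsets[v];
  t_sources.assign(targets.size(), 0);
  std::vector<uint32_t> cursor(t_offsets.begin(), t_offsets.end() - 1);
  for (uint32_t u = 0; u < n; u++)  // stable: sources written in order
    for (uint32_t e = offsets[u]; e < offsets[u + 1]; e++)
      t_sources[cursor[targets[e]]++] = u;
}
```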
We evaluate transposing on four real-world directed graphs, soc-LiveJournal (LJ) [11], Twitter (TW) [50], Cosmo50 (CM) [51, 71], and sd_arc (SD) [55], where the largest input has 2.04 billion directed edges. For the social networks (LJ, TW) and the web graph (SD), the degree distributions are more skewed. For the $k$-NN graph CM, the degrees are more evenly distributed. We give more details about these datasets in Tab. 4. We use the initial CSR versions of these graphs and use our semisort< and semisort= algorithms to transpose them. We compare with all baseline algorithms and show the relative performance in Tab. 4. On all graphs, the keys (vertex ids) are 32-bit. Since the input data are integers, we use our integer versions (identity hash function).
Our semisort-i= is the fastest on three graphs (TW, CM, and SD), and is within 15% of the fastest on the remaining graph (LJ). Our semisort-i< is competitive, within 20% of the fastest on all graphs. PLIS has relatively good performance on all graphs; it is the fastest on LJ (the smallest graph) and within 35% on the others. In average performance across the four graphs, semisort-i= is significantly better than the others (1.15-2.13× faster). semisort-i< and PLIS have similar average performance (within 1%); they are at least 25% faster than the other implementations.
N-Gram. Our second application processes $n$-grams, where an $n$-gram is a consecutive sequence of $n$ items from a given sequence (e.g., text or speech). We use the 2-gram and 3-gram datasets from Wikipedia [8], and clean the data by keeping only alphabetical characters and converting them to lowercase. Each $n$-gram record consists of $n$ consecutive words in the document. We use the first $n-1$ words of a record as the key, and the last word as the value. We note that our algorithms compute the hash values of the words on the fly. Semisorting $n$-grams can be used to identify all possible words following a given context, to provide recommendations for text input, and to learn patterns of the input sequences. Our results are shown in Tab. 5. On both 2-grams and 3-grams, our semisort= is the fastest, while semisort< (within 25% slower) is competitive. The average performance of semisort= is 15% faster than semisort<, 24% faster than PLSS, and 2.3× faster than IPS⁴o.
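The $n$-gram key/value convention described above can be sketched as follows. The names are illustrative, and std::map stands in for semisort to group equal keys.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// For each window of n consecutive words, the first n-1 words form the key
// and the last word is the value; grouping records with equal keys (what
// semisort does) lists every word that can follow a given context.
std::map<std::string, std::vector<std::string>>
group_ngrams(const std::string& text, size_t n) {
  std::vector<std::string> words;
  std::istringstream in(text);
  for (std::string w; in >> w;) words.push_back(w);
  std::map<std::string, std::vector<std::string>> groups;
  for (size_t i = 0; i + n <= words.size(); i++) {
    std::string key;
    for (size_t j = 0; j + 1 < n; j++)  // key: first n-1 words of the window
      key += (j ? " " : "") + words[i + j];
    groups[key].push_back(words[i + n - 1]);  // value: last word
  }
  return groups;
}
```

For example, grouping the 2-grams of a sentence collects, for each word, all words that ever follow it.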

Conclusions and Future Work
In this paper, we designed flexible and high-performance algorithms for semisort and related problems. We presented two implementations, semisort= (only the equality test is required) and semisort< (the less-than test is also available). Compared to previous semisort algorithms, our new algorithms yield improvements in space-efficiency and I/O-friendliness, ensure stability and determinism, and, importantly, increase the flexibility of the interface. Across different input distributions, input sizes, and key lengths, our implementations achieve high performance and outperform existing sorting and semisorting algorithms in most tests. For example, on $10^9$ 64-bit keys, one of our algorithms is the fastest among all tested algorithms on every tested distribution, and the other always performs similarly.
Based on our experiments, in-place versions of the sorting algorithms (e.g., IPS⁴o) are competitive with, and sometimes more efficient than, the non-in-place versions (e.g., PLSS). The good performance of the in-place algorithms is due to I/O savings in the distributing step: they use the same array for both the input and the buckets ($A$ and $T$ in Alg. 1). We note that the new techniques proposed in this paper are independent of this distribution step. An interesting future direction is to redesign this step (e.g., borrowing ideas from IPS⁴o) to improve overall performance and reduce extra space usage.

Figure 1 :
Figure 1: Heatmap of the relative performance of implementations, normalized to the fastest in each test (each row). $n = 10^9$, 64-bit keys and 64-bit values. The parameters of the exponential distributions are multiplied by $10^4$. The algorithm names and details are introduced in Tab. 2.

Table 1 :
Notations and parameters used in our algorithms.
Algorithm 1 (fragment): Sampling and Bucketing
1   if $|A| < \theta$ then return BaseCase($A$, $h$, comp)   // Base cases
Sampling and Bucketing:
2   $S \gets n_h\log n$ sampled keys from $A$
3   Count the occurrences of each key in $S$
4   Initialize the heavy table $H$
5   $c \gets n_b$
// This for-loop can also be performed in parallel theoretically
6   for each distinct key $k \in S$ do
7       if the occurrences of $k$ in $S$ is at least $\log n$ then
8           $H$.insert($k$, $c$)   // Assign bucket id $c$ to heavy key $k$
9           $c \gets c + 1$
10  $n_h \gets$ number of distinct keys in $H$
Blocked Distributing:

Table 2 :
Name     Stable  Det.  K    comp  Notes
Ours=    Yes     Yes   Any  =     Our semisort= algorithm
Ours<    Yes     Yes   Any  <     Our semisort< algorithm
Ours-i=  Yes     Yes   Int  =     Our integer semisort= algorithm
Ours-i<  Yes     Yes   Int  <     Our integer semisort< algorithm
Algorithms tested in our experiments. "Det." = determinism. "K" = key type. "Any" = any input key type. "Int" = only allows integer keys. "comp" = required comparison function. PLSS has two implementations; we use the faster but unstable version in our experiments.

Table 3 :
Columns: Ours=, Ours<, PLSS, IPS⁴o, Ours-i=, Ours-i<, PLIS, GSSB, RS, IPS²Ra.
Running times with different input distributions, with $n = 10^9$, 64-bit keys and 64-bit values. Underlined numbers are the fastest running time in each distribution-input type instance. "parameter" = distribution parameter (i.e., $\mu$ in uniform, $\lambda$ in exponential, and $s$ in Zipfian distribution). "Distinct keys", "maximum frequency", and "heavy frequency" are statistics for each test (see details in Sec. 5). The algorithm names are described in Tab. 2. "Avg." = geometric mean across multiple tests.

Table 4 :
Columns: $n$, $m$, $n_{dist}$, $f_{max}$, $f_{heavy}$, Ours-i=, Ours-i<, PLSS, IPS⁴o, PLIS, GSSB, RS, IPS²Ra.
Running time on graph transposing (in seconds). $n$ = number of vertices. $m$ = number of edges. $n_{dist}$ = number of distinct keys. $f_{max}$ = maximum frequency. $f_{heavy}$ = ratio of keys with more than $500\log n$ occurrences. "t.o." = did not finish in one minute. "s.g." = segmentation fault.
Table 5 columns: $n_{dist}$, $f_{max}$, $f_{heavy}$, Ours=, Ours<, PLSS, IPS⁴o.