Provably-Efficient and Internally-Deterministic Parallel Union-Find

Determining the degree of inherent parallelism in classical sequential algorithms and leveraging it for fast parallel execution is a key topic in parallel computing, and detailed analyses are known for a wide range of classical algorithms. In this paper, we perform the first such analysis for the fundamental Union-Find problem, in which we are given a graph as a sequence of edges, and must maintain its connectivity structure under edge additions. We prove that classic sequential algorithms for this problem are well-parallelizable under reasonable assumptions, addressing a conjecture by [Blelloch, 2017]. More precisely, we show via a new potential argument that, under uniform random edge ordering, parallel union-find operations are unlikely to interfere: $T$ concurrent threads processing the graph in parallel will encounter memory contention $O(T^2 \cdot \log |V| \cdot \log |E|)$ times in expectation, where $|E|$ and $|V|$ are the number of edges and nodes in the graph, respectively. We leverage this result to design a new parallel Union-Find algorithm that is internally deterministic, i.e., its results are guaranteed to match those of a sequential execution, while also being work-efficient and scalable, as long as the number of threads $T$ is $O(|E|^{\frac{1}{3} - \varepsilon})$, for an arbitrarily small constant $\varepsilon>0$, which holds for most large real-world graphs. We present lower bounds which show that our analysis is close to optimal, and experimental results suggesting that the performance cost of internal determinism is limited.


Introduction
A popular approach to efficient parallelization has been to leverage the inherent parallelism present in many sequential algorithms, and graph problems have been shown to be a particularly fertile ground for this approach. A simple illustration is given by the greedy algorithm for Maximal Independent Set (MIS) on graphs, in which we initially fix a random ordering of the graph nodes, after which we process nodes one at a time, in this order, adding the node to the MIS if none of its earlier neighbors has been previously added to the MIS, and discarding it otherwise. When trying to parallelize such sequential algorithms, two key questions are 1) the likelihood of contention, i.e., that two arbitrary nodes in the ordering have data dependencies, and 2) the depth of the longest dependency chain between "dependent" nodes, for any given graph. Intuitively, the total number of nodes divided by the maximal dependency depth gives an estimate for how many nodes can be processed in parallel, on average. Assuming low dependency depth, e.g., sublinear in the number of nodes or edges, a second challenge is to design parallel algorithms which are able to leverage it for fast end-to-end runtimes, usually measured as the total number of thread memory accesses or work, which includes managing any auxiliary data structures.
[...] $O(|E|^{\frac{1}{3}-\varepsilon})$, for any constant $\varepsilon > 0$, which should be reasonable in practice, as graph inputs tend to be large relative to parallelism.
Our main technical contribution is in the potential argument bounding the collision probability between tasks processing two distinct edges in the random ordering. Specifically, we assume a sequential algorithm maintaining a standard "compressed forest" data structure, where elements are nodes, arranged into directed trees, and the tree root is the set representative [6]. The addition of a new edge may lead to components being linked, where the link direction is decided by the algorithm. Two edges collide if they cannot be processed in parallel: for instance, this can happen when the two edges processed simultaneously would lead to the same root being linked to different components.
In this context, we first analyze a "sequentialized" variant of the process, in which, at each step, a new edge is added to the data structure, and consider the probability $p_t$ of two randomly chosen edges "colliding" after exactly $t$ steps. While we cannot bound $p_t$ independently of the graph structure, and $p_t$ may fluctuate significantly over steps, we are able to bound the average value of $p_t$, taken over time steps $t$, via a new potential function, which we show to be well-correlated with the "smoothed" collision probabilities over time. In turn, this implies a bound on the expected number of collisions over a number of parallel steps. We present a complementary lower bound showing that the number of collisions for any Union-Find algorithm is $\Omega(\log |V|)$ for cycle graphs.
Next, we design a work-efficient algorithm which leverages these analytical observations. We apply the deterministic reservations approach [8], which we customize to our setting. Specifically, for a well-chosen parameter $S$, the algorithm proceeds in "windows" of $S$ consecutive edges, where in each such stage the threads first attempt to "mark" roots in parallel via deterministic reservations, and then examine whether any of the reservations resulted in data conflicts because of the underlying graph structure. If no such conflict occurs, then the edges can be processed fully in parallel. Otherwise, the threads process the conflict-free prefix. Then, they proceed to execute a new window of size $S$ on the remaining suffix. The collision bound above implies that this algorithm has asymptotically-optimal work if the number of threads is $O(|E|^{\frac{1}{3}-\varepsilon})$.
Our algorithm is internally deterministic: roughly, given a fixed input ordering, one obtains a unique dependency graph between the tasks corresponding to the edges, and will therefore have the same complexity as an equivalent sequential execution. As a consequence of internal determinism, our algorithm should be generally useful as an efficient sub-procedure: for example, we can leverage it for a new solution for the Dynamic Spanning Tree and the Minimum Spanning Tree (MST) problems. Our main result is as follows:
[...] 3 · polylog(|E|)) parallel depth on a randomly shuffled sequence of edges and for the number of parallel threads $T = O(|E|^{\frac{1}{3}-\varepsilon})$. This also implies a parallel version of Kruskal's MST algorithm, which has the same work and depth bounds in the special case where we have a sorted sequence of edges, and edge weights are generated so that the sorted sequence is a random shuffle.

Related Work
Our work extends the line of research analyzing inherent parallelism in classical sequential algorithms [1,2,3,4,5] to show that Union-Find is also efficiently parallelizable. This partly addresses a question by Blelloch [7], who posed the dependency depth of Union-Find as an open problem.
Prior work proposed a linear-work algorithm for graph connectivity [9]; however, this uses a different decomposition-based parallelization approach, which requires a static graph, and does not allow for incremental edge additions, nor online connectivity queries. The best known parallel algorithm for a variant of Union-Find in the concurrent-read concurrent-write (CRCW) PRAM model was proposed by Simsiri et al. [10]. They investigate a parallel version of batched union-find, assuming that operations are inherently grouped into batches. The algorithm is work-efficient and guarantees polylogarithmic span. It processes batches of find operations with path compaction and asymptotically the same total work as in the sequential setting; however, this part requires significant synchronization between threads, and thus, it may not be very efficient in practice. For union operations, the approach is to reduce the problem to a linear-work parallel connected components algorithm, such as [9]. By comparison, our algorithm is internally deterministic, which allows it to be used as a sub-component for, e.g., MST algorithms.
Anderson et al. [11] considered incrementally maintaining a spanning tree in a "sliding window" model, where edges may appear and disappear over time, which is different from ours. Our results should also extend to analyzing alternative ways of parallelizing sequential iterative algorithms, such as by defining task priorities and executing them via a relaxed priority queue [12,13], in which case it can upper bound the amount of wasted work due to out-of-order execution; we leave this analysis for future work.
We also mention existing work on efficient concurrent variants of Union-Find algorithms, which modify the original linking approaches to employ atomic operations allowing concurrent access to the Union-Find data structure [14,15,16]. Our work is only partly related, as it considers a completely different parallelization approach, with different metrics and progress guarantees. Specifically, the above line of work considers an asynchronous shared-memory model with atomic operations, studying total step complexity. By contrast, we consider a standard parallel model, and study classic notions of task parallelism such as dependency depth and collisions.

Notation and Preliminaries
Arguably, the most common data structure to address the Disjoint-Set Union / Union-Find problem is the compressed forest [6]. Here, nodes / set elements form directed trees with roots used as representatives of their sets. As a result, checking whether two elements are in the same set can be implemented by following parent links of these elements in the directed trees, finding the root representatives, and comparing them. Merging two sets means adding a link from one root to another one.
Algorithms based on compressed forest data structures differ in two key ingredients. The first is the linking technique: by choosing which root becomes the common root when merging two trees, it is possible to limit tree depth. Three popular linking strategies are linking by size (linking the smaller tree to the larger one), linking by rank (similar to linking by size, but tree sizes are approximated), and linking by random priorities (the root with lower random priority is linked to the root with higher priority). All three linking strategies achieve $O(\log n)$ maximum tree depth [6,17], though in the case of linking by random priorities this is in expectation. The second key component is path compaction: traversed paths in trees are shortened after every root search by replacing parents of every visited node with nodes higher in the tree. When combined with any of the linking techniques, this results in $O(\alpha(n))$ amortized time per operation, where $\alpha(n)$ is the inverse-Ackermann function.
We will associate unite(u, v) operations with (u, v) edges in the union graph $G = (V, E)$, where vertices correspond to Union-Find elements. These edges can also be viewed as tasks; to execute the task means to unite sets corresponding to edge endpoints. This high-level approach allows us to study union-find properties for different graph structures.
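As an illustration of the compressed forest described above, the following is a minimal Python sketch of a sequential Union-Find with linking by size and path compaction. The class name `UnionFind`, the method names, and the tie-breaking rule (equal sizes are ordered by root index) are assumptions made for this sketch and are not taken from the paper.

```python
from typing import List


class UnionFind:
    """Compressed forest with linking by size and path compaction (sketch)."""

    def __init__(self, n: int) -> None:
        self.parent: List[int] = list(range(n))  # parent[v] == v marks a root
        self.size: List[int] = [1] * n           # meaningful only at roots

    def find(self, v: int) -> int:
        """Return the root (set representative) of v."""
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compaction: re-link every visited node directly to the root.
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def unite(self, u: int, v: int) -> bool:
        """Process edge (u, v); return True iff two different sets were merged."""
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False
        # Linking by size. Ties are broken by root index (an assumption of this
        # sketch) so that the components are totally ordered.
        if (self.size[ru], ru) > (self.size[rv], rv):
            ru, rv = rv, ru
        self.parent[ru] = rv           # link the smaller root to the larger one
        self.size[rv] += self.size[ru]
        return True
```

Executing the task for an edge (u, v) then amounts to calling `unite(u, v)`, and a connectivity query compares `find(u)` with `find(v)`.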

Definition of Collisions
The key definition for analysing parallel and concurrent dynamic graph algorithms is that of an edge collision. Intuitively, two edges collide when they cannot be processed in parallel. To define what this means for the Union-Find problem, consider Listing 1. When joining sets (i.e., trees), the classical Union-Find implementation compares their sizes and then links the root of the smaller one to the root of the larger tree (lines 6 and 9). It is easy to see that concurrent links of the same root to different trees result in the loss of a link, and thus, are incorrect. In practice, this is where concurrent Union-Find algorithms employ synchronization primitives (e.g., Compare-And-Set in [16]). We say that two edges collide when they both connect different components and share the "smaller" component (the exact order of the components is determined by the linking strategy). Note that size updates in lines 7 and 10 are much less crucial, since they commute and can be implemented with simple atomic operations such as fetch-and-add.
One may ask whether the definition of collision can be simplified to requiring that "two edges share an adjacent tree." This change will not make data structure implementations simpler but will simplify the analysis. Unfortunately, this dramatically increases the number of collisions. Consider Erdős-Rényi random graphs [18] with the number of already inserted edges between [...], where $\varepsilon$ is a constant strictly between $0$ and $\frac{1}{4}$. It is known that the largest component in this case is of size $\Omega(|V|^{2/3})$, and all other components are of sizes $O(\log |V|)$, w.h.p. [18]. As a result, the probability of simplified collision for two random edges is $\Omega(|V|^{-1/3})$, while in the case of our definition [...]. In fact, this suggests that for our definition the total number of collisions for random graphs is polylogarithmic, whereas for the simplified definition it is polynomial.
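The difference between the two definitions can be made concrete with a small Python sketch. It assumes linking by size with ties broken by root index, and represents the forest by plain `parent` and `size` lists; the function names `smaller_root`, `collide`, and `share_adjacent_tree` are hypothetical, and `share_adjacent_tree` encodes one possible reading of the simplified definition.

```python
from typing import List, Optional, Tuple

Edge = Tuple[int, int]


def find(parent: List[int], v: int) -> int:
    """Follow parent links up to the root (no compaction, read-only)."""
    while parent[v] != v:
        v = parent[v]
    return v


def smaller_root(parent: List[int], size: List[int], edge: Edge) -> Optional[int]:
    """Root that would be re-linked when processing `edge`, or None if internal."""
    ru, rv = find(parent, edge[0]), find(parent, edge[1])
    if ru == rv:
        return None
    # Assumed total order on components: (size, root index).
    return min(ru, rv, key=lambda r: (size[r], r))


def collide(parent: List[int], size: List[int], e1: Edge, e2: Edge) -> bool:
    """Both edges connect different components and share the smaller one."""
    s1 = smaller_root(parent, size, e1)
    s2 = smaller_root(parent, size, e2)
    return s1 is not None and s1 == s2


def share_adjacent_tree(parent: List[int], e1: Edge, e2: Edge) -> bool:
    """Simplified definition: both edges are external and touch a common tree."""
    r1 = {find(parent, e1[0]), find(parent, e1[1])}
    r2 = {find(parent, e2[0]), find(parent, e2[1])}
    return len(r1) == 2 and len(r2) == 2 and bool(r1 & r2)


# One component {0, 1, 2} rooted at 0, plus the singletons {3} and {4}.
parent = [0, 0, 0, 3, 4]
size = [3, 1, 1, 1, 1]
```

In this configuration, the edges (3, 0) and (4, 1) both attach a singleton to the large component: they write to different roots and do not collide under the definition used here, yet they share an adjacent tree under the simplified one.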

The Random Process
Similar to prior work, e.g., [1,8], it is convenient to assume that the operations of the Union-Find algorithm are executed sequentially, i.e., edges are added one by one to the union graph $G$ in some order. This may not be the case in the actual execution, since parallelism allows several edges to be added at once, although, in our parallel algorithm, to guarantee deterministic computations, these edges will still be from the prefix of the unprocessed edge sequence. We call "step $t$" the moment of time when the $t$-th edge has been added to the data structure. Specifically, we denote the $t$-th added edge by $\varepsilon_t$ and the current set of inserted edges is $E_t = \{\varepsilon_1, \ldots, \varepsilon_t\}$. Edges that are yet to be added (i.e., edges from $E \setminus E_t$) are called active. In each step, we assume in our analysis that all present connected components are enumerated in the order defined by the used linking strategy (e.g., in order of increasing sizes when linking by size is used) from $1$ to $C_t$. Finally, we call $m_i^t$ the number of active edges that connect the component of index $i$ to a connected component with a strictly greater index (i.e., to a "larger" component) in step $t$. Notice that these $m_i^t$ edges pairwise collide according to our definition of collision.
Figure 1: An example of two edges colliding in the case of linking by size. The black solid edges are the ones already inserted. The dashed edges are yet to be added to the data structure. Since $C_3$ is the smaller component, the blue edges will try to write to the same memory location (the root of $C_3$), i.e., they have a collision and cannot be added to the data structure in parallel.
As in prior works analysing graph algorithms [3,4,1], we assume that the order of edges is uniformly random. Otherwise, there can be examples with $\Omega(|E|)$ collisions among subsequent pairs of edges in the sequence, resulting in no potential for parallelism. We will prove that the expected number of collisions among two random active edges is small. More formally, let $X_t$ be an indicator random variable for the event "two distinct active edges (i.e., from $E \setminus E_t$) chosen uniformly at random in step $t$ collide." Let the probability of collision for two random distinct active edges in step $t$ be $p_t = \mathbb{E}[X_t]$. We will use the conditional expectation $\mathbb{E}[X_t \mid \varepsilon_1, \ldots, \varepsilon_t]$ when previously added edges are known ($\varepsilon_t$ is a random variable when the order of edges is random).
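For a fixed edge order, the conditional collision probability can be computed directly from the counts $m_i^t$: since two active edges collide exactly when they share the same smaller component, the number of colliding ordered pairs is the sum of $m_i^t (m_i^t - 1)$ over all components. The Python sketch below does this by brute force; the function name `collision_probability` and the tie-breaking by root index are assumptions of this sketch.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def collision_probability(n: int, edges: Sequence[Edge], t: int) -> float:
    """Probability that two distinct random active edges collide after the
    first t edges of `edges` have been inserted (linking by size)."""
    parent: List[int] = list(range(n))
    size: List[int] = [1] * n

    def find(v: int) -> int:
        while parent[v] != v:
            v = parent[v]
        return v

    # Insert the first t edges sequentially.
    for u, v in edges[:t]:
        ru, rv = find(u), find(v)
        if ru != rv:
            if (size[ru], ru) > (size[rv], rv):  # ties broken by root index
                ru, rv = rv, ru
            parent[ru] = rv
            size[rv] += size[ru]

    # m[r] = number of active edges whose smaller component has root r.
    m: Counter = Counter()
    for u, v in edges[t:]:
        ru, rv = find(u), find(v)
        if ru != rv:
            m[min(ru, rv, key=lambda r: (size[r], r))] += 1

    active = len(edges) - t
    if active < 2:
        return 0.0
    colliding_ordered_pairs = sum(c * (c - 1) for c in m.values())
    return colliding_ordered_pairs / (active * (active - 1))
```

Averaging this quantity over many random shuffles of the edge sequence gives an empirical estimate of $p_t$ for a given graph.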
The next theorem is the core of our performance analysis in Section 4, and is proven in Section 3. It states that the sum of probabilities of collisions over all steps is polylogarithmic.

Model of Parallelism
For our parallel algorithm (Subsection 4.2), we assume a concurrent-read concurrent-write parallel random-access machine (CRCW PRAM) with priority write. Priority write (write_min) is an atomic instruction that writes an input value to some memory address only if it is smaller than the current value at that location. This atomic instruction is widely used in parallel algorithms, especially for the Connected Components and Spanning Tree problems, and can be efficiently implemented [9,3].
We analyse our parallel algorithms using the standard work-depth approach. The work of a parallel algorithm is the total number of executed instructions. The parallel depth (or span) is the length of the longest computational dependency chain.
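The semantics of priority write can be emulated in a few lines of Python. This is only a sketch of the behaviour: the class `PriorityCell` and the helper `reserve_concurrently` are hypothetical names, and a lock stands in for the atomic instruction, which on real hardware would typically be realised with a compare-and-swap loop.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable


class PriorityCell:
    """A memory cell supporting an emulated atomic write_min."""

    def __init__(self, value: float = float("inf")) -> None:
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def write_min(self, candidate: float) -> bool:
        """Store `candidate` only if it is smaller than the current value.
        Return True iff the cell was updated."""
        with self._lock:
            if candidate < self._value:
                self._value = candidate
                return True
            return False

    def read(self) -> float:
        with self._lock:
            return self._value


def reserve_concurrently(values: Iterable[float], workers: int = 4) -> float:
    """Issue one write_min per value from a thread pool; return the final cell."""
    cell = PriorityCell()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(cell.write_min, values))
    return cell.read()
```

Whatever the interleaving of the threads, the cell ends up holding the smallest value written, which is what makes reservations based on write_min deterministic.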

Collision Analysis
The goal of this section is to introduce an argument for bounding the number of collisions in the case of random edge ordering. First of all, we will prove the following lemma about collision probability when all added edges are known. Please recall that all notation has been defined in Subsection 2.2.
Lemma 3.1. For any $t \ge 0$, the collision probability in step $t$ for fixed $\varepsilon_1, \ldots, \varepsilon_t$ is
$$\mathbb{E}[X_t \mid \varepsilon_1, \ldots, \varepsilon_t] = \sum_{i=1}^{C_t} \frac{m_i^t}{|E|-t} \cdot \frac{m_i^t - 1}{|E|-t-1}.$$
Proof. The probability that we select an edge from component $i$ to a larger component is the number of such edges, denoted by $m_i^t$, divided by the number of remaining edges in step $t$, which is $|E| - t$. The only edges that can cause a collision with the chosen edge are other edges connecting component $i$ to a larger component. Consequently, the probability that another random edge causes a collision with regard to the selected edge is $\frac{m_i^t - 1}{|E|-t-1}$, and therefore, the probability that two random edges collide is $\sum_{i=1}^{C_t} \frac{m_i^t}{|E|-t} \cdot \frac{m_i^t - 1}{|E|-t-1}$.
Overview. A natural proof strategy would be to directly bound $p_t$. However, the value of $p_t$ can heavily depend on the structure of graph $G$. For example, it follows by simple calculation that for a star graph, $p_0$ is at least some constant greater than zero, while for a cycle graph $p_0 = o(1)$, but $p_{|E|-3}$ is a non-zero constant. Instead, we prove that $p_t$ cannot be high for all possible $t$ by employing a type of amortized analysis. Specifically, we present a metric which increases considerably when $p_t$ is high, and is bounded at the same time. As a result, this will help us to bound the sum of the collision probabilities $p_t$ over all steps.
Let $\mathrm{depth}_t(u)$ be the depth of vertex $u$ in the uncompressed Union-Find forest in step $t$, where "uncompressed" means that depth is counted ignoring any previous path compaction. Most reasonable linking techniques guarantee that for any vertex $u$, we have that $\mathrm{depth}_t(u) \le C \cdot \log |V|$ for some constant $C > 0$.
We define our potential function as follows:
Definition 3.2. For each edge $(u, v) \in E$, we define its rank at time $t$ as $\varphi_t(u, v) = \mathrm{depth}_t(u) + \mathrm{depth}_t(v)$. The potential function at $t$ will be $\Phi_t$ [...]
In more detail, the potential function is a sum over all edges of their ranks, but the ranks are multiplied by a constantly growing factor $\frac{|E|}{|E|-t}$, and the rank $\varphi_i$ and its multiplier for edge $\varepsilon_i$ are "frozen" at time $i$ when the edge was processed. Initially, we have $\Phi_0 = 0$, as the graph is empty, and thus, the ranks of all edges are zero. Let us observe how this potential is expected to change after each step, given that all remaining edges have equal probability to be added to the data structure. That is, we follow the sum [...] Next, we connect this potential with the probability of collision.
Proof. Assume that in step $t + 1$, the newly added edge $\varepsilon_{t+1}$ is $(u, v) \in E$. If the edge is internal, i.e., connects nodes in the same component, the edge does not affect connectivity or edge ranks. The potential function still increases because of the increasing multiplier $\frac{|E|}{|E|-t}$, but we use a trivial lower bound of 0 for this increase. If the edge is external, i.e., connects two different connected components, assume that the index of the smaller component is $i$. Then, the number of edges connecting this component to larger components is $m_i^t$, so the probability that this component is chosen as the smaller one is $\frac{m_i^t}{|E|-t}$. Due to the fact that the root of this component will be linked to another component, ranks of all active edges adjacent to it are increased by at least 1. We can lower bound the number of such edges by $m_i^t$. This means that, with probability at least $\frac{m_i^t}{|E|-t}$, component $i$ is chosen and then the sum of active edge ranks is increased by at least $m_i^t$. Finally, taking into consideration the multiplier of the potential function and summing over all components, we get the required inequality [...]
Proof. Follows by combining Lemma 3.1 with the inequality in Lemma 3.3. Note that these expectations are different: the left one is over the choice of $\varepsilon_{t+1}$, while the right one is over the choice of two random active edges.
Proof. Summing up the inequality from Lemma 3.4 over all steps and taking the expectation of the sum, we get the following: [...] By the law of total expectation, this can be simplified to the next inequality: [...] Finally, using the definition of $p_t$ and reducing the left sum, we deduce that: [...]
This corollary means that in order to upper bound the smoothed collision probabilities [...]
Theorem 3.7. The sum of collision probabilities satisfies [...]
Proof. Combining Corollary 3.5 and Lemma 3.6, we conclude [...]
Surprisingly, the proof of Theorem 3.7 shows

Work-Efficient Union-Find
We start by showing how to apply Theorem 3.7 to some practical Union-Find algorithms in various computation models. Specifically, we discuss the incremental dynamic connectivity and minimum spanning tree problems.

Bounding Contention
The discussion so far assumed an idealized sequential execution. We now wish to analyse concurrent Union-Find algorithms, which can be implemented in practice via locks, hardware transactions, or lock-free primitives [15,19], using a generalized concurrency model. In all these implementations, extra work comes from write contention, i.e., a situation in which several threads try to modify the same memory location simultaneously. For locks, write contention means the need to wait until another thread finishes modifying the required memory location, while for hardware transactions and lock-free primitives, it means retries of operations. Moreover, in practice, write contention causes additional L3 cache misses in NUMA (Non-Uniform Memory Access) computer architectures. This is why we will focus on bounding the probability that threads experience write contention when attempting to process edges in parallel.
More formally, our basic concurrency model is: a set of T ≥ 2 threads execute synchronously in iterations, where in each iteration each thread picks a remaining edge uniformly at random, and inserts it into the Union-Find data structure.
The threads may pick colliding edges, in which case we register a write contention event; otherwise, the edges are processed without contention. Either way, we assume that the edges are processed. Our goal is to bound the expected total number of write contention events between threads when processing all edges. In our case, write contention between threads means collision between their edges according to our definition of collision.
Theorem 4.1. The expected number of write contention events for two threads in this concurrent model is bounded by [...]
Proof. Since edges are inserted randomly into the data structure, this algorithm is almost the same as the random process described in Subsection 2.2. Specifically, iteration $t$ in this algorithm corresponds to step $2t$ in the random process because two new edges are added every iteration. Moreover, the probability of write contention at iteration $t$ is exactly $p_{2t}$ (see the definitions in Subsection 2.2). So, the total number of write contention events is [...] The last inequality is from Theorem 3.7, stated at the end of the previous section.
Theorem 4.2. The expected number of write contention events for $T$ threads in this concurrent model is $O(T^2 \cdot \log |V| \cdot \log |E|)$.
Proof. The number of possible pairwise contentions every iteration is $\Theta(T^2)$, so by linearity of expectation, the expected number of collisions at iteration $t$ is $O(T^2 \cdot p_{T \cdot t})$. Then, the total number of contention events is bounded by [...]
The last two results show that the number of write contention events depends on the size of the graph only polylogarithmically, and thus, may be dominated by other costs of processing a graph. This suggests fairly low contention cost for implementing Union-Find via most concurrency primitives (locks, atomic operations, transactions), which was also previously shown in practice [19].
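The synchronous concurrency model analysed in this subsection is easy to simulate, which allows the contention bound to be checked empirically on concrete graphs. The following Python sketch is one such simulation under stated assumptions: the function name `simulate_contention` is hypothetical, linking by size with ties broken by root index is used, the threads pick distinct edges in each iteration, and every pair of colliding edges in an iteration is counted as one contention event.

```python
import random
from itertools import combinations
from typing import List, Optional, Sequence, Tuple

Edge = Tuple[int, int]


def simulate_contention(n: int, edges: Sequence[Edge],
                        num_threads: int, seed: int = 0) -> int:
    """Count pairwise write contention events when `num_threads` threads
    insert uniformly random remaining edges in synchronous iterations."""
    rng = random.Random(seed)
    remaining = list(edges)
    rng.shuffle(remaining)  # popping from a shuffled list = uniform random picks
    parent: List[int] = list(range(n))
    size: List[int] = [1] * n

    def find(v: int) -> int:
        while parent[v] != v:
            v = parent[v]
        return v

    def smaller_root(edge: Edge) -> Optional[int]:
        ru, rv = find(edge[0]), find(edge[1])
        if ru == rv:
            return None
        return min(ru, rv, key=lambda r: (size[r], r))

    events = 0
    while remaining:
        batch = [remaining.pop() for _ in range(min(num_threads, len(remaining)))]
        # Two edges contend iff they would re-link the same (smaller) root.
        targets = [smaller_root(e) for e in batch]
        events += sum(1 for a, b in combinations(targets, 2)
                      if a is not None and a == b)
        # Either way, all edges of the batch are processed.
        for u, v in batch:
            ru, rv = find(u), find(v)
            if ru != rv:
                if (size[ru], ru) > (size[rv], rv):
                    ru, rv = rv, ru
                parent[ru] = rv
                size[rv] += size[ru]
    return events
```

Running this for a fixed graph with increasing `num_threads`, and averaging over seeds, gives an empirical view of how the number of contention events grows with $T$.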

Parallel Algorithm
Our main goal is to design a parallel iterative algorithm for Union-Find that is work-efficient and internally deterministic. While work-efficient parallel algorithms are well-known for many problems, relatively few algorithms have the second property [8,3,4]. Internal determinism is particularly interesting for Union-Find, since running an internally-deterministic parallel Union-Find algorithm on a sorted sequence of edges is similar to Kruskal's algorithm [20] and results in a minimum spanning tree. Blelloch et al. [8] present a practical algorithm for Union-Find but without a theoretical analysis. Their algorithm is deterministic but not internally deterministic, as its output (i.e., edges used for connectivity) depends on some predefined parameter and may diverge from the output of the sequential algorithm.
We aim to close these gaps by providing a practically-efficient algorithm which is provably work-efficient and internally deterministic. Specifically, the algorithm uses the concurrent-read concurrent-write (CRCW) PRAM model with priority write (see Subsection 2.3).

Deterministic Reservations
We build on the algorithm of Blelloch et al. [8], which uses the deterministic reservations approach. Specifically, in each iteration, their algorithm considers the prefix of remaining tasks (i.e., edges) of size $S$ and then proceeds in two phases. In the first phase, these tasks reserve the memory locations they want to change by using priority write. In the second phase, all tasks in the prefix that succeeded in their reservations are executed. For the Union-Find problem, in the first phase tasks reserve the smaller adjacent connected component for each edge by making a priority write to its root, and then, in the case of a successful reservation, link the roots of these smaller components to the other components. This behaviour follows our definition of collision: the algorithm adds all edges in the prefix that do not collide with any preceding edges.
There are several challenges when making this algorithm both work-efficient and internally deterministic. First, Blelloch et al. do not analyze how often collisions happen and how many iterations the algorithm requires. Furthermore, parallelism in their algorithm violates linking strategy properties, which may make their algorithm not work-efficient. For example, on a directed path, their algorithm can add all edges in one iteration, but the depth of the resulting Union-Find forest will be linear. Last but not least, the algorithm is not internally deterministic: there are cases where the sequential algorithm uses some edge for connectivity (i.e., union(u, v) returns true), but this algorithm replaces it with an edge located later in the sequence. Blelloch et al. suggest a modification of their algorithm to solve Spanning Tree, which does make it internally deterministic, but it requires each edge to reserve both of its endpoints. As a result, for random or star graphs, few operations will succeed in every iteration, since most operations will try to reserve the largest connected component.

Figure 2: An example of how our parallel algorithm processes edges (figure labels: Considered Prefix, Processed Prefix). It scans some prefix of the remaining edge sequence and does deterministic reservations. If there is a collision, i.e., for some edge, the deterministic reservation failed because of another edge, the algorithm does not process the whole considered prefix, but only the part before the first collision.

Internally-Deterministic Algorithm
One key difference between the algorithm of Blelloch et al. and the one we are analyzing is that we do not allow the execution of a task unless all tasks before it will be executed by the end of the current iteration (Figure 2). In other words, we execute all tasks in the prefix until a task with a failed reservation, i.e., an edge with a collision with one of the previous edges. This modification addresses the latter problem in the algorithm of Blelloch et al. by making it internally deterministic. This appears to make the algorithm less efficient; yet, we will prove that collisions are relatively rare, so the modification does not have a significant impact on performance. Similarly to other iterative algorithms [3,4], we assume that the order of edges is uniformly random.
Our parallel algorithm is presented in Listing 2. It starts the same way as the algorithm of Blelloch et al. [8] by making deterministic reservations in the unprocessed prefix of size $S$ (Line 16). Specifically, the algorithm finds the roots of the corresponding trees for each edge and tries to reserve the smaller one according to the used linking strategy. Then, it checks in parallel whether all reservations are successful and, if not, it finds the first unsuccessful reservation using the write_min instruction (Line 17). Finally, the algorithm completes all edge additions in the prefix until the first failed reservation (Line 19). In the simplest case, this means just linking the root that was designated as "smaller" in the previous step to the larger one. However, we will propose another approach that maintains all the guarantees of the linking techniques on forest depth. This process is repeated until all edges are processed.
Listing 2: Parallel algorithm for Union-Find. In each iteration, a prefix of size $S$ is being processed in parallel unless there is a write collision among its tasks. Parallel-Link-All was presented in the work-efficient Union-Find [10]. It groups roots according to connected components and then unites them via a recursive divide-and-conquer strategy.
Asymptotic Optimizations. A naive implementation would spend $O(S \cdot \mathrm{depth})$ total work and have $O(\mathrm{depth})$ span for each iteration, where depth is the Union-Find forest depth; due to parallelism, even with linking techniques, the forest depth potentially can be up to linear in the simplest algorithm. However, incremental improvements can address this and improve total work to an optimal $O(S \cdot \alpha(|V|))$ while keeping the span polylogarithmic, where $\alpha(\cdot)$ is the inverse Ackermann function. Both adjustments were first proposed by Simsiri et al. [10] in their work-efficient batched parallel Union-Find algorithm. First of all, Parallel-Link-All can preserve forest depth bounds if a linear-work connected components parallel algorithm is used to group all connected components that will form one component [10]. Then, all grouped connected components can be merged in parallel by applying a recursive divide-and-conquer approach. In particular, when merging two Union-Find trees, any linking strategy can be used, since all information about these trees has been calculated recursively. This Parallel-Link-All takes linear work and polylogarithmic span. The second optimization ensures that parallelism in path compaction does not increase total work. Specifically, in lines 4 and 5 the same bulk-parallel approach as in [10] is employed. Roughly, the algorithm acts like a parallel BFS (Breadth-First Search) that starts in vertices, for which we want to find the root, and ascends wave-by-wave until it finds all roots. Then it traverses all nodes again and re-links them directly to the roots. These adjustments allow Union-Find to use both path compaction and linking strategies, and as a result, the average work per edge is the optimal $O(\alpha(|V|))$.
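The iteration structure described in this subsection (reserve in a window, locate the first failed reservation, link the conflict-free prefix) can be emulated sequentially. The Python sketch below is such an emulation and is not the paper's listing: the names `windowed_union_find` and `sequential_used` are hypothetical, the three phases run as plain loops rather than in parallel, write_min is emulated with `min`, linking uses the simplest variant (the reserved root is linked directly to the other root, so the depth guarantees of Parallel-Link-All are not reproduced), and ties in linking by size are broken by root index.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]


def windowed_union_find(n: int, edges: Sequence[Edge], window: int) -> List[bool]:
    """Return, for each edge, whether it was used to merge two components."""
    parent, size = list(range(n)), [1] * n

    def find(v: int) -> int:
        root = v
        while parent[root] != root:
            root = parent[root]
        while parent[v] != root:  # path compaction
            parent[v], v = root, parent[v]
        return root

    used = [False] * len(edges)
    i = 0
    while i < len(edges):
        end = min(i + window, len(edges))
        reserved: Dict[int, int] = {}          # root -> smallest reserving position
        plan: Dict[int, Tuple[int, int]] = {}  # position -> (smaller, larger) root
        # Phase 1: every external edge reserves the root of its smaller component.
        for j in range(i, end):
            ru, rv = find(edges[j][0]), find(edges[j][1])
            if ru == rv:
                continue
            if (size[ru], ru) > (size[rv], rv):
                ru, rv = rv, ru
            plan[j] = (ru, rv)
            reserved[ru] = min(reserved.get(ru, j), j)  # emulated write_min
        # Phase 2: first position whose reservation was lost to an earlier edge.
        stop = min((j for j, (s, _) in plan.items() if reserved[s] != j),
                   default=end)
        # Phase 3: link all edges of the conflict-free prefix [i, stop).
        for j in range(i, stop):
            if j in plan:
                small, large = plan[j]
                parent[small] = large
                size[find(large)] += size[small]
                used[j] = True
        i = stop  # the first edge of a window never fails, so stop > i
    return used


def sequential_used(n: int, edges: Sequence[Edge]) -> List[bool]:
    """Baseline: which edges a one-by-one sequential execution would use."""
    parent = list(range(n))

    def find(v: int) -> int:
        while parent[v] != v:
            v = parent[v]
        return v

    out = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        out.append(ru != rv)
        if ru != rv:
            parent[ru] = rv
    return out
```

For any window size, the set of used edges returned by `windowed_union_find` can be compared against `sequential_used`, which is the property that internal determinism asks for.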

Algorithm Analysis
Theorem 4.3. The Union-Find algorithm in Listing 2 is internally deterministic.

Proof. We proceed by contradiction. Assume that the sequential algorithm uses a different set of edges to form connected components. Consider the first edge (u, v) for which the result differs. It was either taken by the presented algorithm and not taken by the sequential one or vice versa.

Case 1: This edge was used by our algorithm but not by the sequential one. The fact that it was not used by the sequential algorithm means that it, together with the previous edges, forms a cycle. As it was the first difference, our algorithm acted incorrectly and created a cycle in the Union-Find forest. However, linking techniques impose a total order on all connected components and the algorithm links smaller roots to larger ones. The cycle should have all edges directed in the same direction since in one iteration only one outgoing edge can be added for each connected component. This means that obtaining a cycle is impossible, because in a directed cycle at least one edge (link) contradicts the total order at the start of iteration and our algorithm does not allow this.

Case 2: (u, v) was used by the sequential algorithm and not by our algorithm. Consider the iteration when the edge was processed. We know that, at each iteration, all edges inside the processed prefix either tried to make a reservation and succeeded, or did not try because the vertices are already connected. We know the former is not true, because by assumption, (u, v) was not used, so the latter should be true, which means that u and v were already connected before the iteration. However, we know that the edges added before the iteration cannot connect u and v, otherwise the sequential algorithm would make a cycle. We have obtained a contradiction.
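The claim of Theorem 4.3 can be checked empirically with a small simulation. The sketch below is not Listing 2 itself: it assumes that a write collision means two edges of the same window try to write the parent pointer of the same root, that the window is cut at the first such collision, and that the total order on roots is simply the vertex id (a stand-in for rank, size, or random priorities). The names `sequential_accepted` and `prefix_accepted` are hypothetical, and the parallel window is simulated by reading only the forest state from the start of the iteration.

```python
def find(parent, v):
    """Walk up to the root without modifying the forest."""
    while parent[v] != v:
        v = parent[v]
    return v

def sequential_accepted(n, edges):
    """Indices of edges that the one-by-one algorithm uses for linking."""
    parent = list(range(n))
    accepted = []
    for idx, (u, v) in enumerate(edges):
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[min(ru, rv)] = max(ru, rv)   # link lower id under higher id
            accepted.append(idx)
    return accepted

def prefix_accepted(n, edges, S):
    """Process windows of up to S edges, cutting each at the first collision."""
    parent = list(range(n))
    accepted, iterations, i = [], 0, 0
    while i < len(edges):
        iterations += 1
        reserved = set()      # roots whose parent pointer is already claimed
        links = []            # (low_root, high_root, edge index)
        stop = min(i + S, len(edges))
        for idx in range(i, stop):
            u, v = edges[idx]
            # Reads see only the state from the start of the iteration,
            # because writes are applied after the loop.
            ru, rv = find(parent, u), find(parent, v)
            if ru == rv:
                continue      # already connected: no write attempted
            low, high = min(ru, rv), max(ru, rv)
            if low in reserved:
                stop = idx    # write collision: defer this edge and the rest
                break
            reserved.add(low)
            links.append((low, high, idx))
        for low, high, idx in links:            # apply all reserved writes
            parent[low] = high
            accepted.append(idx)
        i = stop              # the first edge never collides, so i advances
    return accepted, iterations
```

On any input, `prefix_accepted` should return the same accepted edge indices as `sequential_accepted`, while using at least $\lceil |E|/S \rceil$ iterations; extra iterations correspond to collisions.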
Having proved that our algorithm is indeed internally deterministic, what is left is to bound the number of its iterations (Theorem 4.4).

Proof of Theorem 4.4. It is easy to see that the minimum number of required iterations in the absence of collisions is $|E|/S$, since at most $S$ edges are processed in each iteration. Every collision in the prefix can cause the algorithm to perform at most one more iteration. So, to bound the number of "extra" iterations, we can bound the total number of collisions across all considered prefixes of unprocessed edges.
Let $k$ be an arbitrary index in the edge sequence. We want to bound the probability of collision inside the window denoted by positions $[k, k+S)$ if all edges before $k$ were inserted in the data structure. For this reason, we define $Y_k$ to be an indicator random variable for the event "there is at least one collision inside window $[k, k+S)$ if all edges before $k$ were inserted in the data structure". Since the edge sequence is random, the inserted edges and the edges inside this window are random as well. For two fixed positions in the window, the probability of their collision is then exactly $p_k$ (see Subsection 2.2). By the union bound, the probability of a collision in the window is $E[Y_k] \le S^2 \cdot p_k$, since there are fewer than $S^2$ pairs of possibly colliding positions in the window.

Let $I$ be the set of unprocessed prefix start positions (i.e., the set of all observed values of $i$ in Listing 2). Similarly to the previous paragraph, we would want to say that the total expected number of collisions is bounded by $\sum_{i \in I} S^2 \cdot p_i$. However, the content of $I$ is correlated with the sequence of edges. For instance, if in some iteration there was a collision, we know that the unprocessed prefix of the next iteration starts with an edge that had a collision with one of the edges before. This means that the order of edges in the unprocessed tail is not uniformly random anymore and we cannot use the definitions from Subsection 2.2.

This issue has two possible solutions. The first one requires a small modification to Listing 2: instead of processing only the window [i, stop), the algorithm processes this window and then separately processes the task at position stop. This modification does not asymptotically increase the total work or the span of an iteration. However, the separately processed edge at position stop removes the correlation between the order of the remaining edges and the start position of the next prefix. This is because the edges strictly after position stop do not affect the current iteration, and thus, all possible permutations of the tail continue to be equiprobable.
The second solution does not require modifying Listing 2 but complicates the analysis. Consider the toy algorithm in Listing 3. It adds edges one by one and at the same time counts the number of pairwise collisions in all possible windows of size $S$. Note that this algorithm does not have the same problem as the algorithm in Listing 2, and all permutations of the tail of edges continue to be equiprobable, since the algorithm tries all windows of size $S$. This allows us to say that the probability of a collision in window $i$ is $E[Y_i]$, and thus, by the union bound, the total expected number of collisions is

$$\sum_{i=0}^{|E|-1} E[Y_i] \le \sum_{i=0}^{|E|-1} S^2 \cdot p_i = O(S^2 \cdot \log |V| \cdot \log |E|).$$

The last equality is Theorem 3.7. Finally, the last observation we have to make is that the number of collisions counted by the algorithm in Listing 3 is not less than the number of collisions that occurred in the algorithm in Listing 2, due to the fact that all windows considered in the latter algorithm were also considered by the former algorithm.

Proof of Corollary 4.5. We know that one iteration takes $O(S \cdot \alpha(|V|))$ total work. As a result, the expected total work is the expected number of iterations multiplied by the iteration work, which is $O\left(\left(\frac{|E|}{S} + S^2 \cdot \log |V| \cdot \log |E|\right) \cdot S \cdot \alpha(|V|)\right) = O\left(|E| \cdot \alpha(|V|) + S^3 \cdot \log |V| \cdot \log |E| \cdot \alpha(|V|)\right)$. For $S = O(|E|^{\frac{1}{3}-\varepsilon})$, the second term is dominated by the first.

Corollary 4.6. The number of iterations for the presented Union-Find algorithm is $O(|E|^{2/3} \cdot \log^{1/3} |V| \cdot \log^{1/3} |E|)$ in expectation when $S$ is chosen optimally. The expected parallel depth (span) of the algorithm is then $O(|E|^{2/3})$ up to polylogarithmic factors.

Proof. It is easy to prove by substitution that the asymptotic minimum of the expression in Theorem 4.4 is achieved when $S = \Theta\left(\left(\frac{|E|}{\log |V| \cdot \log |E|}\right)^{1/3}\right)$. The span directly follows from the bound on the number of iterations and the polylogarithmic span of one iteration.
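The substitution step behind the optimal prefix size can be made explicit. The following is a worked sketch only: $L$ is shorthand introduced here for $\log |V| \cdot \log |E|$, and the hidden constant of the $O(\cdot)$ term in Theorem 4.4 is assumed to be absorbed into $L$.

```latex
% Minimizing the iteration bound of Theorem 4.4 over the prefix size S.
% L := \log|V| \cdot \log|E| is notation assumed for this sketch.
\begin{align*}
f(S) &= \frac{|E|}{S} + S^{2} L, \\
f'(S) &= -\frac{|E|}{S^{2}} + 2 S L = 0
  \;\Longrightarrow\; S^{*} = \left(\frac{|E|}{2L}\right)^{1/3}, \\
f(S^{*}) &= \left(2^{1/3} + 2^{-2/3}\right) |E|^{2/3} L^{1/3}
  = \Theta\!\left(|E|^{2/3} L^{1/3}\right).
\end{align*}
```

Multiplying $f(S^{*})$ by the polylogarithmic span of a single iteration gives a span that is $|E|^{2/3}$ up to polylogarithmic factors.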
In addition, we observe that when the prefix size is $S$, up to $S$ processes can be leveraged by the algorithm. When combined with Corollary 4.5, we get Theorem 4.7. It is worth mentioning that for dense graphs this total work becomes linear in the number of edges. This is because the sequential Union-Find data structure admits another work bound for processing $|E|$ edges.

Iteration Dependence Depth. Another metric of interest is the iteration dependence depth [3]. Yet, the definition of this metric is not obvious for the Union-Find problem. A natural definition could be that there is a dependency between tasks if at some time there is a data race between them. Following this definition, Corollary 4.6 means that the expected iteration dependence depth is $O(|E|^{2/3} \cdot \log^{1/3} |V| \cdot \log^{1/3} |E|)$. A more general definition for dependence depth could be that exactly the same set of union(u, v) operations should succeed (i.e., return true). However, for this definition, the dependence depth is constant: failed union(u, v) operations depend on successful union operations that formed the path between u and v, while successful operations do not have any dependencies.

Simple Extension to MST
Lastly, since the algorithm is internally deterministic, it works similarly to a parallel version of Kruskal's algorithm on a sorted sequence of edges. We believe this is the first direct parallelization of Kruskal's algorithm. For example, in the parallel MST algorithm of Katsigiannis et al. [21], the main thread processes the whole edge sequence, while other threads help by checking and filtering out internal edges. In this case, the work of successful unite(u, v) operations is still essentially sequential and done by the same thread. Moreover, in the MST algorithm of Blelloch et al. [8], few edges are processed on average in one iteration for random and star graphs. By contrast, we can parallelize this process here. The key restriction is that, to be provably efficient, our approach requires the sorted sequence of edges to be in fact a random shuffle of edges.
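For reference, the sequential baseline that an internally deterministic execution has to reproduce is ordinary Kruskal on the sorted edge sequence. The sketch below is a standard textbook version, not code from this paper; the function name `kruskal` and the `(weight, u, v)` edge format are assumptions made for illustration.

```python
def kruskal(num_vertices, weighted_edges):
    """Return the spanning forest edges chosen by sequential Kruskal.

    weighted_edges is an iterable of (weight, u, v) tuples.
    """
    parent = list(range(num_vertices))
    size = [1] * num_vertices

    def find_root(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    chosen = []
    for w, u, v in sorted(weighted_edges):  # process edges by increasing weight
        ru, rv = find_root(u), find_root(v)
        if ru == rv:
            continue                        # edge would close a cycle
        if size[ru] < size[rv]:
            ru, rv = rv, ru
        parent[rv] = ru                     # link the smaller tree under the larger
        size[ru] += size[rv]
        chosen.append((w, u, v))
    return chosen
```

The set of edges in `chosen` is exactly the set of successful union operations, which is the quantity that internal determinism preserves.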

Lower Bounds
In the previous sections, we showed upper bounds for different metrics, leaving open the question of the tightness of these bounds. We now attempt to close this gap. Our proofs require the reasonable assumption that the Union-Find tie-breaker, when there are isomorphic connected components (in particular, of the same rank or size), acts randomly, i.e., compares random priorities, since it does not have any additional information about the graph structure.

Figure 3: An example of Union-Find execution on a cycle graph. Solid lines symbolize already added edges, while dotted lines are edges that are not inserted yet. The red edge was randomly chosen; its clockwise-next component is smaller than its neighbours, so the only edge it can collide with is the green edge.

Lower Bound on Collision Count
We start by providing a lower bound on the number of collisions (Theorem 3.7). Specifically, we show that the poly-logarithmic bound we provided cannot be improved by more than a logarithmic factor.

Theorem 5.1. For a cycle graph, we have $\sum_{t=0}^{|E|-1} p_t = \Omega(\log |V|)$.

Proof. Consider a cycle graph. In step $t$, there are exactly $|V| - t$ connected components and $|V| - t$ active edges. Note that since we added edges in a uniformly random order, any combination of $t$ edges from the cycle is equally probable to be picked. For every connected component, there are two neighbouring components (i.e., components adjacent through an active edge) in any step $t$ with $t \le |V| - 3$. An important observation is that if we contract every currently present connected component into a single vertex, the remaining graph is still a cycle. When we add a random edge, for the clockwise-next endpoint component there is a $\frac{1}{3}$ chance that it is smaller than both of its neighbouring components due to the symmetry of the graph (unless there are two or fewer components in total, but this occurs only in the last 3 steps). If it is smaller than its neighbours, then there is exactly one edge (out of $|V| - t$ edges) in the graph that can cause a collision with the chosen one: the edge between this endpoint and its other neighbouring component (Figure 3). So, we get $p_t \ge \frac{1}{3} \cdot \frac{1}{|V| - t - 1}$ for every such step, and summing over $t$ yields a harmonic sum (Equation 1).

Lower Bound on Number of Iterations
A significantly more involved analysis is required for the parallel algorithm from Section 4.2. The challenge is that we cannot use the symmetry argument to show that the probability of collision for an adjacent pair of edges is $\frac{1}{3}$, since the subsequences of edges processed in each iteration do not have fixed positions, and instead depend on when a collision happened in the previous iteration. We begin with a technical lemma, which is proved in Appendix A.

Lemma 5.3. Consider a random permutation of $1, 2, \ldots, N$. Let $M$ be the number of local minima in this permutation, i.e., the number of elements $S_i$ smaller than both their neighbours $S_{i-1}$ and $S_{i+1}$ (supposing that the end elements $S_1$ and $S_N$ are also adjacent). Then, it holds that $\Pr\left[M \le \frac{N-3}{18}\right] \le \frac{24}{N-3}$.
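Lemma 5.3 can be sanity-checked numerically. The sketch below is an illustrative Monte Carlo experiment, not part of the proof; the function names and the seed are arbitrary choices made here. By symmetry, each element is the smallest among itself and its two cyclic neighbours with probability $\frac{1}{3}$ when $N \ge 3$, so the empirical mean of $M$ should be close to $N/3$, far above the threshold $\frac{N-3}{18}$ used in the lemma.

```python
import random

def count_circular_local_minima(perm):
    """Count elements smaller than both of their cyclic neighbours."""
    n = len(perm)
    return sum(
        1
        for i in range(n)
        if perm[i] < perm[i - 1] and perm[i] < perm[(i + 1) % n]
    )

def estimate_tail(n, trials, seed=0):
    """Empirical mean of M and frequency of the event M <= (n - 3) / 18."""
    rng = random.Random(seed)
    total, bad = 0, 0
    for _ in range(trials):
        perm = list(range(1, n + 1))
        rng.shuffle(perm)
        m = count_circular_local_minima(perm)
        total += m
        bad += m <= (n - 3) / 18
    return total / trials, bad / trials
```

Calling `estimate_tail(300, 200)` returns the empirical mean of $M$ together with the observed frequency of the bad event, which can be compared against the bound $\frac{24}{N-3}$.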
The next lemma will be the core of the lower bound proof. Similarly to Theorem 5.1, it analyses the collision probability for a cycle graph. Intuitively, it shows that for a prefix of edges of size $\Omega(\sqrt{N \log N})$ the probability of collision is almost 1. The (quite complex) proof is also provided in Appendix A.

Lemma 5.4. Consider a sequence of randomly shuffled cycle graph edges of size $N$ and its prefix of size $W \ge C \cdot \sqrt{N \log N}$, where $C$ is some constant which will be made clear in the proof. Suppose there are $M = pN$ pairs of adjacent colliding edges in this graph, where $0 < p \le \frac{1}{3}$ is a constant. Then the probability that there is no collision in this prefix is $O\left(\frac{1}{N^2}\right)$.
We can now finally state and prove our main lower bound result.

Theorem 5.5. The number of iterations of any parallel algorithm processing a conflict-free prefix of edges (in particular, of the algorithm in Listing 2) is $\Omega\left(\sqrt{|E| / \log |E|}\right)$ with high probability for a cycle graph.

Proof. Again consider a cycle graph. Initially, when the Union-Find data structure is empty, all connected components are of size and rank 1. In this case, we assumed that linking by rank or by size uses random priorities to break ties. As a result, the number of vertices adjacent to two colliding edges is exactly the number of local minima in the random priorities permutation. Consequently, by Lemma 5.3, the number of pairs of colliding edges is at least $p|E|$ for some constant $p > 0$ with probability of failure $O\left(\frac{1}{|E|}\right)$.

We will prove that $\Omega\left(\sqrt{|E| / \log |E|}\right)$ iterations are needed to process the first $\frac{p}{3}|E|$ edges. When an edge is processed, at most 2 pairs of colliding edges (adjacent to the edge) can stop colliding, since their adjacent connected components change. Thus, after $\frac{p}{3}|E|$ edges there will still be at least $p|E| - 2 \cdot \frac{p}{3}|E| = \frac{p}{3}|E|$ pairs of colliding edges (which are still adjacent to connected components of size 1).
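Let $C$ be the constant from Lemma 5.4 for the case when there are at least $\frac{p}{3}|E|$ pairs of colliding edges, and let $W = \Theta(\sqrt{|E| \log |E|})$ be a prefix size such that Lemma 5.4 applies to $W - 1$ edges. Let $A_i$ be the event that in the $i$-th iteration, the processed prefix is of size more than $W$. Observe that to process $\frac{p}{3}|E|$ edges in fewer than $I = \frac{p|E|}{3W}$ iterations, at least one of the events $A_i$ must occur. We first remark that $\Pr\left[\bigcup_{i \le I} A_i\right] \le \sum_{i \le I} \Pr[A_i \mid \bar{A}_1, \ldots, \bar{A}_{i-1}]$.

Consider the probability of such an event $\Pr[A_i \mid \bar{A}_1, \ldots, \bar{A}_{i-1}]$. Since the previous $A_j$ ($j < i$) are assumed not to occur, fewer than $\frac{p}{3}|E|$ edges were processed before. The first edge in the next processing window may have had a collision with some edge in the previous iteration. However, no information is known about the edges after it or their order, so we can still assume that the order is uniformly random. This lets us apply Lemma 5.4 to the next $W - 1$ edges after the first one to obtain that the probability of no collision among these edges is $O\left(\frac{1}{|E|^2}\right)$, i.e., with probability at least $1 - O\left(\frac{1}{|E|^2}\right)$ no more than $W$ edges can be processed in the $i$-th iteration due to data dependencies. Since it is true for any fixed possible start position of the processing window, it is also true in general, i.e., $\Pr[A_i \mid \bar{A}_1, \ldots, \bar{A}_{i-1}] = O\left(\frac{1}{|E|^2}\right)$. The sum above is therefore $O\left(\frac{I}{|E|^2}\right)$, which completes the proof once we substitute the value of $I$.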
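The final substitution can be spelled out as follows. This is a sketch under stated assumptions: the window size is taken to be $W = C\sqrt{|E| \log |E|}$, matching the prefix size of Lemma 5.4 up to constants, and $I$ is read as the number of iterations needed when every iteration processes at most $W$ edges.

```latex
% Substituting I into the union bound over the events A_i.
% W = C \sqrt{|E| \log |E|} is the assumed window size from Lemma 5.4.
\begin{align*}
I &= \frac{(p/3)\,|E|}{W}
   = \frac{p\,|E|}{3C\sqrt{|E| \log |E|}}
   = \frac{p}{3C}\sqrt{\frac{|E|}{\log |E|}}, \\
\Pr\Big[\bigcup_{i \le I} A_i\Big]
  &\le I \cdot O\!\left(\frac{1}{|E|^{2}}\right)
   = O\!\left(|E|^{-3/2}\right).
\end{align*}
```

So, with high probability, none of the first $I$ iterations processes more than $W$ edges, and at least $I = \Omega\left(\sqrt{|E| / \log |E|}\right)$ iterations are required.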
Now we will prove that this sum is $O\left(\frac{1}{N^2}\right)$. To do this, we first show that a small prefix of this sum is $O\left(\frac{1}{N^2}\right)$ due to small binomial coefficients, and then show that the rest of the sum is $O\left(\frac{1}{N^2}\right)$ due to arising collision probabilities.

Consider $\sum_{i=0}^{lW} \binom{W}{i} V_{M-i}$ for some constant $0 < l < \frac{1}{2}$, which will be chosen later. By Lemma A.1, $\sum_{i=0}^{lW} \binom{W}{i} \le 2^{H(l)W - \ln W/2 + O(1)}$. By combining this inequality with the bound on $V_{M-i}$ for $i \le lW$, we get that the sum is at most $2^{\left(H(l) - (1-l)\log\frac{1}{1-2p}\right)W - \ln W/2 + O(1)}$. As a result, we can choose the constant $l$ so that $H(l) - (1-l)\log\frac{1}{1-2p} < 0$, which would mean that this sum decreases exponentially with $W$ and in particular is $O\left(\frac{1}{N^2}\right)$.

What is left is to bound the rest of the sum, i.e., the terms with $i \ge lW$. If we remove the first three quarters of these multipliers (since the corresponding fractions are $\le 1$), we obtain an upper bound whose last step follows from Bernoulli's inequality. When $i \ge lW$, $\frac{V_{M-i}}{\tilde{V}_{M-i}} \le e^{-(lW)^2/16M} = e^{-l^2C^2 N \log N/16pN} = e^{-l^2C^2 \log N/16p}$. Now, given that $l$ and $p$ are constants, we can fix the constant $C$ so that $-l^2C^2 \cdot \ln 2/16p < -2$. This would mean that for $i \ge lW$, this ratio is $O\left(\frac{1}{N^2}\right)$. Summing up, the probability that there is no collision in the prefix is $T(0, M) = O\left(\frac{1}{N^2}\right)$.

For some points, the adaptive algorithm is better both in terms of the number of iterations and the total work, which may mean that for some graphs it is better to change the prefix size $S$ during the execution, rather than to keep it fixed.
One of the main properties of the presented parallel algorithm is that it is internally deterministic. Yet, it is unclear whether we sacrifice Union-Find performance to get this property. To answer this question, we compare the algorithm to the concurrent Union-Find algorithm of Jayanti et al. [17], in particular to its fast implementation from [19, 15], and to the parallel algorithm of Blelloch et al. [8]. We emphasize that this is not a fair comparison, since neither of these algorithms is internally deterministic. When implementing our algorithm, we use the same optimizations as in [10]. Specifically, we do not implement the BFS-like algorithm for path compaction and instead compact paths during searches using Compare-And-Set, as in the concurrent algorithm.

Performance Comparison
The performance benchmark results are available in Figure 7. First of all, we can see that our adaptive parallel algorithm matches or outperforms the non-adaptive algorithm with fixed prefix size $S$. As expected, the heuristic algorithm of Blelloch et al. has better performance than our algorithm, since our algorithm has the additional restriction of being internally deterministic, but the performance differences are usually small. The well-optimized fully-concurrent approach of Jayanti and Tarjan is superior to the parallel algorithms on sparse graphs (USA roads, the cycle graph, and the random sparse graph). The reason is that, for sparse graphs, there is almost no contention between different threads, and thus, the concurrent algorithm has almost no synchronization cost. In dense graphs, the asynchronous nature of the concurrent algorithm does not provide any benefits over the parallel algorithms and incurs additional synchronization work.

$\sum_{t=0}^{|E|-1} p_t$. Specifically, we show that, for any linking strategy that bounds the maximum forest depth to $O(\log |V|)$, the sum of collision probabilities over all edges will satisfy $\sum_{t=0}^{|E|-1} p_t = O(\log |V| \log |E|)$.

Theorem 1.1. There exists an internally-deterministic parallel Union-Find algorithm for the CRCW PRAM model with priority writes that has $O(|E| \cdot \alpha(|V|))$ expected total work and O(|E|

bool union(u, v):
    u := find_root(u)
    v := find_root(v)
    if u == v:
        return false          // Already in the same component
    if u.size < v.size:       // Which component is larger?
        u.parent = v          // Link u to v
        v.size += u.size
    else:
        v.parent = u          // Link v to u
        u.size += v.size
    return true

Listing 1: union operation in the Union-Find data structure with the linking by size technique.
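A runnable counterpart of Listing 1 may look as follows. This is a minimal sketch: the class name `UnionFind`, the array-based layout, and the two-pass path compaction inside `find_root` are choices made here for illustration and are not prescribed by the listing.

```python
class UnionFind:
    """Sequential Union-Find with linking by size, mirroring Listing 1."""

    def __init__(self, n):
        self.parent = list(range(n))   # parent[v] == v marks a root
        self.size = [1] * n            # valid for roots only

    def find_root(self, v):
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:  # second pass: path compaction
            nxt = self.parent[v]
            self.parent[v] = root
            v = nxt
        return root

    def union(self, u, v):
        u = self.find_root(u)
        v = self.find_root(v)
        if u == v:
            return False               # already in the same component
        if self.size[u] < self.size[v]:
            self.parent[u] = v         # link u to v
            self.size[v] += self.size[u]
        else:
            self.parent[v] = u         # link v to u
            self.size[u] += self.size[v]
        return True
```

A `union` call returns `True` exactly when the edge merges two components, which is the notion of a "successful" operation used throughout the analysis.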

Theorem 3.7. For a uniform random edge ordering, and any linking Union-Find strategy that bounds union-find forest depth to $O(\log |V|)$, it holds that $\sum_{t=0}^{|E|-1} p_t = O(\log |V| \cdot \log |E|)$, where $|V|$ is the number of vertices and $|E|$ is the number of edges.

$\sum_{t=0}^{|E|-1} p_t = O(\log |V| \cdot \log |E|)$ given a random order of edges and any linking Union-Find strategy that bounds forest depth to $O(\log |V|)$.

$\sum_{t=0}^{|E|/2} p_t \le C \cdot \log |V|$, i.e., for the first half of edges the expected number of collisions is just $O(\log |V|)$, which may mean that the factor of $\log |E|$ is an artefact of the current analysis.

Theorem 4.4. The expected number of iterations of the Union-Find algorithm in Listing 2 is $\frac{|E|}{S} + O(S^2 \cdot \log |V| \cdot \log |E|)$, if the order of edges is uniformly random.

int calculate_collisions(edges):
    i := 0
    collisions := 0
    while i < |edges|:
        collisions += #pairwise collisions in [i, i + S)
        Union-Find.union(edges[i])    // insert only one edge
        i := i + 1
    return collisions

Listing 3: Toy algorithm that helps to prove Theorem 4.4. It calculates the number of collisions across all possible windows of size $S$.

Both approaches show that only $\sum_{i=0}^{|E|-1} S^2 \cdot p_i = O(S^2 \cdot \log |V| \cdot \log |E|)$ expected collisions happen in the algorithm, and consequently, the number of extra iterations is $O(S^2 \cdot \log |V| \cdot \log |E|)$.

Corollary 4.5. The proposed Union-Find algorithm is work-efficient in expectation when $S = O(|E|^{\frac{1}{3}-\varepsilon})$ for any constant $\varepsilon > 0$.
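The toy algorithm of Listing 3 can be turned into a small experiment. The sketch below rests on an assumption that is not spelled out in this excerpt: two edges are counted as colliding when, in the current forest, both would write the parent pointer of the same root, where the written root is the smaller one under linking by size with ties broken by vertex id. The function name `count_window_collisions` is hypothetical.

```python
from collections import Counter

def count_window_collisions(n, edges, S):
    """Count pairwise collisions over all windows [i, i + S), as in Listing 3."""
    parent = list(range(n))
    size = [1] * n

    def find_root(v):
        while parent[v] != v:
            v = parent[v]
        return v

    def written_root(u, v):
        """Root whose parent pointer the edge would write, or None."""
        ru, rv = find_root(u), find_root(v)
        if ru == rv:
            return None
        return min(ru, rv, key=lambda r: (size[r], r))

    collisions = 0
    for i in range(len(edges)):
        targets = Counter()
        for u, v in edges[i:i + S]:
            r = written_root(u, v)
            if r is not None:
                targets[r] += 1
        # c edges aiming at the same root form c * (c - 1) / 2 colliding pairs
        collisions += sum(c * (c - 1) // 2 for c in targets.values())

        u, v = edges[i]                 # insert only one edge
        ru, rv = find_root(u), find_root(v)
        if ru != rv:
            small = min(ru, rv, key=lambda r: (size[r], r))
            large = rv if small == ru else ru
            parent[small] = large
            size[large] += size[small]
    return collisions
```

Running this on randomly shuffled edge sequences for several values of $S$ gives an empirical count that can be compared against the $O(S^2 \cdot \log |V| \cdot \log |E|)$ bound.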

Theorem 4.7. There exists an internally-deterministic parallel Union-Find algorithm for the CRCW PRAM model with priority write that has $O(|E| \cdot \alpha(|V|))$ expected total work on a randomly shuffled sequence of edges and scales perfectly for up to $O(|E|^{\frac{1}{3}-\varepsilon})$ processes.

Corollary 4.8. There exists an MST algorithm for the CRCW PRAM model with priority write that has $O(|E| \cdot \alpha(|V|))$ expected total work, when random edge weights are generated independently and the edge sequence is already sorted, and it scales for $T = O(|E|^{\frac{1}{3}-\varepsilon})$ parallel threads, for any constant $\varepsilon > 0$.

$$\sum_{t=0}^{|V|-3} p_t \ge \sum_{t=0}^{|V|-3} \frac{1}{3} \cdot \frac{1}{|V|-t-1} = \Omega(\log |V|) \quad \text{(the sum of a harmonic progression)} \qquad (1)$$

This implies the following for the model in Section 4.

Theorem 5.2. The expected write contention for two threads in the concurrent model is $\Omega(\log |V|)$ for cycle graphs.

Proof. As we know from Theorem 5.1, for a cycle graph $p_t \ge \frac{1}{3} \cdot \frac{1}{|V|-t-1}$. On the other hand, from Theorem 4.1, the expected number of write contentions is $\sum_{t=0}^{|E|/2} p_{2t}$. As a result, the required bound is the sum of the harmonic series over every second element and is still $\Theta(\log |V|)$.
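The harmonic-sum argument is easy to check numerically. The sketch below evaluates the per-step lower bound $\frac{1}{3} \cdot \frac{1}{|V|-t-1}$ over the steps $t \le |V| - 3$, either for every step or for every second step; the function name `cycle_lower_bound` and the chosen range of $t$ are assumptions made for this illustration.

```python
from math import log

def cycle_lower_bound(num_vertices, step=1):
    """Sum of (1/3) * 1/(|V| - t - 1) over t = 0, step, 2*step, ... <= |V| - 3."""
    n = num_vertices
    return sum(1 / (3 * (n - t - 1)) for t in range(0, n - 2, step))

full = cycle_lower_bound(1000)          # every step, as in Equation (1)
every_second = cycle_lower_bound(1000, step=2)
print(full, every_second, log(1000) / 3)
```

Taking only every second term roughly halves the sum, so both quantities grow logarithmically in $|V|$, in line with the proof of Theorem 5.2.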

Figure 4: Performance comparison of parallel and concurrent Union-Find algorithms, relative to an optimized sequential baseline. Our algorithm is competitive with prior proposals with weaker guarantees; all algorithms experience a performance drop at 64 threads, when computation is performed across two NUMA sockets. Please see the Appendix for full results.

Figure 6: Comparison of the total work in the adaptive parallel algorithm and the non-adaptive algorithm for various prefix sizes, relative to the amount of work in the sequential algorithm.

Figure 7: Performance comparison of the parallel and concurrent Union-Find algorithms. The baseline is the sequential algorithm. All algorithms experience a performance drop for more than 64 threads, because ≤ 32 threads can fit in a single NUMA socket on the benchmark machine. The prefix size $S$ for our parallel algorithm and for the algorithm of Blelloch et al. was chosen optimally based on experiment results.
To bound $\sum_t p_t$, we can bound the potential function instead. We will do this in the next lemma. Note that for randomized linking strategies (e.g., linking by random priorities) the next bound holds in expectation over the randomness used in linking.

Lemma 3.6. For any linking strategy bounding maximal depth to $O(\log |V|)$, we have $\Phi_{|E|-1} = O(|E| \cdot \log |V| \cdot \log |E|)$.

Proof. When forest depth is $O(\log |V|)$, it is easy to see that every edge rank is $O(\log |V|)$, and thus, the sum of edge ranks is $O(|E| \cdot \log |V|)$. The problem is the $\frac{|E|}{|E|-t}$ edge rank multiplier, which ranges from 1 to $|E|$. That is why we "froze" the ranks of edges upon their processing. Observe the edge rank multipliers for edges added in different steps. For the first $\frac{|E|}{2}$ edges, this multiplier is at most $\frac{|E|}{|E|-|E|/2} \le 2$. Similarly, for the next $\frac{|E|}{4}$ edges, this multiplier is at most $\frac{|E|}{|E|-3|E|/4} \le 4$. By continuing this progression, we get the following for forest depth $\le C \cdot \log |V|$:

$$\Phi_{|E|-1} \le \sum_{j=1}^{\log |E|} \frac{|E|}{2^j} \cdot 2^j \cdot C \cdot \log |V| = O(|E| \cdot \log |V| \cdot \log |E|).$$